TAU-bench
Category: Agentic
Tool-Agent-User benchmark from Sierra AI that evaluates AI agents on realistic customer service scenarios requiring multi-turn tool use across airline and retail domains.
Models Tested: 3
Best Score: 69.8%
Median Score: 58.8%
Scoring: percentage
Introduced: 2024-06
Maintainer: Sierra AI
Leaderboard (3 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Claude 3.7 Sonnet | Anthropic | 69.8% |
| 🥈 | Claude 3.5 Sonnet | Anthropic | 58.8% |
| 🥉 | GPT-4o | OpenAI | 48% |
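
The summary figures at the top of the page (models tested, best score, median score) can be reproduced from the three leaderboard rows. The following is a minimal sketch of that arithmetic. The `scores` dictionary and the `summarize` helper are illustrative names, not part of TAU-bench or any Sierra AI tooling, and the values are copied from the table above.

```python
from statistics import median

# Scores in percent, copied from the leaderboard table (illustrative structure).
scores = {
    "Claude 3.7 Sonnet": 69.8,
    "Claude 3.5 Sonnet": 58.8,
    "GPT-4o": 48.0,
}


def summarize(scores):
    """Return the count, best score, and median score for a leaderboard."""
    values = list(scores.values())
    return {
        "models_tested": len(values),
        "best": max(values),
        "median": median(values),  # middle value for an odd number of entries
    }


summary = summarize(scores)
print(summary)
```

With only three entries, the median is simply the middle row of the table, which is why it matches the second-place score.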