Longterm Wiki

TAU-bench

Agentic

Tool-Agent-User benchmark from Sierra AI. It evaluates AI agents on realistic customer service scenarios that require multi-turn tool use in the airline and retail domains.

Models Tested: 3
Best Score: 69.8%
Median Score: 58.8%
Scoring: percentage
Introduced: 2024-06
Maintainer: Sierra AI

Leaderboard (3 models)

#    Model               Developer   Score
🥇   Claude 3.7 Sonnet   Anthropic   69.8%
🥈   Claude 3.5 Sonnet   Anthropic   58.8%
🥉   GPT-4o              OpenAI      48%
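
The summary figures at the top of the page (models tested, best score, median score) appear to follow directly from the three leaderboard entries. As a minimal sketch of that arithmetic, assuming the median is the ordinary statistical median of the listed scores, the values can be rechecked as below. The `summarize` helper and its field names are hypothetical and are not part of TAU-bench or the wiki.

```python
from statistics import median

# Scores (in percent) copied from the leaderboard on this page.
scores = {
    "Claude 3.7 Sonnet": 69.8,
    "Claude 3.5 Sonnet": 58.8,
    "GPT-4o": 48.0,
}


def summarize(scores):
    """Recompute the page's summary figures from a model -> score mapping.

    Hypothetical helper for illustration only.
    """
    best_model = max(scores, key=scores.get)
    return {
        "models_tested": len(scores),
        "best_model": best_model,
        "best_score": scores[best_model],
        # With an odd number of entries, the median is the middle value.
        "median_score": median(scores.values()),
    }


summary = summarize(scores)
print(summary)
```

With three entries, the median is simply the middle score, which is why it coincides with the second-ranked model's result.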