Longterm Wiki

BBH

Reasoning
BIG-Bench Hard (BBH) is a curated subset of 23 challenging tasks from BIG-Bench on which earlier language models failed to outperform the average human rater. It tests multi-step reasoning.
Models Tested: 9
Best Score: 93.1%
Median Score: 86.8%
Scoring: accuracy
Introduced: 2022-10
Maintainer: Google / Stanford

Leaderboard (9 models)

#    Model               Developer          Score
🥇   Claude 3.5 Sonnet   Anthropic          93.1%
🥈   Llama 3.1           Meta AI (FAIR)     89%
🥉   DeepSeek V3         DeepSeek           87.6%
4    DeepSeek Models     DeepSeek           87.5%
5    Claude 3 Opus       Anthropic          86.8%
6    Gemini 1.5 Flash    Google DeepMind    85.5%
7    Gemini 1.0 Ultra    Google DeepMind    83.6%
8    GPT-4               OpenAI             83.1%
9    Mistral Large 2     Mistral AI         78.8%
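
The summary figures at the top of the page (models tested, best score, median score) can be recomputed from the leaderboard rows. The sketch below is illustrative only: the dictionary `scores` and the variable names are assumptions made for this example, not part of any published BBH tooling, and the values are copied by hand from the table above.

```python
from statistics import median

# Accuracy scores (%) copied from the leaderboard above.
# The dictionary layout is a hypothetical convenience for this example.
scores = {
    "Claude 3.5 Sonnet": 93.1,
    "Llama 3.1": 89.0,
    "DeepSeek V3": 87.6,
    "DeepSeek Models": 87.5,
    "Claude 3 Opus": 86.8,
    "Gemini 1.5 Flash": 85.5,
    "Gemini 1.0 Ultra": 83.6,
    "GPT-4": 83.1,
    "Mistral Large 2": 78.8,
}

models_tested = len(scores)              # number of leaderboard rows
best_score = max(scores.values())        # highest accuracy
median_score = median(scores.values())   # middle value of the sorted scores

print(f"Models Tested: {models_tested}")
print(f"Best Score: {best_score}%")
print(f"Median Score: {median_score}%")
```

With nine entries, the median is the fifth-highest score, which is why it coincides with a single model's result (Claude 3 Opus at 86.8%) rather than an average of two rows.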