
BBH

Reasoning

BIG-Bench Hard (BBH) — a curated subset of 23 challenging tasks from BIG-Bench on which prior language-model evaluations failed to outperform the average human rater. Tests multi-step reasoning.

Models tested: 8
Best score: 93.1%
Median score: 86.15%
Scoring: accuracy
Introduced: 2022-10
Maintainer: Google / Stanford
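BBH tasks are scored as accuracy: the fraction of items where the model's final answer matches the reference. A minimal sketch of exact-match accuracy; the predictions and gold answers below are illustrative, not actual benchmark data:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical model outputs vs. gold answers for a few BBH-style items.
preds = ["(A)", "(C)", "True", "(B)"]
golds = ["(A)", "(B)", "True", "(B)"]
print(exact_match_accuracy(preds, golds))  # → 0.75
```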

Leaderboard (8 models)

| # | Model | Developer | Score |
|---|-------|-----------|-------|
| 🥇 | Claude 3.5 Sonnet | Anthropic | 93.1% |
| 🥈 | Llama 3.1 | Meta AI (FAIR) | 89% |
| 🥉 | DeepSeek V3 | DeepSeek | 87.6% |
| 4 | Claude 3 Opus | Anthropic | 86.8% |
| 5 | Gemini 1.5 Flash | Google DeepMind | 85.5% |
| 6 | Gemini 1.0 Ultra | Google DeepMind | 83.6% |
| 7 | GPT-4 | OpenAI | 83.1% |
| 8 | Mistral Large 2 | Mistral AI | 78.8% |
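The summary statistics above follow directly from the leaderboard scores; a quick sketch of how the best and median values are derived (with an even number of entries, the median is the mean of the two middle scores):

```python
from statistics import median

# Scores from the BBH leaderboard above (percent accuracy).
scores = {
    "Claude 3.5 Sonnet": 93.1,
    "Llama 3.1": 89.0,
    "DeepSeek V3": 87.6,
    "Claude 3 Opus": 86.8,
    "Gemini 1.5 Flash": 85.5,
    "Gemini 1.0 Ultra": 83.6,
    "GPT-4": 83.1,
    "Mistral Large 2": 78.8,
}

best = max(scores.values())     # 93.1
med = median(scores.values())   # (85.5 + 86.8) / 2 = 86.15
print(f"Best: {best}%  Median: {med}%")
```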