BBH
Category: Reasoning

BIG-Bench Hard — a curated subset of 23 challenging BIG-Bench tasks on which language models had previously failed to outperform the average human rater. Tests multi-step reasoning.
- Models tested: 8
- Best score: 93.1%
- Median score: 86.15%
- Scoring: accuracy
- Introduced: October 2022
- Maintainer: Google / Stanford
Leaderboard (8 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Claude 3.5 Sonnet | Anthropic | 93.1% |
| 🥈 | Llama 3.1 | Meta AI (FAIR) | 89.0% |
| 🥉 | DeepSeek V3 | DeepSeek | 87.6% |
| 4 | Claude 3 Opus | Anthropic | 86.8% |
| 5 | Gemini 1.5 Flash | Google DeepMind | 85.5% |
| 6 | Gemini 1.0 Ultra | Google DeepMind | 83.6% |
| 7 | GPT-4 | OpenAI | 83.1% |
| 8 | Mistral Large 2 | Mistral AI | 78.8% |
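The summary statistics above follow directly from the leaderboard: a minimal sketch reproducing the best and median scores from the eight listed accuracies (using only the numbers in the table):

```python
from statistics import median

# Leaderboard accuracies (%) as listed in the table above
scores = [93.1, 89.0, 87.6, 86.8, 85.5, 83.6, 83.1, 78.8]

best = max(scores)
med = median(scores)  # even count: mean of the two middle values, (85.5 + 86.8) / 2

print(f"Best: {best}%")    # Best: 93.1%
print(f"Median: {med}%")
```

With an even number of entries, the median is the average of the 4th and 5th sorted scores, which gives the 86.15% shown in the summary.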