BBH

Reasoning

BIG-Bench Hard (BBH) is a curated subset of 23 challenging BIG-Bench tasks on which earlier language models failed to outperform the average human rater. It tests multi-step reasoning.

Models Tested

9

Best Score

93.1%

Median Score

86.8%

Scoring: accuracy

Introduced: 2022-10

Maintainer: Google / Stanford
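
The "Scoring: accuracy" entry above can be illustrated with a minimal sketch. It assumes exact string matching between a model's final answer and the reference answer after trimming whitespace; the function name `accuracy` and the sample answers are hypothetical and are not taken from any official BBH evaluation harness, which may normalize answers differently.

```python
def accuracy(predictions, targets):
    """Return the percentage of predictions that exactly match their target.

    Assumption: answers are compared as whitespace-trimmed strings.
    A real harness may apply further normalization before comparing.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot score an empty set of examples")
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return 100.0 * correct / len(targets)


# Hypothetical example: two of the three answers match the reference.
example_score = accuracy(["(A)", "True", "no"], ["(A)", "False", "no"])
print(f"{example_score:.1f}%")
```

A reported score such as 93.1% would then correspond to the share of benchmark questions answered correctly under whatever matching rule the evaluator uses.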

Leaderboard (9 models)

#	Model	Developer	Score
🥇	Claude 3.5 Sonnet	Anthropic	93.1%
🥈	Llama 3.1	Meta AI (FAIR)	89%
🥉	DeepSeek V3	DeepSeek	87.6%
4	DeepSeek Models	DeepSeek	87.5%
5	Claude 3 Opus	Anthropic	86.8%
6	Gemini 1.5 Flash	Google DeepMind	85.5%
7	Gemini 1.0 Ultra	Google DeepMind	83.6%
8	GPT-4	OpenAI	83.1%
9	Mistral Large 2	Mistral AI	78.8%
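
As a quick check, the Best Score and Median Score shown above can be recomputed from the leaderboard rows. The sketch below simply copies the nine scores from the table into a dictionary (the variable names are illustrative) and uses Python's standard `statistics.median`.

```python
from statistics import median

# Scores copied from the leaderboard table, in percent.
scores = {
    "Claude 3.5 Sonnet": 93.1,
    "Llama 3.1": 89.0,
    "DeepSeek V3": 87.6,
    "DeepSeek Models": 87.5,
    "Claude 3 Opus": 86.8,
    "Gemini 1.5 Flash": 85.5,
    "Gemini 1.0 Ultra": 83.6,
    "GPT-4": 83.1,
    "Mistral Large 2": 78.8,
}

best_model = max(scores, key=scores.get)  # model with the highest score
best_score = scores[best_model]
median_score = median(scores.values())    # middle value of the nine scores

print(f"Models tested: {len(scores)}")
print(f"Best score: {best_score}% ({best_model})")
print(f"Median score: {median_score}%")
```

With nine entries the median is the fifth-ranked score, which matches the 86.8% listed in the summary.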