Humanity's Last Exam

Reasoning

A benchmark of 2,500+ expert-level questions across dozens of academic disciplines, designed to be the hardest public AI evaluation. Questions are contributed by domain experts worldwide.

Models Tested: 9

Best Score: 44.4%

Median Score: 11%

Scoring: accuracy

Introduced: 2025-01

Maintainer: Scale AI / Center for AI Safety

Leaderboard (9 models)

#	Model	Developer	Score
🥇	Gemini	Google DeepMind	44.4%
🥈	Claude Opus 4.5	Anthropic	43.2%
🥉	o3	OpenAI	19.8%
4	Gemini 2.5 Pro	Google DeepMind	18.8%
5	Gemini 2.5 Flash	Google DeepMind	11%
6	DeepSeek R1	DeepSeek	8.6%
7	o1	OpenAI	8%
8	Claude 3.5 Sonnet	Anthropic	4.1%
9	GPT-4o	OpenAI	3.3%
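
The summary figures for this benchmark (models tested, best score, median score) can be recomputed from the leaderboard rows. The following is a minimal sketch, not official tooling: the names LEADERBOARD and summarize are hypothetical, and it assumes every score in the table is an accuracy percentage, including the two top entries.

```python
from statistics import median

# Rows copied from the leaderboard: (model, developer, score in percent).
# Assumption: all scores are accuracy percentages on the same scale.
LEADERBOARD = [
    ("Gemini", "Google DeepMind", 44.4),
    ("Claude Opus 4.5", "Anthropic", 43.2),
    ("o3", "OpenAI", 19.8),
    ("Gemini 2.5 Pro", "Google DeepMind", 18.8),
    ("Gemini 2.5 Flash", "Google DeepMind", 11.0),
    ("DeepSeek R1", "DeepSeek", 8.6),
    ("o1", "OpenAI", 8.0),
    ("Claude 3.5 Sonnet", "Anthropic", 4.1),
    ("GPT-4o", "OpenAI", 3.3),
]


def summarize(rows):
    """Return the model count, best score, and median score for the rows."""
    scores = [score for _model, _developer, score in rows]
    return {
        "models_tested": len(scores),
        "best_score": max(scores),
        "median_score": median(scores),
    }


summary = summarize(LEADERBOARD)
print(summary)
```

With nine entries, the median is simply the fifth-ranked score, so half of the listed models fall at or below the middle row of the table.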