Longterm Wiki

Humanity's Last Exam

Reasoning
A benchmark of 2,500+ expert-level questions across dozens of academic disciplines, designed to be the hardest public AI evaluation. Questions are contributed by domain experts worldwide.
Models Tested: 9
Best Score: 44.4%
Median Score: 11%
Scoring: accuracy
Introduced: 2025-01
Maintainer: Scale AI / Center for AI Safety

Leaderboard (9 models)

#   Model              Developer        Score
🥇  Gemini             Google DeepMind  44.4%
🥈  Claude Opus 4.5    Anthropic        43.2%
🥉  o3                 OpenAI           19.8%
4   Gemini 2.5 Pro     Google DeepMind  18.8%
5   Gemini 2.5 Flash   Google DeepMind  11%
6   DeepSeek R1        DeepSeek         8.6%
7   o1                 OpenAI           8%
8   Claude 3.5 Sonnet  Anthropic        4.1%
9   GPT-4o             OpenAI           3.3%
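
The summary figures at the top of the page (9 models tested, best score 44.4, median score 11) can be reproduced from the leaderboard entries. The following is a minimal Python sketch, assuming the scores are percent accuracy and are copied by hand from the leaderboard; the `leaderboard` list and the variable names are illustrative and do not reflect any official data format.

```python
from statistics import median

# (model, score) pairs copied from the leaderboard; scores are percent accuracy.
# This list structure is a hypothetical representation chosen for illustration.
leaderboard = [
    ("Gemini", 44.4),
    ("Claude Opus 4.5", 43.2),
    ("o3", 19.8),
    ("Gemini 2.5 Pro", 18.8),
    ("Gemini 2.5 Flash", 11.0),
    ("DeepSeek R1", 8.6),
    ("o1", 8.0),
    ("Claude 3.5 Sonnet", 4.1),
    ("GPT-4o", 3.3),
]

scores = [score for _, score in leaderboard]

models_tested = len(leaderboard)  # number of leaderboard rows
best_score = max(scores)          # highest accuracy on the board
median_score = median(scores)     # middle value of the sorted scores

print(f"Models tested: {models_tested}")
print(f"Best score: {best_score}%")
print(f"Median score: {median_score}%")
```

With an odd number of entries the median is simply the fifth-ranked score, which is why it matches the Gemini 2.5 Flash row.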