
Humanity's Last Exam

Reasoning

A benchmark of 2,500+ expert-level questions spanning dozens of academic disciplines, designed to be the hardest public AI evaluation. Questions were contributed by domain experts worldwide.

Models tested: 7
Best score: 19.8%
Median score: 8.6%
Scoring: accuracy
Introduced: 2025-01
Maintainer: Scale AI / Center for AI Safety

Leaderboard (7 models)

#  Model              Developer        Score
1  o3                 OpenAI           19.8%
2  Gemini 2.5 Pro     Google DeepMind  18.8%
3  Gemini 2.5 Flash   Google DeepMind  11.0%
4  DeepSeek R1        DeepSeek          8.6%
5  o1                 OpenAI            8.0%
6  Claude 3.5 Sonnet  Anthropic         4.1%
7  GPT-4o             OpenAI            3.3%
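
The summary statistics above follow directly from the leaderboard: the best score is the maximum of the seven model accuracies, and the median score is the middle value when they are sorted. A minimal sketch of that calculation, with the scores hard-coded from the table above:

```python
# Per-model HLE accuracy from the leaderboard above (percent).
scores = {
    "o3": 19.8,
    "Gemini 2.5 Pro": 18.8,
    "Gemini 2.5 Flash": 11.0,
    "DeepSeek R1": 8.6,
    "o1": 8.0,
    "Claude 3.5 Sonnet": 4.1,
    "GPT-4o": 3.3,
}

values = sorted(scores.values())
best = max(values)                 # 19.8
median = values[len(values) // 2]  # middle of 7 sorted values -> 8.6

print(f"Models tested: {len(scores)}")
print(f"Best score:    {best}%")
print(f"Median score:  {median}%")
```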