Humanity's Last Exam
Category: Reasoning

A benchmark of 2,500+ expert-level questions across dozens of academic disciplines, designed to be the hardest public AI evaluation. Questions are contributed by domain experts worldwide.
Models tested: 7
Best score: 19.8%
Median score: 8.6%
Scoring: accuracy
Introduced: 2025-01
Maintainer: Scale AI / Center for AI Safety
Leaderboard (7 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | o3 | OpenAI | 19.8% |
| 🥈 | Gemini 2.5 Pro | Google DeepMind | 18.8% |
| 🥉 | Gemini 2.5 Flash | Google DeepMind | 11.0% |
| 4 | DeepSeek R1 | DeepSeek | 8.6% |
| 5 | o1 | OpenAI | 8.0% |
| 6 | Claude 3.5 Sonnet | Anthropic | 4.1% |
| 7 | GPT-4o | OpenAI | 3.3% |