MMLU
Category: Knowledge

Massive Multitask Language Understanding: a multiple-choice benchmark covering 57 academic subjects, from STEM to the humanities.
Models tested: 37
Best score: 92.7%
Median score: 88.4%
Scoring: accuracy
Introduced: 2021-01
Maintainer: Dan Hendrycks et al.
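Since the scoring metric is plain accuracy, an MMLU-style score is just the fraction of multiple-choice questions answered correctly. A minimal sketch (the letters below are illustrative data, not actual benchmark content):

```python
def mmlu_accuracy(predictions, gold):
    """Fraction of questions where the predicted choice matches the answer key."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical model outputs and answer key (A-D choice letters).
predictions = ["A", "C", "B", "D"]
gold        = ["A", "C", "D", "D"]
score = mmlu_accuracy(predictions, gold)  # 3 of 4 correct -> 0.75
```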
Leaderboard (37 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Grok-3 | xAI | 92.7% |
| 🥈 | o1 | OpenAI | 92.3% |
| 🥉 | Llama 4 Maverick | Meta AI (FAIR) | 92.2% |
| 4 | o3 | OpenAI | 92.0% |
| 5 | o1-preview | OpenAI | 90.8% |
| 6 | DeepSeek R1 | DeepSeek | 90.8% |
| 7 | Gemini 1.0 Ultra | Google DeepMind | 90.0% |
| 8 | Gemini 2.5 Pro | Google DeepMind | 89.8% |
| 9 | o4-mini | OpenAI | 89.7% |
| 10 | Gemini 2.0 Flash | Google DeepMind | 89.7% |
| 11 | GPT-4.1 | OpenAI | 89.3% |
| 12 | Llama 4 Scout | Meta AI (FAIR) | 89.3% |
| 13 | Claude Opus 4 | Anthropic | 89.0% |
| 14 | Claude Sonnet 4 | Anthropic | 88.9% |
| 15 | Claude 3.5 Sonnet | Anthropic | 88.7% |
| 16 | GPT-4o | OpenAI | 88.7% |
| 17 | Llama 3.1 | Meta AI (FAIR) | 88.6% |
| 18 | DeepSeek V3 | DeepSeek | 88.5% |
| 19 | Claude 3.7 Sonnet | Anthropic | 88.4% |
| 20 | Grok-2 | xAI | 87.5% |
| 21 | o3-mini | OpenAI | 86.9% |
| 22 | Claude 3 Opus | Anthropic | 86.8% |
| 23 | GPT-4 Turbo | OpenAI | 86.7% |
| 24 | Gemini 2.5 Flash | Google DeepMind | 86.6% |
| 25 | GPT-4 | OpenAI | 86.4% |
| 26 | Llama 3.3 | Meta AI (FAIR) | 86.0% |
| 27 | Gemini 1.5 Pro | Google DeepMind | 85.9% |
| 28 | o1-mini | OpenAI | 85.2% |
| 29 | Mistral Large 2 | Mistral AI | 84.0% |
| 30 | GPT-4o mini | OpenAI | 82.0% |
| 31 | Llama 3 | Meta AI (FAIR) | 82.0% |
| 32 | Claude 3.5 Haiku | Anthropic | 80.9% |
| 33 | Gemini 1.5 Flash | Google DeepMind | 78.9% |
| 34 | Claude 2 | Anthropic | 78.5% |
| 35 | Mixtral 8x7B | Mistral AI | 70.6% |
| 36 | GPT-3.5 Turbo | OpenAI | 70.0% |
| 37 | Llama 2 | Meta AI (FAIR) | 68.9% |
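The summary statistics quoted above (best 92.7%, median 88.4%) can be reproduced directly from the table; with 37 entries, the median is the 19th score in ranked order. A quick sketch:

```python
import statistics

# Scores from the leaderboard above, in ranked order (37 models).
scores = [92.7, 92.3, 92.2, 92.0, 90.8, 90.8, 90.0, 89.8, 89.7, 89.7,
          89.3, 89.3, 89.0, 88.9, 88.7, 88.7, 88.6, 88.5, 88.4, 87.5,
          86.9, 86.8, 86.7, 86.6, 86.4, 86.0, 85.9, 85.2, 84.0, 82.0,
          82.0, 80.9, 78.9, 78.5, 70.6, 70.0, 68.9]

best = max(scores)                  # 92.7 (Grok-3)
median = statistics.median(scores)  # middle (19th) of 37 values: 88.4
print(f"models={len(scores)} best={best}% median={median}%")
```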