Longterm Wiki

MMLU-Pro

Knowledge
A harder variant of MMLU that uses 10 answer choices per question (vs. 4 in MMLU), emphasizes chain-of-thought reasoning, and is less sensitive to prompt format. It is designed to discriminate better among top models.
Models Tested: 22
Best Score: 90.99
Median Score: 77.2
Scoring: accuracy
Introduced: 2024-06
Maintainer: TIGER Lab
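
As a rough illustration of accuracy scoring on a 10-option multiple-choice benchmark, the sketch below computes percent accuracy and the random-guess baseline. The function names and the letter labels A to J are assumptions made for this example and are not taken from the benchmark's own evaluation code.

```python
from typing import Sequence

# Assumed option labels for illustration: 10 choices here, 4 in the original MMLU.
CHOICES_PRO = "ABCDEFGHIJ"
CHOICES_MMLU = "ABCD"


def chance_accuracy(num_choices: int) -> float:
    """Expected percent accuracy of uniform random guessing."""
    return 100.0 / num_choices


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Percent of questions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    print(chance_accuracy(len(CHOICES_PRO)))   # random-guess floor with 10 options
    print(chance_accuracy(len(CHOICES_MMLU)))  # random-guess floor with 4 options
    print(accuracy(["A", "J", "C", "D"], ["A", "J", "B", "D"]))
```

With 10 options, random guessing scores about 10% instead of 25%, which is one reason the same raw accuracy means more on this benchmark than on a 4-option one.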

Leaderboard (22 models)

#     Model              Developer          Score (%)
1 🥇  Gemini             Google DeepMind    90.99
2 🥈  Claude Opus 4.5    Anthropic          89.5
3 🥉  DeepSeek R1        DeepSeek           84
4     Llama 4 Maverick   Meta AI (FAIR)     80.5
5     o1                 OpenAI             80.3
6     Grok               xAI                79.9
7     Grok-3             xAI                79.9
8     Gemini 2.5 Pro     Google DeepMind    79.1
9     Claude 3.7 Sonnet  Anthropic          78.4
10    Claude 3.5 Sonnet  Anthropic          78
11    Claude             Anthropic          78
12    Gemini 2.0 Flash   Google DeepMind    76.4
13    Gemini 2.5 Flash   Google DeepMind    76
14    DeepSeek Models    DeepSeek           75.9
15    DeepSeek V3        DeepSeek           75.9
16    Llama 4 Scout      Meta AI (FAIR)     74.3
17    GPT-4.1            OpenAI             74.3
18    Llama              Meta AI (FAIR)     73.3
19    GPT-4o             OpenAI             72.6
20    Llama 3.3          Meta AI (FAIR)     68.9
21    Gemini 1.5 Flash   Google DeepMind    67.3
22    GPT-4o mini        OpenAI             63.2
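
The summary figures at the top of the page (models tested, best score, median score) can be checked against the leaderboard with a few lines of Python. This is a minimal sketch: the scores are copied from the leaderboard above as percentages, and the variable names are chosen for this example only.

```python
from statistics import median

# Leaderboard scores in rank order, as accuracy percentages.
scores = [
    90.99, 89.5, 84.0, 80.5, 80.3, 79.9, 79.9, 79.1, 78.4, 78.0, 78.0,
    76.4, 76.0, 75.9, 75.9, 74.3, 74.3, 73.3, 72.6, 68.9, 67.3, 63.2,
]

models_tested = len(scores)
best_score = max(scores)
# With an even number of entries, the median is the mean of the two middle
# values (ranks 11 and 12); round to suppress floating-point noise.
median_score = round(median(scores), 2)

if __name__ == "__main__":
    print(models_tested, best_score, median_score)
```

With 22 entries, the median is the mean of the scores at ranks 11 and 12 (78 and 76.4), which gives the 77.2 reported in the summary.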