Longterm Wiki

MMLU-Pro

Knowledge

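A harder variant of MMLU that gives each question up to 10 answer choices (vs. 4 in MMLU), focuses on reasoning-heavy questions typically answered with chain-of-thought prompting, and is less sensitive to prompt format. It is designed to discriminate better among top models.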
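To illustrate why more answer choices make the benchmark harder to score well on by luck, the sketch below computes the expected accuracy of uniform random guessing. It assumes every question has the full number of options; the function name `chance_accuracy` is hypothetical and not part of any benchmark tooling.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing over num_choices options."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 1.0 / num_choices


# Assumption: all questions carry the full option count.
mmlu_chance = chance_accuracy(4)       # 4 choices, as in MMLU
mmlu_pro_chance = chance_accuracy(10)  # 10 choices, as in MMLU-Pro

print(f"MMLU chance level:     {mmlu_chance:.0%}")
print(f"MMLU-Pro chance level: {mmlu_pro_chance:.0%}")
```

Under this assumption the guessing floor drops from 25% to 10%, so a given score reflects less luck.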

Models Tested: 15
Best Score: 84%
Median Score: 76%
Scoring: accuracy
Introduced: 2024-06
Maintainer: TIGER Lab
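As a rough illustration of the accuracy metric, the sketch below scores a list of predicted option letters against the gold answers. The `accuracy` helper and the toy data are assumptions made for this example, not the maintainer's evaluation code, and the sketch leaves out how an answer letter is extracted from a model's chain-of-thought output.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions whose predicted option letter matches the gold letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Toy data (hypothetical): option letters run from A to J with 10 choices.
preds = ["A", "J", "C", "D"]
gold = ["A", "J", "B", "D"]
score = accuracy(preds, gold)
print(f"accuracy = {score:.1%}")
```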

Leaderboard (15 models)

| # | Model | Developer | Score |
|---|-------|-----------|-------|
| 🥇 | DeepSeek R1 | DeepSeek | 84% |
| 🥈 | Llama 4 Maverick | Meta AI (FAIR) | 80.5% |
| 🥉 | o1 | OpenAI | 80.3% |
| 4 | Grok-3 | xAI | 79.9% |
| 5 | Gemini 2.5 Pro | Google DeepMind | 79.1% |
| 6 | Claude 3.5 Sonnet | Anthropic | 78% |
| 7 | Gemini 2.0 Flash | Google DeepMind | 76.4% |
| 8 | Gemini 2.5 Flash | Google DeepMind | 76% |
| 9 | DeepSeek V3 | DeepSeek | 75.9% |
| 10 | GPT-4.1 | OpenAI | 74.3% |
| 11 | Llama 4 Scout | Meta AI (FAIR) | 74.3% |
| 12 | GPT-4o | OpenAI | 72.6% |
| 13 | Llama 3.3 | Meta AI (FAIR) | 68.9% |
| 14 | Gemini 1.5 Flash | Google DeepMind | 67.3% |
| 15 | GPT-4o mini | OpenAI | 63.2% |
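The summary figures at the top of the page (models tested, best score, median score) can be recomputed from the leaderboard. The sketch below copies the scores listed above into a dictionary; the variable names are illustrative only.

```python
from statistics import median

# Scores in percent, copied from the leaderboard above.
scores = {
    "DeepSeek R1": 84.0,
    "Llama 4 Maverick": 80.5,
    "o1": 80.3,
    "Grok-3": 79.9,
    "Gemini 2.5 Pro": 79.1,
    "Claude 3.5 Sonnet": 78.0,
    "Gemini 2.0 Flash": 76.4,
    "Gemini 2.5 Flash": 76.0,
    "DeepSeek V3": 75.9,
    "GPT-4.1": 74.3,
    "Llama 4 Scout": 74.3,
    "GPT-4o": 72.6,
    "Llama 3.3": 68.9,
    "Gemini 1.5 Flash": 67.3,
    "GPT-4o mini": 63.2,
}

models_tested = len(scores)
best_model = max(scores, key=scores.get)  # model with the highest score
best_score = scores[best_model]
median_score = median(scores.values())    # middle value of the 15 scores

print(f"Models tested: {models_tested}")
print(f"Best score:    {best_score}% ({best_model})")
print(f"Median score:  {median_score}%")
```

With 15 entries the median is the 8th-ranked score, which matches the 76% reported in the summary.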