MMLU

Knowledge

Massive Multitask Language Understanding — a multiple-choice benchmark covering 57 academic subjects from STEM to humanities.

Models tested: 37
Best score: 92.7%
Median score: 88.4%
Scoring: accuracy (see the sketch below)
Introduced: 2021-01
Maintainer: Dan Hendrycks et al.
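Scoring is plain multiple-choice accuracy: the model picks one of four answer options per question, and the reported figure is the fraction answered correctly. Below is a minimal sketch of that computation, assuming the Hugging Face `cais/mmlu` dataset layout (`question`, `choices`, and an `answer` index) and a hypothetical `answer_question()` model call; it is an illustration, not the official evaluation harness.

```python
# Minimal MMLU accuracy sketch (not the official harness).
# Assumes the Hugging Face "cais/mmlu" dataset fields: question, choices, answer (0-3 index).
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]


def answer_question(question: str, choices: list[str]) -> str:
    """Hypothetical model call: return 'A', 'B', 'C', or 'D' for the given question."""
    # Replace this stub with a real model query (API call, local inference, etc.).
    return "A"


def mmlu_accuracy(subset: str = "all", split: str = "test") -> float:
    """Accuracy = correctly answered questions / total questions."""
    data = load_dataset("cais/mmlu", subset, split=split)
    correct = 0
    for item in data:
        prediction = answer_question(item["question"], item["choices"])
        gold = LETTERS[item["answer"]]  # the gold answer is stored as an index into choices
        correct += prediction == gold
    return correct / len(data)


if __name__ == "__main__":
    print(f"MMLU accuracy: {mmlu_accuracy():.1%}")
```

The stub `answer_question()` stands in for whatever prompting and answer-extraction scheme a real evaluation would use.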

Leaderboard (37 models)

| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Grok-3 | xAI | 92.7% |
| 🥈 | o1 | OpenAI | 92.3% |
| 🥉 | Llama 4 Maverick | Meta AI (FAIR) | 92.2% |
| 4 | o3 | OpenAI | 92.0% |
| 5 | o1-preview | OpenAI | 90.8% |
| 6 | DeepSeek R1 | DeepSeek | 90.8% |
| 7 | Gemini 1.0 Ultra | Google DeepMind | 90.0% |
| 8 | Gemini 2.5 Pro | Google DeepMind | 89.8% |
| 9 | o4-mini | OpenAI | 89.7% |
| 10 | Gemini 2.0 Flash | Google DeepMind | 89.7% |
| 11 | GPT-4.1 | OpenAI | 89.3% |
| 12 | Llama 4 Scout | Meta AI (FAIR) | 89.3% |
| 13 | Claude Opus 4 | Anthropic | 89.0% |
| 14 | Claude Sonnet 4 | Anthropic | 88.9% |
| 15 | Claude 3.5 Sonnet | Anthropic | 88.7% |
| 16 | GPT-4o | OpenAI | 88.7% |
| 17 | Llama 3.1 | Meta AI (FAIR) | 88.6% |
| 18 | DeepSeek V3 | DeepSeek | 88.5% |
| 19 | Claude 3.7 Sonnet | Anthropic | 88.4% |
| 20 | Grok-2 | xAI | 87.5% |
| 21 | o3-mini | OpenAI | 86.9% |
| 22 | Claude 3 Opus | Anthropic | 86.8% |
| 23 | GPT-4 Turbo | OpenAI | 86.7% |
| 24 | Gemini 2.5 Flash | Google DeepMind | 86.6% |
| 25 | GPT-4 | OpenAI | 86.4% |
| 26 | Llama 3.3 | Meta AI (FAIR) | 86.0% |
| 27 | Gemini 1.5 Pro | Google DeepMind | 85.9% |
| 28 | o1-mini | OpenAI | 85.2% |
| 29 | Mistral Large 2 | Mistral AI | 84.0% |
| 30 | GPT-4o mini | OpenAI | 82.0% |
| 31 | Llama 3 | Meta AI (FAIR) | 82.0% |
| 32 | Claude 3.5 Haiku | Anthropic | 80.9% |
| 33 | Gemini 1.5 Flash | Google DeepMind | 78.9% |
| 34 | Claude 2 | Anthropic | 78.5% |
| 35 | Mixtral 8x7B | Mistral AI | 70.6% |
| 36 | GPT-3.5 Turbo | OpenAI | 70.0% |
| 37 | Llama 2 | Meta AI (FAIR) | 68.9% |