Longterm Wiki
Massive Multitask Language Understanding (MMLU): a multiple-choice benchmark covering 57 academic subjects, from STEM to the humanities.
Models Tested: 48
Best Score: 92.7%
Median Score: 87.95%
Scoring: accuracy
Introduced: 2021-01
Maintainer: Dan Hendrycks et al.
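
Since scoring is plain accuracy on multiple-choice questions, a score can be sketched as the percentage of questions where the predicted letter matches the answer key. The function name and the sample data below are hypothetical, and the page does not say whether scores are averaged per question or per subject; this sketch assumes a simple per-question average.

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that match the answer key.

    Assumes a per-question (micro) average; the page does not state
    how the leaderboard aggregates across the 57 subjects.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical example: four questions, three answered correctly.
example_score = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
print(f"{example_score:.1f}%")
```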

Leaderboard (48 models)

#   Model              Developer        Score
1   Grok-3             xAI              92.7%
2   Gemini             Google DeepMind  92.4%
3   o1                 OpenAI           92.3%
4   Llama 4 Maverick   Meta AI (FAIR)   92.2%
5   Claude Opus 4.6    Anthropic        92.1%
6   o3                 OpenAI           92%
7   DeepSeek R1        DeepSeek         90.8%
8   o1-preview         OpenAI           90.8%
9   Claude             Anthropic        90.5%
10  Gemini 1.0 Ultra   Google DeepMind  90%
11  Gemini 2.5 Pro     Google DeepMind  89.8%
12  Gemini 2.0 Flash   Google DeepMind  89.7%
13  o4-mini            OpenAI           89.7%
14  Llama 4 Scout      Meta AI (FAIR)   89.3%
15  GPT-4.1            OpenAI           89.3%
16  Claude Opus 4      Anthropic        89%
17  Claude Sonnet 4    Anthropic        88.9%
18  Claude 3.5 Sonnet  Anthropic        88.7%
19  GPT                OpenAI           88.7%
20  GPT-4o             OpenAI           88.7%
21  Llama 3.1          Meta AI (FAIR)   88.6%
22  DeepSeek Models    DeepSeek         88.5%
23  DeepSeek V3        DeepSeek         88.5%
24  Claude 3.7 Sonnet  Anthropic        88.4%
25  Grok-2             xAI              87.5%
26  Llama              Meta AI (FAIR)   87.3%
27  o3-mini            OpenAI           86.9%
28  Claude 3 Opus      Anthropic        86.8%
29  GPT-4 Turbo        OpenAI           86.7%
30  Gemini 2.5 Flash   Google DeepMind  86.6%
31  GPT-4              OpenAI           86.4%
32  Llama 3.3          Meta AI (FAIR)   86%
33  Gemini 1.5 Pro     Google DeepMind  85.9%
34  o1-mini            OpenAI           85.2%
35  Mistral Large 2    Mistral AI       84%
36  Llama 3            Meta AI (FAIR)   82%
37  GPT-4o mini        OpenAI           82%
38  Claude 3.5 Haiku   Anthropic        80.9%
39  GPT-4.1 nano       OpenAI           80.1%
40  GPT-4.1 mini       OpenAI           80.1%
41  Gemini 1.5 Flash   Google DeepMind  78.9%
42  Claude 2           Anthropic        78.5%
43  Claude 3 Sonnet    Anthropic        78.3%
44  Claude 3 Haiku     Anthropic        75.2%
45  Mixtral 8x7B       Mistral AI       70.6%
46  GPT-3.5 Turbo      OpenAI           70%
47  Llama 2            Meta AI (FAIR)   68.9%
48  Mistral            Mistral AI       60.1%
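
The summary figures at the top of the page (models tested, best score, median score) can be recomputed from the leaderboard scores. The sketch below copies the 48 scores from the list above; the `summarize` helper and its key names are hypothetical and only illustrate the arithmetic. With an even number of entries, the median is the mean of the 24th and 25th scores (88.4 and 87.5).

```python
from statistics import median

# Scores copied from the leaderboard above, in rank order (percent accuracy).
SCORES = [
    92.7, 92.4, 92.3, 92.2, 92.1, 92.0, 90.8, 90.8, 90.5, 90.0, 89.8, 89.7,
    89.7, 89.3, 89.3, 89.0, 88.9, 88.7, 88.7, 88.7, 88.6, 88.5, 88.5, 88.4,
    87.5, 87.3, 86.9, 86.8, 86.7, 86.6, 86.4, 86.0, 85.9, 85.2, 84.0, 82.0,
    82.0, 80.9, 80.1, 80.1, 78.9, 78.5, 78.3, 75.2, 70.6, 70.0, 68.9, 60.1,
]


def summarize(scores):
    """Return the count, best score, and median of a list of scores."""
    return {
        "models_tested": len(scores),
        "best": max(scores),
        # Rounded to two decimals to absorb floating-point noise.
        "median": round(median(scores), 2),
    }


summary = summarize(SCORES)
print(summary)
```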