Longterm Wiki

GPQA Diamond

Reasoning
The Diamond subset of GPQA (Graduate-level Google-Proof Q&A): extremely difficult questions in physics, chemistry, and biology that even domain experts struggle with.
Models Tested: 42
Best Score: 94.3%
Median Score: 65%
Scoring: accuracy
Introduced: 2023-11
Maintainer: David Rein et al.

Leaderboard (42 models)

#   Model              Developer        Score
🥇  Gemini             Google DeepMind  94.3%
🥈  Claude Opus 4.6    Anthropic        91.3%
🥉  Claude Opus 4.5    Anthropic        87%
4   Gemini 2.5 Pro     Google DeepMind  84%
5   Claude Sonnet 4.5  Anthropic        83.4%
6   o3                 OpenAI           83.3%
7   Gemini 2.5 Flash   Google DeepMind  82.8%
8   o4-mini            OpenAI           81.4%
9   Claude Opus 4.1    Anthropic        80.9%
10  Grok-3             xAI              80%
11  o3-mini            OpenAI           79.7%
12  o1                 OpenAI           79.2%
13  o1-preview         OpenAI           78%
14  Claude Opus 4      Anthropic        74.1%
15  Claude Sonnet 4.6  Anthropic        74.1%
16  DeepSeek R1        DeepSeek         71.5%
17  Claude Sonnet 4    Anthropic        70.3%
18  Llama 4 Maverick   Meta AI (FAIR)   69.8%
19  Claude 3.7 Sonnet  Anthropic        68%
20  Claude 3.5 Sonnet  Anthropic        65%
21  GPT-4.1 mini       OpenAI           65%
22  Claude             Anthropic        65%
23  o1-mini            OpenAI           60%
24  DeepSeek Models    DeepSeek         59.1%
25  DeepSeek V3        DeepSeek         59.1%
26  Llama 4 Scout      Meta AI (FAIR)   57.2%
27  Gemini 2.0 Flash   Google DeepMind  57%
28  Grok-2             xAI              56.4%
29  GPT-4.1            OpenAI           56.4%
30  GPT-4o             OpenAI           53.6%
31  Llama 3.1          Meta AI (FAIR)   50.7%
32  Claude 3 Opus      Anthropic        50.4%
33  GPT-4 Turbo        OpenAI           49.3%
34  Llama 3.3          Meta AI (FAIR)   49.2%
35  Mistral Large 2    Mistral AI       43.9%
36  Claude 3.5 Haiku   Anthropic        41.6%
37  Claude 3 Sonnet    Anthropic        40.4%
38  GPT-4o mini        OpenAI           39.8%
39  Llama 3            Meta AI (FAIR)   39.5%
40  GPT-4              OpenAI           35.7%
41  Gemini 1.0 Ultra   Google DeepMind  35.4%
42  Claude 3 Haiku     Anthropic        33.3%
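
The summary figures at the top of the page (models tested, best score, median score) can be recomputed from the leaderboard. The following is a minimal sketch, assuming the scores are accuracy percentages transcribed in rank order from the rows above; the variable names are illustrative and not part of the wiki.

```python
from statistics import median

# Accuracy scores (percent) transcribed from the leaderboard, ranks 1 to 42.
scores = [
    94.3, 91.3, 87, 84, 83.4, 83.3, 82.8, 81.4, 80.9, 80,
    79.7, 79.2, 78, 74.1, 74.1, 71.5, 70.3, 69.8, 68, 65,
    65, 65, 60, 59.1, 59.1, 57.2, 57, 56.4, 56.4, 53.6,
    50.7, 50.4, 49.3, 49.2, 43.9, 41.6, 40.4, 39.8, 39.5, 35.7,
    35.4, 33.3,
]

models_tested = len(scores)
best_score = max(scores)
# With an even number of entries, the median is the mean of the two
# middle values (ranks 21 and 22 here).
median_score = median(scores)

print(f"Models tested: {models_tested}")
print(f"Best score: {best_score}%")
print(f"Median score: {median_score}%")
```

Because ranks 20 through 22 all sit at 65%, the median is unaffected by how those tied entries are ordered.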