Longterm Wiki

GPQA Diamond

Reasoning

The Diamond subset of Graduate-Level Google-Proof Q&A (GPQA) — extremely difficult multiple-choice questions in physics, chemistry, and biology that even PhD-level domain experts struggle with, and that skilled non-experts cannot reliably answer even with unrestricted web access.

Models tested: 34
Best score: 91.3%
Median score: 62.5%
Scoring: accuracy
Introduced: 2023-11
Maintainer: David Rein et al.
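Since the benchmark is scored by plain accuracy over multiple-choice questions, the metric reduces to the fraction of items answered correctly. A minimal sketch (the answer labels and helper below are illustrative, not the dataset's actual schema):

```python
def accuracy(predictions, answers):
    """Fraction of predictions matching the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must align")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Toy example: 5 of 8 choices correct.
preds = ["A", "C", "B", "D", "A", "B", "C", "D"]
gold  = ["A", "C", "B", "A", "A", "C", "C", "B"]
print(f"{accuracy(preds, gold):.1%}")  # 62.5%
```

Leaderboard percentages below are this quantity reported to one decimal place; ties (e.g. two models at 74.1%) are simply equal accuracy scores.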

Leaderboard (34 models)

| # | Model | Developer | Score |
|---|-------|-----------|-------|
| 🥇 | Claude Opus 4.6 | Anthropic | 91.3% |
| 🥈 | Claude Opus 4.5 | Anthropic | 87% |
| 🥉 | Gemini 2.5 Pro | Google DeepMind | 84% |
| 4 | o3 | OpenAI | 83.3% |
| 5 | Gemini 2.5 Flash | Google DeepMind | 82.8% |
| 6 | o4-mini | OpenAI | 81.4% |
| 7 | Grok-3 | xAI | 80% |
| 8 | o3-mini | OpenAI | 79.7% |
| 9 | o1 | OpenAI | 79.2% |
| 10 | o1-preview | OpenAI | 78% |
| 11 | Claude Opus 4 | Anthropic | 74.1% |
| 12 | Claude Sonnet 4.6 | Anthropic | 74.1% |
| 13 | DeepSeek R1 | DeepSeek | 71.5% |
| 14 | Claude Sonnet 4 | Anthropic | 70.3% |
| 15 | Llama 4 Maverick | Meta AI (FAIR) | 69.8% |
| 16 | Claude 3.7 Sonnet | Anthropic | 68% |
| 17 | Claude 3.5 Sonnet | Anthropic | 65% |
| 18 | o1-mini | OpenAI | 60% |
| 19 | DeepSeek V3 | DeepSeek | 59.1% |
| 20 | Llama 4 Scout | Meta AI (FAIR) | 57.2% |
| 21 | Gemini 2.0 Flash | Google DeepMind | 57% |
| 22 | GPT-4.1 | OpenAI | 56.4% |
| 23 | Grok-2 | xAI | 56.4% |
| 24 | GPT-4o | OpenAI | 53.6% |
| 25 | Llama 3.1 | Meta AI (FAIR) | 50.7% |
| 26 | Claude 3 Opus | Anthropic | 50.4% |
| 27 | GPT-4 Turbo | OpenAI | 49.3% |
| 28 | Llama 3.3 | Meta AI (FAIR) | 49.2% |
| 29 | Mistral Large 2 | Mistral AI | 43.9% |
| 30 | Claude 3.5 Haiku | Anthropic | 41.6% |
| 31 | GPT-4o mini | OpenAI | 39.8% |
| 32 | Llama 3 | Meta AI (FAIR) | 39.5% |
| 33 | GPT-4 | OpenAI | 35.7% |
| 34 | Gemini 1.0 Ultra | Google DeepMind | 35.4% |