Longterm Wiki

HumanEval

Coding
A benchmark of 164 hand-written Python programming problems with unit tests, evaluating code generation from docstrings.
Models Tested: 38
Best Score: 95.4
Median Score: 88.7
Scoring: pass_at_1
Introduced: 2021-07
Maintainer: OpenAI
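
The pass_at_1 scoring above corresponds to the pass@k metric with k = 1: the share of problems for which a generated solution passes the unit tests. The following is a minimal sketch, assuming the commonly used unbiased estimator 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them pass. The function names and the sample counts are illustrative assumptions, not taken from this page or from any model's actual evaluation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k draw contains a passing sample.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over all problems, expressed as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: (samples drawn, samples passing) for three problems.
example = [(10, 10), (10, 5), (10, 0)]
score = benchmark_score(example, k=1)
```

With k = 1 the estimator reduces to c / n per problem, so a leaderboard score can be read as the average per-problem pass rate for a single sample.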

Leaderboard (38 models)

#   Model              Developer        Score (%)
1   Claude Opus 4.6    Anthropic        95.4
2   Claude 3.7 Sonnet  Anthropic        94
3   o1                 OpenAI           94
4   Claude             Anthropic        93.7
5   Grok-3             xAI              93
6   Gemini 2.5 Pro     Google DeepMind  92.7
7   o1-mini            OpenAI           92.4
8   o1-preview         OpenAI           92.4
9   Claude Opus 4.5    Anthropic        92
10  Claude 3.5 Sonnet  Anthropic        92
11  DeepSeek R1        DeepSeek         92
12  Mistral Large 2    Mistral AI       92
13  GPT                OpenAI           90.2
14  GPT-4o             OpenAI           90.2
15  Gemini             Google DeepMind  89.7
16  Claude Opus 4.1    Anthropic        89.5
17  Llama              Meta AI (FAIR)   89
18  Gemini 2.0 Flash   Google DeepMind  89
19  Llama 3.1          Meta AI (FAIR)   89
20  Llama 3.3          Meta AI (FAIR)   88.4
21  Grok-2             xAI              88.4
22  GPT-4 Turbo        OpenAI           88.2
23  Claude 3.5 Haiku   Anthropic        88.1
24  GPT-4o mini        OpenAI           87.2
25  Grok               xAI              86.5
26  Claude 3 Opus      Anthropic        84.9
27  DeepSeek V3        DeepSeek         82.6
28  Llama 3            Meta AI (FAIR)   81.7
29  Claude 3 Haiku     Anthropic        75.9
30  Gemini 1.0 Ultra   Google DeepMind  74.4
31  Gemini 1.5 Flash   Google DeepMind  74.3
32  Claude 3 Sonnet    Anthropic        73
33  Gemini 1.5 Pro     Google DeepMind  71.9
34  Claude 2           Anthropic        71.2
35  GPT-3.5 Turbo      OpenAI           68
36  GPT-4              OpenAI           67
37  DeepSeek Models    DeepSeek         65.2
38  Mistral            Mistral AI       30.5
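
The summary figures at the top of the page (models tested, best score, median score) can be recomputed from the leaderboard. The sketch below copies the 38 scores from the table in rank order and derives the three values with the standard library. The variable names are illustrative.

```python
from statistics import median

# Leaderboard scores in rank order, copied from the table above (pass@1, %).
scores = [
    95.4, 94, 94, 93.7, 93, 92.7, 92.4, 92.4, 92, 92,
    92, 92, 90.2, 90.2, 89.7, 89.5, 89, 89, 89, 88.4,
    88.4, 88.2, 88.1, 87.2, 86.5, 84.9, 82.6, 81.7, 75.9, 74.4,
    74.3, 73, 71.9, 71.2, 68, 67, 65.2, 30.5,
]

models_tested = len(scores)
best_score = max(scores)
# With an even number of entries, the median is the mean of the two middle
# values (ranks 19 and 20 here).
median_score = round(median(scores), 1)

print(models_tested, best_score, median_score)
```

With 38 entries the median falls between ranks 19 and 20 (89 and 88.4), which matches the 88.7 reported in the page header.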