
HumanEval

Category: Coding

A benchmark of 164 hand-written Python programming problems, each with unit tests, that evaluates a model's ability to generate a correct function body from a signature and docstring.
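Each problem supplies a function signature and docstring as the prompt; the model generates the function body, which is then executed against held-out unit tests. A sketch of the task format (the function and tests below are illustrative, not verbatim from the dataset):

```python
from typing import List

# Prompt given to the model: signature plus docstring.
def below_threshold(numbers: List[float], threshold: float) -> bool:
    """Return True if every number in the list is strictly below threshold.

    >>> below_threshold([1.0, 2.0, 3.0], 4.0)
    True
    >>> below_threshold([1.0, 5.5, 3.0], 4.0)
    False
    """
    # --- model-generated completion starts here ---
    return all(n < threshold for n in numbers)

# Hidden unit tests of this shape decide pass/fail for the problem.
assert below_threshold([1.0, 2.0, 3.0], 4.0) is True
assert below_threshold([1.0, 5.5, 3.0], 4.0) is False
```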

Models Tested: 25
Best Score: 94%
Median Score: 88.4%
Scoring: pass@1
Introduced: July 2021
Maintainer: OpenAI
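Scores on this page are pass@1: the fraction of the 164 problems for which a single sampled completion passes all of its unit tests. When several completions per problem are sampled, the HumanEval paper (Chen et al., 2021) estimates pass@k from n samples with c passes using an unbiased, numerically stable estimator; a minimal sketch, assuming numpy:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: 1 - C(n-c, k) / C(n, k),
    computed as a running product to avoid large binomials.

    n: total completions sampled for the problem
    c: completions that passed all unit tests
    k: budget of attempts being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With k=1 this reduces to the raw pass rate c / n:
assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-12
```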

Leaderboard (25 models)

| # | Model | Developer | Score |
|---|-------|-----------|-------|
| 🥇 | o1 | OpenAI | 94% |
| 🥈 | Grok-3 | xAI | 93% |
| 🥉 | Gemini 2.5 Pro | Google DeepMind | 92.7% |
| 4 | o1-preview | OpenAI | 92.4% |
| 5 | o1-mini | OpenAI | 92.4% |
| 6 | Claude 3.5 Sonnet | Anthropic | 92% |
| 7 | Mistral Large 2 | Mistral AI | 92% |
| 8 | DeepSeek R1 | DeepSeek | 92% |
| 9 | GPT-4o | OpenAI | 90.2% |
| 10 | Gemini 2.0 Flash | Google DeepMind | 89% |
| 11 | Llama 3.1 | Meta AI (FAIR) | 89% |
| 12 | Llama 3.3 | Meta AI (FAIR) | 88.4% |
| 13 | Grok-2 | xAI | 88.4% |
| 14 | GPT-4 Turbo | OpenAI | 88.2% |
| 15 | Claude 3.5 Haiku | Anthropic | 88.1% |
| 16 | GPT-4o mini | OpenAI | 87.2% |
| 17 | Claude 3 Opus | Anthropic | 84.9% |
| 18 | DeepSeek V3 | DeepSeek | 82.6% |
| 19 | Llama 3 | Meta AI (FAIR) | 81.7% |
| 20 | Gemini 1.0 Ultra | Google DeepMind | 74.4% |
| 21 | Gemini 1.5 Flash | Google DeepMind | 74.3% |
| 22 | Gemini 1.5 Pro | Google DeepMind | 71.9% |
| 23 | Claude 2 | Anthropic | 71.2% |
| 24 | GPT-4 | OpenAI | 67% |
| 25 | GPT-3.5 Turbo | OpenAI | 48.1% |