HumanEval
Category: Coding
A benchmark of 164 hand-written Python programming problems with unit tests, evaluating code generation from docstrings.
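Each task supplies a function signature and docstring; the model must complete the function body, which is then executed against the problem's unit tests. A minimal sketch of the task shape (an illustrative problem, not an actual dataset item):

```python
# Illustrative HumanEval-style task (hypothetical problem, not from the dataset).
# The model sees the prompt (signature + docstring) and must produce the body.

PROMPT = '''
def running_max(nums: list[int]) -> list[int]:
    """Return a list where element i is the maximum of nums[:i+1].

    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
'''

# A candidate completion passes only if it satisfies the unit tests.
def check(candidate):
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([4]) == [4]
    assert candidate([2, 2, 1]) == [2, 2, 2]
```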
Models tested: 25
Best score: 94%
Median score: 88.4%
Scoring: pass@1
Introduced: 2021-07
Maintainer: OpenAI
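pass@1 is the fraction of problems solved by a single sampled completion. Scores are commonly computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), which generates n ≥ k samples per problem and counts how many pass; a sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    drawn without replacement from n samples (c of which are correct), passes.

    Implements 1 - C(n-c, k) / C(n, k) in a numerically stable product form.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k draws include a pass
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With k = 1 this reduces to the empirical solve rate c / n.
assert abs(pass_at_k(10, 4, 1) - 0.4) < 1e-9
```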
Leaderboard (25 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | o1 | OpenAI | 94% |
| 🥈 | Grok-3 | xAI | 93% |
| 🥉 | Gemini 2.5 Pro | Google DeepMind | 92.7% |
| 4 | o1-preview | OpenAI | 92.4% |
| 5 | o1-mini | OpenAI | 92.4% |
| 6 | Claude 3.5 Sonnet | Anthropic | 92% |
| 7 | Mistral Large 2 | Mistral AI | 92% |
| 8 | DeepSeek R1 | DeepSeek | 92% |
| 9 | GPT-4o | OpenAI | 90.2% |
| 10 | Gemini 2.0 Flash | Google DeepMind | 89% |
| 11 | Llama 3.1 | Meta AI | 89% |
| 12 | Llama 3.3 | Meta AI | 88.4% |
| 13 | Grok-2 | xAI | 88.4% |
| 14 | GPT-4 Turbo | OpenAI | 88.2% |
| 15 | Claude 3.5 Haiku | Anthropic | 88.1% |
| 16 | GPT-4o mini | OpenAI | 87.2% |
| 17 | Claude 3 Opus | Anthropic | 84.9% |
| 18 | DeepSeek V3 | DeepSeek | 82.6% |
| 19 | Llama 3 | Meta AI | 81.7% |
| 20 | Gemini 1.0 Ultra | Google DeepMind | 74.4% |
| 21 | Gemini 1.5 Flash | Google DeepMind | 74.3% |
| 22 | Gemini 1.5 Pro | Google DeepMind | 71.9% |
| 23 | Claude 2 | Anthropic | 71.2% |
| 24 | GPT-4 | OpenAI | 67% |
| 25 | GPT-3.5 Turbo | OpenAI | 48.1% |