Models Tested: 38
Best Score: 95.4%
Median Score: 88.7%
Scoring: pass_at_1
Introduced: 2021-07
Maintainer: OpenAI
Leaderboard (38 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Claude Opus 4.6 | Anthropic | 95.4% |
| 🥈 | Claude 3.7 Sonnet | Anthropic | 94% |
| 🥉 | o1 | OpenAI | 94% |
| 4 | Claude | Anthropic | 93.7% |
| 5 | Grok-3 | xAI | 93% |
| 6 | Gemini 2.5 Pro | Google DeepMind | 92.7% |
| 7 | o1-mini | OpenAI | 92.4% |
| 8 | o1-preview | OpenAI | 92.4% |
| 9 | Claude Opus 4.5 | Anthropic | 92% |
| 10 | Claude 3.5 Sonnet | Anthropic | 92% |
| 11 | DeepSeek R1 | DeepSeek | 92% |
| 12 | Mistral Large 2 | Mistral AI | 92% |
| 13 | GPT | OpenAI | 90.2% |
| 14 | GPT-4o | OpenAI | 90.2% |
| 15 | Gemini | Google DeepMind | 89.7% |
| 16 | Claude Opus 4.1 | Anthropic | 89.5% |
| 17 | Llama | Meta AI (FAIR) | 89% |
| 18 | Gemini 2.0 Flash | Google DeepMind | 89% |
| 19 | Llama 3.1 | Meta AI (FAIR) | 89% |
| 20 | Llama 3.3 | Meta AI (FAIR) | 88.4% |
| 21 | Grok-2 | xAI | 88.4% |
| 22 | GPT-4 Turbo | OpenAI | 88.2% |
| 23 | Claude 3.5 Haiku | Anthropic | 88.1% |
| 24 | GPT-4o mini | OpenAI | 87.2% |
| 25 | Grok | xAI | 86.5% |
| 26 | Claude 3 Opus | Anthropic | 84.9% |
| 27 | DeepSeek V3 | DeepSeek | 82.6% |
| 28 | Llama 3 | Meta AI (FAIR) | 81.7% |
| 29 | Claude 3 Haiku | Anthropic | 75.9% |
| 30 | Gemini 1.0 Ultra | Google DeepMind | 74.4% |
| 31 | Gemini 1.5 Flash | Google DeepMind | 74.3% |
| 32 | Claude 3 Sonnet | Anthropic | 73% |
| 33 | Gemini 1.5 Pro | Google DeepMind | 71.9% |
| 34 | Claude 2 | Anthropic | 71.2% |
| 35 | GPT-3.5 Turbo | OpenAI | 68% |
| 36 | GPT-4 | OpenAI | 67% |
| 37 | DeepSeek Models | DeepSeek | 65.2% |
| 38 | Mistral | Mistral AI | 30.5% |
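
The scoring field lists `pass_at_1`. Assuming this refers to the usual pass@k metric with k = 1 (the page itself does not define it), a score can be read as the percentage of problems solved by a single sampled completion. The sketch below uses the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k), for a problem with n samples of which c pass the tests. The function names and the `(n, c)` input format are illustrative assumptions, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Hypothetical helper: the names and signature are chosen for illustration.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over all problems, expressed as a percentage.

    `results` holds one (n, c) pair per problem (assumed input format).
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n per problem, so a model that solves three of four problems with one sample each would score 75.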