SWE-bench Verified
Category: Coding
A curated subset of SWE-bench with human-verified task instances for evaluating AI systems on real-world software engineering tasks drawn from GitHub issues.
Models Tested: 18
Best Score: 80.9%
Median Score: 65.95%
Scoring: percentage
Introduced: 2024-08
Maintainer: OpenAI / Princeton NLP
Leaderboard (18 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Claude Opus 4.5 | Anthropic | 80.9% |
| 🥈 | Claude Opus 4.6 | Anthropic | 80.8% |
| 🥉 | Claude Sonnet 4.6 | Anthropic | 79.6% |
| 4 | Claude Sonnet 4.5 | Anthropic | 77.2% |
| 5 | Claude Sonnet 4 | Anthropic | 72.7% |
| 6 | Claude Opus 4 | Anthropic | 72.5% |
| 7 | Claude 3.7 Sonnet | Anthropic | 70.3% |
| 8 | o3 | OpenAI | 69.1% |
| 9 | o4-mini | OpenAI | 68.1% |
| 10 | Gemini 2.5 Pro | Google DeepMind | 63.8% |
| 11 | Gemini 2.5 Flash | Google DeepMind | 60.4% |
| 12 | GPT-4.1 | OpenAI | 54.6% |
| 13 | Grok-3 | xAI | 53.2% |
| 14 | o3-mini | OpenAI | 49.3% |
| 15 | DeepSeek R1 | DeepSeek | 49.2% |
| 16 | o1 | OpenAI | 48.9% |
| 17 | DeepSeek V3 | DeepSeek | 42.0% |
| 18 | Claude 3.5 Haiku | Anthropic | 40.6% |
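The summary statistics above follow directly from the table: with 18 entries, the median is the mean of the 9th and 10th ranked scores. A minimal sketch of that calculation:

```python
from statistics import median

# Scores copied from the leaderboard table, in ranked order.
scores = [80.9, 80.8, 79.6, 77.2, 72.7, 72.5, 70.3, 69.1, 68.1,
          63.8, 60.4, 54.6, 53.2, 49.3, 49.2, 48.9, 42.0, 40.6]

best = max(scores)
# Even count (18): median is the average of the two middle values,
# here (68.1 + 63.8) / 2 = 65.95.
med = median(scores)

print(f"Best Score: {best}%")           # 80.9%
print(f"Median Score: {round(med, 2)}%")  # 65.95%
```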