Longterm Wiki

SWE-bench Verified

Coding

A curated subset of SWE-bench with human-verified task instances for evaluating AI systems on real-world software engineering tasks from GitHub issues.

- Models tested: 18
- Best score: 80.9%
- Median score: 65.95%
- Scoring: percentage
- Introduced: 2024-08
- Maintainer: OpenAI / Princeton NLP

Leaderboard (18 models)

| # | Model | Developer | Score |
|---|-------|-----------|-------|
| 🥇 | Claude Opus 4.5 | Anthropic | 80.9% |
| 🥈 | Claude Opus 4.6 | Anthropic | 80.8% |
| 🥉 | Claude Sonnet 4.6 | Anthropic | 79.6% |
| 4 | Claude Sonnet 4.5 | Anthropic | 77.2% |
| 5 | Claude Sonnet 4 | Anthropic | 72.7% |
| 6 | Claude Opus 4 | Anthropic | 72.5% |
| 7 | Claude 3.7 Sonnet | Anthropic | 70.3% |
| 8 | o3 | OpenAI | 69.1% |
| 9 | o4-mini | OpenAI | 68.1% |
| 10 | Gemini 2.5 Pro | Google DeepMind | 63.8% |
| 11 | Gemini 2.5 Flash | Google DeepMind | 60.4% |
| 12 | GPT-4.1 | OpenAI | 54.6% |
| 13 | Grok-3 | xAI | 53.2% |
| 14 | o3-mini | OpenAI | 49.3% |
| 15 | DeepSeek R1 | DeepSeek | 49.2% |
| 16 | o1 | OpenAI | 48.9% |
| 17 | DeepSeek V3 | DeepSeek | 42% |
| 18 | Claude 3.5 Haiku | Anthropic | 40.6% |
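As a sanity check, the summary statistics above can be reproduced from the leaderboard scores. A minimal Python sketch (the score list is transcribed from the table; the median of 18 values is the mean of the 9th and 10th entries):

```python
import statistics

# SWE-bench Verified scores (%) transcribed from the leaderboard above
scores = [80.9, 80.8, 79.6, 77.2, 72.7, 72.5, 70.3, 69.1, 68.1,
          63.8, 60.4, 54.6, 53.2, 49.3, 49.2, 48.9, 42.0, 40.6]

best = max(scores)
# With an even count (18), the median averages the two middle values:
# (68.1 + 63.8) / 2 = 65.95
median = round(statistics.median(scores), 2)

print(f"Models tested: {len(scores)}")  # 18
print(f"Best score: {best}%")           # 80.9%
print(f"Median score: {median}%")       # 65.95%
```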