Longterm Wiki

SWE-bench Verified

Coding
A curated subset of SWE-bench with human-verified task instances for evaluating AI systems on real-world software engineering tasks from GitHub issues.
Models Tested: 23
Best Score: 80.9%
Median Score: 68.1%
Scoring: percentage
Introduced: 2024-08
Maintainer: OpenAI / Princeton NLP

Leaderboard (23 models)

# | Model | Developer | Score
🥇 | Claude Opus 4.5 | Anthropic | 80.9%
🥈 | Claude Opus 4.6 | Anthropic | 80.8%
🥉 | Gemini | Google DeepMind | 80.6%
4 | Claude Sonnet 4.6 | Anthropic | 79.6%
5 | Claude Sonnet 4.5 | Anthropic | 77.2%
6 | Claude Opus 4.1 | Anthropic | 74.5%
7 | Claude Haiku 4.5 | Anthropic | 73.3%
8 | Claude Sonnet 4 | Anthropic | 72.7%
9 | Claude Opus 4 | Anthropic | 72.5%
10 | Claude 3.7 Sonnet | Anthropic | 70.3%
11 | o3 | OpenAI | 69.1%
12 | o4-mini | OpenAI | 68.1%
13 | Gemini 2.5 Pro | Google DeepMind | 63.8%
14 | Gemini 2.5 Flash | Google DeepMind | 60.4%
15 | GPT-4.1 | OpenAI | 54.6%
16 | Grok-3 | xAI | 53.2%
17 | o3-mini | OpenAI | 49.3%
18 | DeepSeek R1 | DeepSeek | 49.2%
19 | Claude 3.5 Sonnet | Anthropic | 49%
20 | Claude | Anthropic | 49%
21 | o1 | OpenAI | 48.9%
22 | DeepSeek V3 | DeepSeek | 42%
23 | Claude 3.5 Haiku | Anthropic | 40.6%
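
The summary figures at the top of the page (models tested, best score, median score) can be reproduced from the leaderboard above. The following is a minimal sketch, assuming the page uses the ordinary statistical median over all listed scores; the variable names are illustrative and not taken from the site.

```python
from statistics import median

# Scores (%) copied from the leaderboard above, in rank order.
# Assumption: entries shown without a "%" sign are also percentages.
scores = [
    80.9, 80.8, 80.6, 79.6, 77.2, 74.5, 73.3, 72.7,
    72.5, 70.3, 69.1, 68.1, 63.8, 60.4, 54.6, 53.2,
    49.3, 49.2, 49.0, 49.0, 48.9, 42.0, 40.6,
]

models_tested = len(scores)     # number of leaderboard rows
best_score = max(scores)        # top-ranked score
median_score = median(scores)   # middle value (12th of 23 when sorted)

print(f"Models tested: {models_tested}")
print(f"Best score: {best_score}%")
print(f"Median score: {median_score}%")
```

With an odd number of entries, the median is simply the 12th-ranked score, which corresponds to the o4-mini row.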