Longterm Wiki

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

SWE-bench Verified is a curated subset of the SWE-bench coding benchmark, important for those evaluating the real-world software engineering capabilities of AI agents, especially as agentic systems become more prominent in safety-relevant deployment contexts.

Metadata

Importance: 52/100 | Tags: blog post, dataset

Summary

OpenAI collaborated with human software developers to audit and filter the original SWE-bench benchmark, removing problematic or ambiguous test samples to create SWE-bench Verified. This improved benchmark provides more reliable and fair evaluations of AI models' ability to solve real-world software engineering tasks. It addresses concerns that unsolvable or ambiguous samples in the original benchmark produced unreliable scores that understated true model capabilities.

Key Points

  • Original SWE-bench contained problematic test cases that could unfairly penalize correct solutions and distort scores; SWE-bench Verified filters these out via human review.
  • Human software developers were recruited to validate samples, ensuring tasks are solvable and test requirements are unambiguous.
  • The verified subset enables more meaningful comparison of AI coding agents on realistic GitHub issue-resolution tasks (a minimal loading sketch follows this list).
  • Improves benchmark integrity by removing underspecified, flawed, or untestable samples from the evaluation set.
  • Relevant for tracking progress in agentic AI coding capabilities, a key frontier for AI safety and deployment considerations.
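
For orientation, here is a minimal sketch of loading and inspecting SWE-bench Verified with the Hugging Face datasets library. The dataset ID (princeton-nlp/SWE-bench_Verified) and the field names used below are assumptions based on the public release and should be checked against the current dataset card.

```python
# Minimal sketch: inspecting SWE-bench Verified via the Hugging Face `datasets` library.
# Dataset ID and field names are assumptions; verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500 human-validated samples

sample = ds[0]
# Each sample pairs a real GitHub issue with the tests a candidate fix must satisfy.
print(sample["repo"])               # source repository, e.g. "astropy/astropy"
print(sample["problem_statement"])  # the issue text the agent must resolve
print(sample["FAIL_TO_PASS"])       # tests that must pass after the model's patch is applied
```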

Review

OpenAI's SWE-bench Verified represents a significant advance in evaluating AI models on software engineering tasks. By screening 1,699 samples with 93 professional software developers, OpenAI identified issues in the original benchmark that could systematically underestimate model capabilities: underspecified issue descriptions, overly specific or unrelated unit tests, and unreliable development environment setups.

The annotation methodology was rigorous: each sample was labeled three times across multiple criteria, including problem specification clarity, test validity, and task difficulty. This screening filtered out 68.3% of the annotated samples and produced a more robust 500-sample evaluation set. Notably, GPT-4o's resolution rate rose from 16% on the original benchmark to 33.2% on the verified subset, confirming that the original benchmark understated model capabilities.

The work highlights the importance of continuously improving AI evaluation benchmarks and of careful, nuanced assessment of AI capabilities.
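
To make the filtering step concrete, the sketch below shows one way such annotation-based screening could be implemented. This is a hypothetical illustration, not OpenAI's actual pipeline: the two severity criteria follow the blog post's description, while the worst-case aggregation across three annotators and the severity-2 rejection threshold are illustrative assumptions.

```python
# Hypothetical sketch of annotation-based filtering; not OpenAI's actual pipeline.
# Assumes each sample carries three annotator labels per criterion on a 0-3 severity
# scale; the worst-case aggregation and the 2+ rejection threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Annotation:
    underspecified: int  # 0 (well specified) .. 3 (nearly impossible to solve as written)
    unfair_tests: int    # 0 (tests accept any valid fix) .. 3 (tests reject valid fixes)

def keep_sample(annotations: list[Annotation]) -> bool:
    """Keep a sample only if no criterion reaches severity 2 under worst-case aggregation."""
    worst_spec = max(a.underspecified for a in annotations)
    worst_tests = max(a.unfair_tests for a in annotations)
    return worst_spec < 2 and worst_tests < 2

# Example: one annotator flags the unit tests as overly specific (severity 2),
# so the sample is dropped from the verified subset.
labels = [Annotation(0, 1), Annotation(1, 2), Annotation(0, 0)]
print(keep_sample(labels))  # False
```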

Cited by 4 pages

Page                          | Type       | Quality
Agentic AI                    | Capability | 68.0
Long-Horizon Autonomous Tasks | Capability | 65.0
Tool Use and Computer Use     | Capability | 67.0
Minimal Scaffolding           | Capability | 52.0
Resource ID: e1f512a932def9e2 | Stable ID: Nzg3MThkY2