Longterm Wiki

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

SWE-bench Verified is a curated subset of the SWE-bench coding benchmark, important for those evaluating the real-world software engineering capabilities of AI agents, especially as agentic systems become more prominent in safety-relevant deployment contexts.

Metadata

Importance: 52/100 | Tags: blog post, dataset

Summary

OpenAI collaborated with human software developers to audit and filter the original SWE-bench benchmark, removing problematic or ambiguous test samples to create SWE-bench Verified. This improved benchmark provides more reliable and fair evaluations of AI models' ability to solve real-world software engineering tasks. It addresses concerns that unsolvable or ambiguous samples in the original benchmark produced unreliable scores that understated true model capabilities.

Key Points

  • Original SWE-bench contained problematic test cases that could unfairly penalize correct solutions and distort scores; SWE-bench Verified filters these out via human review.
  • Human software developers were recruited to validate samples, ensuring tasks are solvable and test requirements are unambiguous.
  • The verified subset enables more meaningful comparison of AI coding agents on realistic GitHub issue-resolution tasks (a minimal loading sketch follows this list).
  • Improves benchmark integrity by removing underspecified, flawed, or untestable samples from the evaluation set.
  • Relevant for tracking progress in agentic AI coding capabilities, a key frontier for AI safety and deployment considerations.
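
For orientation, here is a minimal sketch of loading and inspecting SWE-bench Verified with the Hugging Face datasets library. The dataset ID (princeton-nlp/SWE-bench_Verified) and the field names used below are assumptions based on the public release and should be checked against the current dataset card.

```python
# Minimal sketch: inspecting SWE-bench Verified via the Hugging Face `datasets` library.
# Dataset ID and field names are assumptions; verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500 human-validated samples

sample = ds[0]
# Each sample pairs a real GitHub issue with the tests a candidate fix must satisfy.
print(sample["repo"])               # source repository, e.g. "astropy/astropy"
print(sample["problem_statement"])  # the issue text the agent must resolve
print(sample["FAIL_TO_PASS"])       # tests that must pass after the model's patch is applied
```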

Review

OpenAI's SWE-bench Verified represents a significant advance in evaluating AI models on software engineering tasks. By screening 1,699 samples with 93 professional software developers, OpenAI identified issues in the original benchmark that could systematically underestimate model capabilities: underspecified issue descriptions, overly specific or unrelated unit tests, and unreliable development environment setups.

The annotation methodology was rigorous: each sample was labeled three times across multiple criteria, including problem specification clarity, test validity, and task difficulty. This screening filtered out 68.3% of the annotated samples and produced a more robust 500-sample evaluation set. Notably, GPT-4o's resolution rate rose from 16% on the original benchmark to 33.2% on the verified subset, confirming that the original benchmark understated model capabilities.

The work highlights the importance of continuously improving AI evaluation benchmarks and of careful, nuanced assessment of AI capabilities.
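
To make the filtering step concrete, the sketch below shows one way such annotation-based screening could be implemented. This is a hypothetical illustration, not OpenAI's actual pipeline: the two severity criteria follow the blog post's description, while the worst-case aggregation across three annotators and the severity-2 rejection threshold are illustrative assumptions.

```python
# Hypothetical sketch of annotation-based filtering; not OpenAI's actual pipeline.
# Assumes each sample carries three annotator labels per criterion on a 0-3 severity
# scale; the worst-case aggregation and the 2+ rejection threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Annotation:
    underspecified: int  # 0 (well specified) .. 3 (nearly impossible to solve as written)
    unfair_tests: int    # 0 (tests accept any valid fix) .. 3 (tests reject valid fixes)

def keep_sample(annotations: list[Annotation]) -> bool:
    """Keep a sample only if no criterion reaches severity 2 under worst-case aggregation."""
    worst_spec = max(a.underspecified for a in annotations)
    worst_tests = max(a.unfair_tests for a in annotations)
    return worst_spec < 2 and worst_tests < 2

# Example: one annotator flags the unit tests as overly specific (severity 2),
# so the sample is dropped from the verified subset.
labels = [Annotation(0, 1), Annotation(1, 2), Annotation(0, 0)]
print(keep_sample(labels))  # False
```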

Cited by 4 pages

Page                          | Type       | Quality
Agentic AI                    | Capability | 68.0
Long-Horizon Autonomous Tasks | Capability | 65.0
Tool Use and Computer Use     | Capability | 67.0
Minimal Scaffolding           | Capability | 52.0
Resource ID: e1f512a932def9e2 | Stable ID: Nzg3MThkY2