SWE-bench Verified - OpenAI
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
SWE-bench Verified is a human-validated subset of the SWE-bench coding benchmark. It matters to anyone evaluating the real-world software engineering capabilities of AI agents, especially as agentic systems become more prominent in safety-relevant deployment contexts.
Summary
OpenAI collaborated with human software developers to audit and filter the original SWE-bench benchmark, removing problematic or ambiguous test samples to create SWE-bench Verified. This improved benchmark provides more reliable and fair evaluations of AI models' ability to solve real-world software engineering tasks. It addresses concerns that inflated or misleading scores on the original benchmark obscured true model capabilities.
Key Points
- Original SWE-bench contained problematic test cases that could produce misleading scores; SWE-bench Verified filters these out via human review.
- Human software developers were recruited to validate samples, ensuring tasks are solvable and test requirements are unambiguous.
- The verified subset enables more meaningful comparison of AI coding agents on realistic GitHub issue-resolution tasks.
- Improves benchmark integrity by removing underspecified, flawed, or untestable samples from the evaluation set.
- Relevant for tracking progress in agentic AI coding capabilities, a key frontier for AI safety and deployment considerations.
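The annotation-driven filtering the key points describe can be sketched as a simple predicate over human review flags. This is a minimal illustration, not OpenAI's actual pipeline; the field names (`underspecified`, `flaky_tests`) are hypothetical stand-ins for the annotators' rubric:

```python
# Hypothetical sketch of verified-subset filtering: keep only samples that
# human annotators judged well-specified and reliably testable.
# Field names below are illustrative, not SWE-bench's actual schema.

def filter_verified(samples):
    """Drop samples flagged as underspecified or as having flaky tests."""
    return [
        s for s in samples
        if not s["underspecified"] and not s["flaky_tests"]
    ]

samples = [
    {"id": "repo-1", "underspecified": False, "flaky_tests": False},
    {"id": "repo-2", "underspecified": True,  "flaky_tests": False},  # ambiguous issue text
    {"id": "repo-3", "underspecified": False, "flaky_tests": True},   # unreliable test suite
]

verified = filter_verified(samples)
print([s["id"] for s in verified])  # → ['repo-1']
```

The point of the sketch is that verification shrinks the evaluation set but makes every remaining score attributable to model capability rather than benchmark noise.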
Review
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| Agentic AI | Capability | 68.0 |
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| Tool Use and Computer Use | Capability | 67.0 |
| Minimal Scaffolding | Capability | 52.0 |