SWE-bench Pro Leaderboard - Scale AI

web

scale.com·scale.com/leaderboard/swe_bench_pro_public

Useful for tracking the state of AI coding agent capabilities; relevant to discussions of AI autonomy, capability evaluations, and the pace of progress toward AI systems that can perform complex software engineering tasks independently.

Metadata

Importance: 45/100tool pagereference

Summary

SWE-bench Pro is a rigorous benchmark by Scale AI that evaluates AI agents on real-world software engineering tasks drawn from both public and private repositories. It addresses limitations of existing benchmarks by emphasizing realistic, challenging problem-solving scenarios. The leaderboard tracks and compares performance of leading AI coding agents.

Key Points

•Evaluates AI agents on software engineering tasks sourced from public and private repositories, reducing benchmark contamination risks
•Designed to address limitations of prior coding benchmarks by focusing on realistic, difficult problem-solving scenarios
•Provides a public leaderboard ranking AI agent performance on software engineering tasks
•Relevant for assessing current AI coding capabilities and tracking progress toward autonomous software development
•Produced by Scale AI, a major player in AI data labeling and evaluation infrastructure

Review

SWE-Bench Pro represents a significant advancement in AI agent evaluation for software engineering tasks. By addressing critical limitations in existing benchmarks, such as data contamination, limited task diversity, and oversimplified problems, the benchmark offers a more authentic assessment of AI problem-solving capabilities. The methodology involves a sophisticated four-stage workflow that carefully sources, creates, and augments software engineering challenges from diverse repositories. The benchmark's key innovation lies in its rigorous design, which includes three distinct dataset subsets: a public set, a commercial set, and a held-out set. This approach allows for comprehensive testing across different coding environments and provides a more nuanced understanding of AI agents' generalization abilities. The results are striking, with top models like OpenAI GPT-5 and Claude Opus 4.1 scoring only around 23% on the public dataset, compared to 70%+ on previous benchmarks. This dramatic performance drop highlights the benchmark's increased complexity and its potential to drive meaningful improvements in AI software engineering capabilities.

Cited by 3 pages

Page	Type	Quality
Autonomous Coding	Capability	63.0
Long-Horizon Autonomous Tasks	Capability	65.0
Tool Use and Computer Use	Capability	67.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202611 KB

SWE-Bench Pro Leaderboard AI Coding Benchmark (Public Dataset) | Scale Leaderboard 2025 Scale AI. All rights reserved. BACK SWE-Bench Pro (Public Dataset)

 Evaluating challenging long-horizon software engineering tasks in public open source repositories

 SWE-Bench Pro

 SWE-Bench Pro is a benchmark designed to provide a rigorous and realistic evaluation of AI agents for software engineering. It was developed to address several limitations in existing benchmarks by tackling four key challenges:

 Data Contamination : Models have likely seen the evaluation code during training, making it hard to know if they are problem-solving or recalling a memorized solution.

 Limited Task Diversity : Many benchmarks fail to capture the full spectrum of real-world software challenges and instead focus on simple utility libraries.

 Oversimplified Problems : Ambiguous or underspecified issues are often removed from benchmarks, which doesn&#x27;t reflect a real developer&#x27;s workflow.

 Unreliable and Irreproducible Testing : Inconsistent setups make it difficult to know if a solution truly works or if the environment is just configured incorrectly.

 SWE-Bench Pro addresses these gaps by sourcing tasks from diverse and complex codebases, including consumer applications, B2B services, and developer tools. To reduce contamination risk, the public and held-out OSS subsets use strong copyleft licenses (e.g., GPL). The private subset consists of proprietary codebases from startup partners. 

 The benchmark is significantly more challenging than its predecessors; top models score around 23% on the SWE-Bench Pro public set, compared to 70%+ on SWE-Bench Verified. This provides a more accurate measure of an agent’s true problem-solving capabilities in environments that mirror professional software development.

 Read the paper here: https://scale.com/research/swe_bench_pro 

 Check out the GitHub and view trajectories here: https://docent.transluce.org/dashboard/032fb63d-4992-4bfc-911d-3b7dafcb931f 

 Methodology

 Each problem in SWE-Bench Pro is created using a four-stage workflow:

 Sourcing : Repositories are selected from a curated set of public and private repositories

 Environment Creation : Professional engineers build reproducible Docker-based environments, integrating all dependencies and build tools to ensure the codebase and tests run out-of-the-box.

 Harvesting : Problems are extracted via commit scraping. Pairs of consecutive commits are retained if they (a) fix a bug or introduce a feature, (b) demonstrate a fail-to-pass transition for new tests, and (c) include pass-to-pass tests confirming unrelated functionality remains intact.

 Augmentation : Human experts organize unstructured commits and issue metadata into two artifacts: a problem statement and a requirements brief with an optional interface. These provide sufficient context to reproduce the gold patch without prescribing an implementation. We employ three human-in-the-loop checkpoints: (1)

... (truncated, 11 KB total)

Resource ID: 9dbe484d48b6787a | Stable ID: sid_T1XNpba20c