SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

paper

2023·arXiv·arxiv.org/abs/2310.06770

Authors

Carlos E. Jimenez·John Yang·Alexander Wettig·Shunyu Yao·Kexin Pei·Ofir Press·Karthik Narasimhan

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

SWE-bench is a benchmark for evaluating language model capabilities on real-world software engineering tasks, providing a systematic evaluation framework relevant to understanding AI system performance on complex, practical problem-solving.

Paper Details

Citations

281 influential

Year

2025

Methodology

peer-reviewed

Metadata

arxiv preprint · primary source

Abstract

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of $2,294$ software engineering problems drawn from real GitHub issues and corresponding pull requests across $12$ popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere $1.96$% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

Summary

SWE-bench is an evaluation framework for assessing language models' ability to resolve real-world software engineering problems. It consists of 2,294 problems drawn from GitHub issues and their corresponding pull requests across 12 popular Python repositories; a model is given the codebase and the issue description and must edit the codebase to resolve the issue. The benchmark demands complex reasoning, including multi-file coordination, long-context processing, and interaction with execution environments. The models evaluated perform poorly on this task: the best, Claude 2, resolves only 1.96% of the issues, leaving significant room for improvement toward more practical and autonomous AI systems.

Cited by 2 pages

Page	Type	Quality
Autonomous Coding	Capability	63.0
Heavy Scaffolding / Agentic Systems	Concept	57.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 98 KB

[2310.06770] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? 

 SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

 
 
Carlos E. Jimenez* (1,2), John Yang* (1,2), Alexander Wettig (1,2), Shunyu Yao (1,2), Kexin Pei (3), Ofir Press (1,2), Karthik Narasimhan (1,2)

(1) Princeton University, (2) Princeton Language and Intelligence, (3) University of Chicago
 
 

 
 Abstract

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities.
We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models.
We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories.
Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue.
Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation.
Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues.
Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever.
Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

Data, code, and leaderboard are available at https://www.swebench.com.

 
* Equal contribution. Correspondence to {carlosej, jy1682}@princeton.edu.
 
 
 1 Introduction

 
Language models (LMs) are rapidly being deployed in commercial products such as chatbots and coding assistants.
At the same time, existing benchmarks have become saturated (Kiela et al., 2021; Ott et al., 2022) and fail to capture the frontier of what state-of-the-art LMs can and cannot do.
There is a need for challenging benchmarks that more accurately reflect real-world applications of LMs to help shape their future development and usage (Srivastava et al., 2023).

 
 
Figure 1: SWE-bench sources task instances from real-world Python repositories by connecting GitHub issues to merged pull request solutions that resolve related tests.
Provided with the issue text and a codebase snapshot, models generate a patch that is evaluated against real tests.
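
The evaluation loop described in the caption can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `TaskInstance` fields, the `is_resolved` criterion (every target test passes after the patch is applied), and the demo data are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical record for one benchmark problem (field names are illustrative)."""
    instance_id: str
    issue_text: str
    target_tests: tuple[str, ...]  # tests the merged pull request is expected to make pass


def is_resolved(instance: TaskInstance, passed_after_patch: set[str]) -> bool:
    """Assumed criterion: every target test passes once the model's patch is applied."""
    return all(test in passed_after_patch for test in instance.target_tests)


def resolve_rate(outcomes: list[bool]) -> float:
    """Percentage of instances resolved; 0.0 for an empty run."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


instances = [
    TaskInstance("demo-1", "Crash on empty input", ("test_empty", "test_basic")),
    TaskInstance("demo-2", "Wrong default value", ("test_default",)),
]
# Made-up test outcomes after applying a model-generated patch to each snapshot.
passed = {"demo-1": {"test_empty", "test_basic"}, "demo-2": set()}

outcomes = [is_resolved(inst, passed[inst.instance_id]) for inst in instances]
print(outcomes, resolve_rate(outcomes))
```

Scoring on test outcomes rather than on textual similarity to the reference pull request means any patch that makes the target tests pass counts, which is one way to read the caption's phrase "evaluated against real tests".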
 
 
 
Building a good benchmark is difficult since tasks must be challenging enough to stump existing models, but model predictions must also be easy to verify (Martínez-Plumed et al., 2021).
Coding tasks are ap

... (truncated, 98 KB total)

Resource ID: 3e4a5dea3aec490f | Stable ID: sid_e92L8uQx73