First autonomous coding agent
cognition.ai/blog/swe-bench-technical-report
Landmark capability demonstration from early 2024 showing autonomous AI agents can resolve real-world software engineering tasks; relevant for tracking AI capability progression and understanding benchmark design for agentic systems.
Metadata
Importance: 62/100 · blog post · primary source
Summary
Cognition AI's technical report on Devin, their autonomous coding agent, which achieved 13.86% on the SWE-bench benchmark—far exceeding the previous unassisted baseline of 1.96%. The report details their evaluation methodology for adapting SWE-bench from LLM evaluation to agent evaluation, including how Devin navigates codebases independently without file hints.
Key Points
- Devin resolved 13.86% of SWE-bench issues unassisted, vs. previous best of 1.96% unassisted and 4.80% assisted by top LLMs.
- SWE-bench tests real-world software engineering via GitHub issues/PRs with deterministic unit-test evaluation across 2,294 instances.
- Methodology adapted SWE-bench for agents: no file hints given, 45-minute runtime limit, agent navigates the repo autonomously.
- Evaluation harness resets test files post-run to prevent agents from gaming the tests, then applies the agent's patch before running tests.
- Represents a significant capability milestone for autonomous AI agents performing complex, multi-file software engineering tasks.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Agentic AI | Capability | 68.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 14 KB
March 15, 2024
# SWE-bench technical report
by The Cognition Team
In this article:
To evaluate Devin, we turn to [SWE-bench](https://www.swebench.com/), an automated benchmark for software engineering systems consisting of GitHub issues and pull requests. We believe SWE-bench is a great choice because it deterministically evaluates (via unit tests) a system’s ability to solve issues in real world codebases, unlike benchmarks like [HumanEval](https://github.com/openai/human-eval) which are limited to standalone functions.
In SWE-bench, Devin successfully resolves 13.86%\* of issues, far exceeding the previous highest unassisted baseline of 1.96%. Even when given the exact files to edit ("assisted"), the best previous model only resolves 4.80% of issues.
We provide our evaluation harness and Devin’s code edits at [https://github.com/CognitionAI/devin-swebench-results](https://github.com/CognitionAI/devin-swebench-results).
## Background
SWE-bench is a dataset of 2,294 issues and pull requests scraped from popular open source Python repositories on GitHub. Its goal is to test a system’s ability to write real-world code.
Each SWE-bench instance consists of a GitHub issue and the pull request which resolved it. The pull request must include a unit test which fails before the code change and passes after (called a “fail to pass” test). The diff is split into two parts, patch and test\_patch, which contain the code changes and the test changes, respectively.
The system being evaluated is then asked to generate a diff given the GitHub issue description and the repository (at the time of the issue). The example is considered successful if all of the unit tests pass after the generated edit is applied.
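The fail-to-pass criterion can be made concrete with a toy example (the function and test below are invented, not taken from the dataset):

```python
# Buggy code at the issue's base commit: spaces are not replaced.
def slugify(title):
    return title.lower()

# The unit test added by the pull request's test_patch.
def test_slugify_replaces_spaces():
    assert slugify("Hello World") == "hello-world"

# Against the base commit, the test fails...
try:
    test_slugify_replaces_spaces()
    base_passes = True
except AssertionError:
    base_passes = False  # this branch is taken: a genuine "fail" state

# ...and after applying the PR's code patch, the same test passes.
def slugify(title):
    return title.lower().replace(" ", "-")

test_slugify_replaces_spaces()  # no AssertionError now
```

A test that fails before the change and passes after it is what makes the evaluation deterministic: success is defined entirely by the test outcome, not by comparing diffs.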

Source: [swebench.com](http://swebench.com/)
In SWE-bench, LLMs are either given the set of correct files to edit (“assisted”); or a separate system retrieves the files to edit based on similarity to the issue text (“unassisted”). As an agent, Devin does not receive any list of files and instead navigates files on its own, which is more comparable to the “unassisted” LLM.
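The "unassisted" retrieval step can be sketched roughly as follows. The actual SWE-bench baselines use a proper retriever (BM25); the token-overlap scoring and every name below are simplified stand-ins:

```python
import re
from collections import Counter

def retrieve_files(issue_text, repo_files, k=3):
    """Toy stand-in for "unassisted" retrieval: rank repository files
    by token overlap with the issue text and return the top k paths.

    repo_files maps file paths to file contents. Illustrative only.
    """
    def tokenize(s):
        return Counter(re.findall(r"[a-z_]+", s.lower()))

    issue_tokens = tokenize(issue_text)

    def score(content):
        # Count shared tokens, capped by how often each appears in the issue.
        file_tokens = tokenize(content)
        return sum(min(issue_tokens[t], file_tokens[t]) for t in issue_tokens)

    ranked = sorted(repo_files, key=lambda p: score(repo_files[p]), reverse=True)
    return ranked[:k]
```

The contrast with Devin is that an agent skips this step entirely: it explores the repository itself (grep, reading files, running code) rather than receiving a ranked candidate list.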
Properly solving SWE-bench examples is challenging. The harder PRs require changing tens of files, maintaining backwards compatibility, and/or doing a lot of complex reasoning. Even when assisted, the best LLMs only achieve a 4.80% success rate.
## Methodology
We adapt SWE-bench to evaluate agents, a more general setting than the original eval for LLMs.
### Setup
- We run the agent end to end using a standardized prompt that asks it to edit code given only the GitHub issue description. We do not give the agent any other user input during the run.
- The repo is cloned in the agent's environment. We only keep the base commit and its ancestors in the git history to prevent information leakage to
... (truncated, 14 KB total)