Longterm Wiki

First autonomous coding agent


Landmark capability demonstration from early 2024 showing that autonomous AI agents can resolve real-world software engineering tasks; relevant for tracking AI capability progression and understanding benchmark design for agentic systems.

Metadata

Importance: 62/100 · blog post · primary source

Summary

Cognition AI's technical report on Devin, their autonomous coding agent, which achieved 13.86% on the SWE-bench benchmark—far exceeding the previous unassisted baseline of 1.96%. The report details their evaluation methodology for adapting SWE-bench from LLM evaluation to agent evaluation, including how Devin navigates codebases independently without file hints.

Key Points

  • Devin resolved 13.86% of SWE-bench issues unassisted, vs. previous best of 1.96% unassisted and 4.80% assisted by top LLMs.
  • SWE-bench tests real-world software engineering via GitHub issues/PRs with deterministic unit test evaluation across 2,294 instances.
  • Methodology adapted SWE-bench for agents: no file hints given, 45-minute runtime limit, agent navigates repo autonomously.
  • Evaluation harness resets test files post-run to prevent agents from gaming tests, then applies agent's patch before running tests.
  • Represents a significant capability milestone for autonomous AI agents performing complex, multi-file software engineering tasks.

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Agentic AI | Capability | 68.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 14 KB
[Blog](https://cognition.ai/blog/1)/ [Research](https://cognition.ai/blog/Research/1)

March 15, 2024

# SWE-bench technical report

by The Cognition Team


To evaluate Devin, we turn to [SWE-bench](https://www.swebench.com/), an automated benchmark for software engineering systems consisting of GitHub issues and pull requests. We believe SWE-bench is a great choice because it deterministically evaluates (via unit tests) a system’s ability to solve issues in real world codebases, unlike benchmarks like [HumanEval](https://github.com/openai/human-eval) which are limited to standalone functions.

In SWE-bench, Devin successfully resolves 13.86%\* of issues, far exceeding the previous highest unassisted baseline of 1.96%. Even when given the exact files to edit ("assisted"), the best previous model only resolves 4.80% of issues.

We provide our evaluation harness and Devin’s code edits at [https://github.com/CognitionAI/devin-swebench-results](https://github.com/CognitionAI/devin-swebench-results).

## Background

SWE-bench is a dataset of 2,294 issues and pull requests scraped from popular open source Python repositories on GitHub. Its goal is to test a system’s ability to write real-world code.

Each SWE-bench instance consists of a GitHub issue and the pull request which resolved it. The pull request must include a unit test which fails before the code change and passes after (called a “fail to pass” test). The diff is split into two parts, patch and test\_patch, which contain the code changes and the test changes, respectively.
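Concretely, a single instance can be pictured as a record like the following. This is a hedged sketch: the field names follow the dataset's published schema, but every value below is an invented placeholder, not a real instance.

```python
# A sketch of one SWE-bench instance as a plain dict. Field names
# follow the dataset's published schema; the values are invented
# placeholders for illustration only.
instance = {
    "instance_id": "example__repo-1234",       # hypothetical identifier
    "repo": "example/repo",                    # hypothetical repository
    "base_commit": "abc123",                   # repo state when the issue was filed
    "problem_statement": "Bug: ...",           # the GitHub issue text
    "patch": "diff --git a/src/mod.py ...",    # gold code changes (held out)
    "test_patch": "diff --git a/tests/ ...",   # adds the fail-to-pass test(s)
}

# The two diff halves play different roles: `patch` is what the system
# must reproduce functionally, while `test_patch` supplies the unit
# tests used to judge success.
for key in ("patch", "test_patch"):
    assert instance[key].startswith("diff --git")
```

Keeping the code diff and the test diff separate is what lets the harness hand the system only the issue and repository while reserving the tests for grading.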

The system being evaluated is then asked to generate a diff given only the GitHub issue description and the repository (at its state when the issue was filed). The instance is counted as resolved if all of the unit tests pass after the system's diff is applied.

![](https://cdn.sanity.io/images/2mc9cv2v/production/73a35aaa9d9ab3f0cd1469ff5cf0e83146c354f7-1600x376.png)

Source: [swebench.com](http://swebench.com/)

In SWE-bench, LLMs are either given the set of correct files to edit (“assisted”); or a separate system retrieves the files to edit based on similarity to the issue text (“unassisted”). As an agent, Devin does not receive any list of files and instead navigates files on its own, which is more comparable to the “unassisted” LLM.
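The "unassisted" retrieval step can be approximated with a simple lexical similarity score. The published baseline uses BM25 retrieval; the cruder token-overlap sketch below just conveys the idea, and the file names and contents are invented.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z_]+", text.lower())

def rank_files(issue_text, files, top_k=2):
    """Rank repository files by lexical overlap with the issue text.
    A crude stand-in for the BM25 retriever used by the 'unassisted'
    LLM baseline; `files` maps paths to contents."""
    issue_tokens = Counter(tokenize(issue_text))
    def score(contents):
        # Sum of clipped term-frequency overlaps with the issue text.
        return sum(min(count, issue_tokens[tok])
                   for tok, count in Counter(tokenize(contents)).items())
    return sorted(files, key=lambda f: score(files[f]), reverse=True)[:top_k]

files = {
    "auth/login.py": "def login(user, password): check password hash",
    "utils/dates.py": "def parse_date(s): return datetime",
}
issue = "Login fails when the password contains unicode"
print(rank_files(issue, files, top_k=1))  # → ['auth/login.py']
```

An agent like Devin skips this step entirely: instead of being handed a ranked file list, it browses the repository itself, which is why its runs are scored against the unassisted baseline.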

Properly solving SWE-bench examples is challenging. The harder PRs require changing tens of files, maintaining backwards compatibility, and/or doing a lot of complex reasoning. Even when assisted, the best LLMs only achieve a 4.80% success rate.

## Methodology

We adapt SWE-bench to evaluate agents, a more general setting than the original eval for LLMs.

### Setup

- We run the agent end to end using a standardized prompt that asks it to edit code given only the GitHub issue description. We do not give the agent any other user input during the run.
- The repo is cloned in the agent's environment. We only keep the base commit and its ancestors in the git history to prevent information leakage to 

... (truncated, 14 KB total)
Resource ID: 58108015c409775a | Stable ID: ZjdlMDk2NW