Longterm Wiki

First autonomous coding agent


Landmark capability demonstration from early 2024 showing that autonomous AI agents can resolve real-world software engineering tasks; relevant for tracking AI capability progression and understanding benchmark design for agentic systems.

Metadata

Importance: 62/100 · blog post · primary source

Summary

Cognition AI's technical report on Devin, their autonomous coding agent, which achieved 13.86% on the SWE-bench benchmark—far exceeding the previous unassisted baseline of 1.96%. The report details their evaluation methodology for adapting SWE-bench from LLM evaluation to agent evaluation, including how Devin navigates codebases independently without file hints.

Key Points

  • Devin resolved 13.86% of SWE-bench issues unassisted, vs. previous best of 1.96% unassisted and 4.80% assisted by top LLMs.
  • SWE-bench tests real-world software engineering via GitHub issues/PRs with deterministic unit test evaluation across 2,294 instances.
  • Methodology adapted SWE-bench for agents: no file hints given, 45-minute runtime limit, agent navigates repo autonomously.
  • Evaluation harness resets test files post-run to prevent agents from gaming tests, then applies agent's patch before running tests.
  • Represents a significant capability milestone for autonomous AI agents performing complex, multi-file software engineering tasks.

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Agentic AI | Capability | 68.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 14 KB
[Blog](https://cognition.ai/blog/1)/ [Research](https://cognition.ai/blog/Research/1)

March 15, 2024

# SWE-bench technical report

by The Cognition Team


To evaluate Devin, we turn to [SWE-bench](https://www.swebench.com/), an automated benchmark for software engineering systems consisting of GitHub issues and pull requests. We believe SWE-bench is a great choice because it deterministically evaluates (via unit tests) a system’s ability to solve issues in real world codebases, unlike benchmarks like [HumanEval](https://github.com/openai/human-eval) which are limited to standalone functions.

In SWE-bench, Devin successfully resolves 13.86%\* of issues, far exceeding the previous highest unassisted baseline of 1.96%. Even when given the exact files to edit ("assisted"), the best previous model only resolves 4.80% of issues.

We provide our evaluation harness and Devin’s code edits at [https://github.com/CognitionAI/devin-swebench-results](https://github.com/CognitionAI/devin-swebench-results).

## Background

SWE-bench is a dataset of 2,294 issues and pull requests scraped from popular open source Python repositories on GitHub. Its goal is to test a system’s ability to write real-world code.

Each SWE-bench instance consists of a GitHub issue and the pull request which resolved it. The pull request must include a unit test which fails before the code change and passes after (called a “fail to pass” test). The diff is split into two parts, patch and test\_patch, which contain the code changes and the test changes, respectively.
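Concretely, a single instance can be pictured as a record like the following. This is a hedged sketch: the field names follow the dataset's published schema, but every value below is an invented placeholder, not a real instance.

```python
# A sketch of one SWE-bench instance as a plain dict. Field names
# follow the dataset's published schema; the values are invented
# placeholders for illustration only.
instance = {
    "instance_id": "example__repo-1234",       # hypothetical identifier
    "repo": "example/repo",                    # hypothetical repository
    "base_commit": "abc123",                   # repo state when the issue was filed
    "problem_statement": "Bug: ...",           # the GitHub issue text
    "patch": "diff --git a/src/mod.py ...",    # gold code changes (held out)
    "test_patch": "diff --git a/tests/ ...",   # adds the fail-to-pass test(s)
}

# The two diff halves play different roles: `patch` is what the system
# must reproduce functionally, while `test_patch` supplies the unit
# tests used to judge success.
for key in ("patch", "test_patch"):
    assert instance[key].startswith("diff --git")
```

Keeping the code diff and the test diff separate is what lets the harness hand the system only the issue and repository while reserving the tests for grading.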

The system being evaluated is then asked to generate a diff given only the GitHub issue description and the repository (at its state when the issue was filed). The instance is counted as resolved if all of the unit tests pass after the system's diff is applied.

![](https://cdn.sanity.io/images/2mc9cv2v/production/73a35aaa9d9ab3f0cd1469ff5cf0e83146c354f7-1600x376.png)

Source: [swebench.com](http://swebench.com/)

In SWE-bench, LLMs are either given the set of correct files to edit (“assisted”); or a separate system retrieves the files to edit based on similarity to the issue text (“unassisted”). As an agent, Devin does not receive any list of files and instead navigates files on its own, which is more comparable to the “unassisted” LLM.
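The "unassisted" retrieval step can be approximated with a simple lexical similarity score. The published baseline uses BM25 retrieval; the cruder token-overlap sketch below just conveys the idea, and the file names and contents are invented.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z_]+", text.lower())

def rank_files(issue_text, files, top_k=2):
    """Rank repository files by lexical overlap with the issue text.
    A crude stand-in for the BM25 retriever used by the 'unassisted'
    LLM baseline; `files` maps paths to contents."""
    issue_tokens = Counter(tokenize(issue_text))
    def score(contents):
        # Sum of clipped term-frequency overlaps with the issue text.
        return sum(min(count, issue_tokens[tok])
                   for tok, count in Counter(tokenize(contents)).items())
    return sorted(files, key=lambda f: score(files[f]), reverse=True)[:top_k]

files = {
    "auth/login.py": "def login(user, password): check password hash",
    "utils/dates.py": "def parse_date(s): return datetime",
}
issue = "Login fails when the password contains unicode"
print(rank_files(issue, files, top_k=1))  # → ['auth/login.py']
```

An agent like Devin skips this step entirely: instead of being handed a ranked file list, it browses the repository itself, which is why its runs are scored against the unassisted baseline.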

Properly solving SWE-bench examples is challenging. The harder PRs require changing tens of files, maintaining backwards compatibility, and/or doing a lot of complex reasoning. Even when assisted, the best LLMs only achieve a 4.80% success rate.

## Methodology

We adapt SWE-bench to evaluate agents, a more general setting than the original eval for LLMs.

### Setup

- We run the agent end to end using a standardized prompt that asks it to edit code given only the GitHub issue description. We do not give the agent any other user input during the run.
- The repo is cloned in the agent's environment. We only keep the base commit and its ancestors in the git history to prevent information leakage to 

... (truncated, 14 KB total)
Resource ID: 58108015c409775a | Stable ID: ZjdlMDk2NW