Longterm Wiki

Evaluating Large Language Models Trained on Code

paper

Authors

Mark Chen·Jerry Tworek·Heewoo Jun·Qiming Yuan·Henrique Ponde de Oliveira Pinto·Jared Kaplan·Harri Edwards·Yuri Burda·Nicholas Joseph·Greg Brockman·Alex Ray·Raul Puri·Gretchen Krueger·Michael Petrov·Heidy Khlaaf·Girish Sastry·Pamela Mishkin·Brooke Chan·Scott Gray·Nick Ryder·Mikhail Pavlov·Alethea Power·Lukasz Kaiser·Mohammad Bavarian·Clemens Winter·Philippe Tillet·Felipe Petroski Such·Dave Cummings·Matthias Plappert·Fotios Chantzis·Elizabeth Barnes·Ariel Herbert-Voss·William Hebgen Guss·Alex Nichol·Alex Paino·Nikolas Tezak·Jie Tang·Igor Babuschkin·Suchir Balaji·Shantanu Jain·William Saunders·Christopher Hesse·Andrew N. Carr·Jan Leike·Josh Achiam·Vedant Misra·Evan Morikawa·Alec Radford·Matthew Knight·Miles Brundage·Mira Murati·Katie Mayer·Peter Welinder·Bob McGrew·Dario Amodei·Sam McCandlish·Ilya Sutskever·Wojciech Zaremba

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

The Codex paper is foundational for understanding AI coding capabilities; it introduced HumanEval (now a standard benchmark) and is directly relevant to AI safety discussions around automated code generation, self-improvement risks, and economic displacement of knowledge workers.

Metadata

Importance: 72/100 · arXiv preprint · Primary source

Abstract

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

Summary

This paper introduces Codex, a GPT-based model fine-tuned on publicly available code from GitHub, and evaluates it on HumanEval, a new benchmark for measuring functional correctness in code generation. Codex powers GitHub Copilot and demonstrates significant capability in generating Python solutions to programming problems. The paper also discusses safety considerations including misuse potential and economic impacts on software developers.
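HumanEval scores candidates by functional correctness: a completion counts as solved only if it passes the problem's unit tests. The sketch below illustrates that idea in simplified form; the real HumanEval harness runs candidates in a sandboxed subprocess with timeouts, and the example problem and helper name here are hypothetical.

```python
def check_candidate(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Run a model-generated candidate against unit tests.

    Simplified illustration only: the actual HumanEval evaluation
    sandboxes execution and enforces per-problem timeouts.
    """
    env: dict = {}
    try:
        exec(candidate_src, env)        # define the candidate function
        exec(test_src, env)             # defines check(fn) with assertions
        env["check"](env[entry_point])  # raises AssertionError on failure
        return True
    except Exception:
        return False

# Hypothetical HumanEval-style problem: implement add(a, b)
candidate = "def add(a, b):\n    return a + b\n"
tests = "def check(fn):\n    assert fn(2, 3) == 5\n    assert fn(-1, 1) == 0\n"
print(check_candidate(candidate, tests, "add"))  # True
```

Because correctness is binary per test suite, text-similarity metrics like BLEU play no role: a completion either passes or it does not.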

Key Points

  • Introduces HumanEval benchmark: 164 hand-crafted Python programming problems evaluated by functional correctness rather than text similarity
  • Codex (12B parameter GPT model fine-tuned on code) solves 28.8% of HumanEval problems; with repeated sampling (pass@100) reaches 70.2%
  • Demonstrates that code-generating models raise novel safety concerns including over-reliance, generation of insecure code, and potential for misuse
  • Discusses economic impacts: potential displacement of software developer tasks and dual-use implications for automating malicious code
  • Establishes a foundation for understanding AI coding capabilities, with direct relevance to AI-assisted scientific research and recursive self-improvement concerns
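The repeated-sampling numbers above are reported via the pass@k metric. The paper gives an unbiased estimator: generate n samples per problem, count the c that pass the tests, and compute the probability that at least one of k randomly drawn samples is correct. A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper:
    pass@k = 1 - C(n - c, k) / C(n, k),
    the probability that at least one of k samples drawn
    (without replacement) from n total samples is correct,
    given c correct samples.
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(100, 30, 1))  # 0.3: with k=1, pass@1 is just the pass rate
```

The naive alternative of running k samples once and checking whether any passes is a high-variance estimate; drawing k from a larger pool of n and computing the closed form above reduces variance without extra generation cost.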

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| Autonomous Coding | Capability | 63.0 |
| Large Language Models | Capability | 60.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Evaluating Large Language Models Trained on Code

(author list identical to the one above)

###### Abstract

(abstract identical to the one above)

Machine Learning, ICML

## 1 Introduction

Scalable sequence prediction models (Graves, [2014](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib45 ""); Vaswani et al., [2017](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib114 ""); Child et al., [2019](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib25 "")) have become a general-purpose method for generation and representation learning in many domains, including natural language processing (Mikolov et al., [2013](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib79 ""); Sutskever et al., [2014](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib109 ""); Dai & Le, [2015](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib31 ""); Peters et al., [2018](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib88 ""); Radford et al., [2018](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib93 ""); Devlin et al., [2018](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib36 "")), computer vision (Van Oord et al., [2016](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib113 ""); Menick & Kalchbrenner, [2018](https://ar

... (truncated, 98 KB total)