Longterm Wiki

Evaluating Large Language Models Trained on Code

paper

Authors

Mark Chen·Jerry Tworek·Heewoo Jun·Qiming Yuan·Henrique Ponde de Oliveira Pinto·Jared Kaplan·Harri Edwards·Yuri Burda·Nicholas Joseph·Greg Brockman·Alex Ray·Raul Puri·Gretchen Krueger·Michael Petrov·Heidy Khlaaf·Girish Sastry·Pamela Mishkin·Brooke Chan·Scott Gray·Nick Ryder·Mikhail Pavlov·Alethea Power·Lukasz Kaiser·Mohammad Bavarian·Clemens Winter·Philippe Tillet·Felipe Petroski Such·Dave Cummings·Matthias Plappert·Fotios Chantzis·Elizabeth Barnes·Ariel Herbert-Voss·William Hebgen Guss·Alex Nichol·Alex Paino·Nikolas Tezak·Jie Tang·Igor Babuschkin·Suchir Balaji·Shantanu Jain·William Saunders·Christopher Hesse·Andrew N. Carr·Jan Leike·Josh Achiam·Vedant Misra·Evan Morikawa·Alec Radford·Matthew Knight·Miles Brundage·Mira Murati·Katie Mayer·Peter Welinder·Bob McGrew·Dario Amodei·Sam McCandlish·Ilya Sutskever·Wojciech Zaremba

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

The Codex paper is foundational for understanding AI coding capabilities; it introduced HumanEval (now a standard benchmark) and is directly relevant to AI safety discussions around automated code generation, self-improvement risks, and economic displacement of knowledge workers.

Metadata

Importance: 72/100 · arXiv preprint · Primary source

Abstract

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

Summary

This paper introduces Codex, a GPT-based model fine-tuned on publicly available code from GitHub, and evaluates it on HumanEval, a new benchmark for measuring functional correctness in code generation. Codex powers GitHub Copilot and demonstrates significant capability in generating Python solutions to programming problems. The paper also discusses safety considerations including misuse potential and economic impacts on software developers.
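HumanEval scores candidates by functional correctness: a completion counts as solved only if it passes the problem's unit tests. The sketch below illustrates that idea in simplified form; the real HumanEval harness runs candidates in a sandboxed subprocess with timeouts, and the example problem and helper name here are hypothetical.

```python
def check_candidate(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Run a model-generated candidate against unit tests.

    Simplified illustration only: the actual HumanEval evaluation
    sandboxes execution and enforces per-problem timeouts.
    """
    env: dict = {}
    try:
        exec(candidate_src, env)        # define the candidate function
        exec(test_src, env)             # defines check(fn) with assertions
        env["check"](env[entry_point])  # raises AssertionError on failure
        return True
    except Exception:
        return False

# Hypothetical HumanEval-style problem: implement add(a, b)
candidate = "def add(a, b):\n    return a + b\n"
tests = "def check(fn):\n    assert fn(2, 3) == 5\n    assert fn(-1, 1) == 0\n"
print(check_candidate(candidate, tests, "add"))  # True
```

Because correctness is binary per test suite, text-similarity metrics like BLEU play no role: a completion either passes or it does not.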

Key Points

  • Introduces HumanEval benchmark: 164 hand-crafted Python programming problems evaluated by functional correctness rather than text similarity
  • Codex (12B parameter GPT model fine-tuned on code) solves 28.8% of HumanEval problems; with repeated sampling (pass@100) reaches 70.2%
  • Demonstrates that code-generating models raise novel safety concerns including over-reliance, generation of insecure code, and potential for misuse
  • Discusses economic impacts: potential displacement of software developer tasks and dual-use implications for automating malicious code
  • Establishes a foundation for understanding AI coding capabilities, with direct relevance to AI-assisted scientific research and recursive self-improvement concerns
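The repeated-sampling numbers above are reported via the pass@k metric. The paper gives an unbiased estimator: generate n samples per problem, count the c that pass the tests, and compute the probability that at least one of k randomly drawn samples is correct. A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper:
    pass@k = 1 - C(n - c, k) / C(n, k),
    the probability that at least one of k samples drawn
    (without replacement) from n total samples is correct,
    given c correct samples.
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(100, 30, 1))  # 0.3: with k=1, pass@1 is just the pass rate
```

The naive alternative of running k samples once and checking whether any passes is a high-variance estimate; drawing k from a larger pool of n and computing the closed form above reduces variance without extra generation cost.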

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| Autonomous Coding | Capability | 63.0 |
| Large Language Models | Capability | 60.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Evaluating Large Language Models Trained on Code

(author list identical to the one above)

###### Abstract

(abstract identical to the one above)

Machine Learning, ICML

## 1 Introduction

Scalable sequence prediction models (Graves, [2014](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib45 ""); Vaswani et al., [2017](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib114 ""); Child et al., [2019](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib25 "")) have become a general-purpose method for generation and representation learning in many domains, including natural language processing (Mikolov et al., [2013](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib79 ""); Sutskever et al., [2014](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib109 ""); Dai & Le, [2015](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib31 ""); Peters et al., [2018](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib88 ""); Radford et al., [2018](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib93 ""); Devlin et al., [2018](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib36 "")), computer vision (Van Oord et al., [2016](https://ar5iv.labs.arxiv.org/html/2107.03374#bib.bib113 ""); Menick & Kalchbrenner, [2018](https://ar

... (truncated, 98 KB total)