
Measuring Progress on Scalable Oversight for Large Language Models

paper

Authors

Samuel R. Bowman·Jeeyoon Hyun·Ethan Perez·Edwin Chen·Craig Pettit·Scott Heiner·Kamilė Lukošiūtė·Amanda Askell·Andy Jones·Anna Chen·Anna Goldie·Azalia Mirhoseini·Cameron McKinnon·Christopher Olah·Daniela Amodei·Dario Amodei·Dawn Drain·Dustin Li·Eli Tran-Johnson·Jack Clark·Jackson Kernion·Jamie Kerr·Jared Mueller·Jeffrey Ladish·Joshua Landau·Kamal Ndousse·Liane Lovitt·Nelson Elhage·Nicholas Schiefer·Nicholas Joseph·Noemí Mercado·Nova DasSarma·Robin Larson·Sam McCandlish·Sandipan Kundu·Scott Johnston·Shauna Kravec·Sheer El Showk·Stanislav Fort·Timothy Telleen-Lawton·Tom Brown·Tom Henighan·Tristan Hume·Yuntao Bai·Zac Hatfield-Dodds·Ben Mann·Jared Kaplan

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Empirical research on scalable oversight for large language models, addressing how to supervise AI systems that may exceed human capabilities—a critical challenge for AI safety and alignment.

Paper Details

Citations
187
14 influential
Year
2022
Methodology
arXiv preprint (not peer-reviewed)
Categories
Information Systems Engineering

Metadata

arXiv preprint · primary source

Abstract

Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

Summary

This paper addresses scalable oversight—the challenge of supervising AI systems that may exceed human capabilities—by proposing an empirical framework to study it with current models. The authors design experiments using tasks where human specialists succeed but unaided humans and current AI systems fail, then demonstrate a proof-of-concept on MMLU and QuALITY tasks. They find that humans interacting with an unreliable large language model through chat substantially outperform both the model alone and their unaided performance, suggesting that scalable oversight is tractable to study empirically and that LLMs can effectively assist humans on difficult tasks.
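
The comparison at the heart of the experimental design can be made concrete with a small sketch. The code below is not the authors' code; the question IDs, answer labels, and condition names are invented for illustration. It shows how accuracy is scored separately for the model alone, an unaided human, and a human assisted by the model, over the same question set:

```python
# Minimal sketch of the paper's three-condition comparison.
# All data here is hypothetical; the paper used MMLU and time-limited
# QuALITY multiple-choice questions with human participants.

def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of questions answered correctly."""
    return sum(predictions[q] == gold[q] for q in gold) / len(gold)

# Hypothetical gold labels and recorded answers per condition.
gold = {"q1": "B", "q2": "D", "q3": "A"}
answers = {
    "model_alone":      {"q1": "B", "q2": "C", "q3": "C"},
    "human_alone":      {"q1": "B", "q2": "C", "q3": "A"},
    "human_plus_model": {"q1": "B", "q2": "D", "q3": "A"},
}

for condition, preds in answers.items():
    print(f"{condition}: {accuracy(preds, gold):.0%}")

# The paper's headline result is the ordering
#   human_plus_model > max(model_alone, human_alone)
# on both tasks, which is what makes scalable oversight
# measurable with present-day models.
```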

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Scalable Oversight | Research Area | 68.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 83 KB
# Measuring Progress on Scalable Oversight for Large Language Models

Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen,† Craig Pettit,† Scott Heiner,† Kamilė Lukošiūtė,‡ Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jack Clark, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, and Jared Kaplan

Anthropic, †Surge AI, ‡Independent Researcher

Correspondence to: {sambowman,jared}@anthropic.com

The first and third blocks of authors are core contributors. Author contributions are detailed in § [Author Contributions](https://ar5iv.labs.arxiv.org/html/2211.03540#Sx1 "Author Contributions ‣ Measuring Progress on Scalable Oversight for Large Language Models"). All authors conducted this work while at Anthropic except where noted.

###### Abstract

Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat—a trivial baseline strategy for scalable oversight—substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

## 1 Introduction

To build and deploy powerful AI responsibly, we will need to develop robust techniques for scalable oversight: the ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance (Amodei et al., [2016](https://ar5iv.labs.arxiv.org/html/2211.03540#bib.bib1 "")). These techniques a

... (truncated, 83 KB total)
Resource ID: b0f6f129f201e4dc | Stable ID: NTZjNGJkMj