
Measuring Progress on Scalable Oversight for Large Language Models

paper

Authors

Samuel R. Bowman·Jeeyoon Hyun·Ethan Perez·Edwin Chen·Craig Pettit·Scott Heiner·Kamilė Lukošiūtė·Amanda Askell·Andy Jones·Anna Chen·Anna Goldie·Azalia Mirhoseini·Cameron McKinnon·Christopher Olah·Daniela Amodei·Dario Amodei·Dawn Drain·Dustin Li·Eli Tran-Johnson·Jack Clark·Jackson Kernion·Jamie Kerr·Jared Mueller·Jeffrey Ladish·Joshua Landau·Kamal Ndousse·Liane Lovitt·Nelson Elhage·Nicholas Schiefer·Nicholas Joseph·Noemí Mercado·Nova DasSarma·Robin Larson·Sam McCandlish·Sandipan Kundu·Scott Johnston·Shauna Kravec·Sheer El Showk·Stanislav Fort·Timothy Telleen-Lawton·Tom Brown·Tom Henighan·Tristan Hume·Yuntao Bai·Zac Hatfield-Dodds·Ben Mann·Jared Kaplan

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Empirical research on scalable oversight for large language models, addressing how to supervise AI systems that may exceed human capabilities—a critical challenge for AI safety and alignment.

Paper Details

Citations
187
14 influential
Year
2022
Methodology
arXiv preprint (not peer-reviewed)
Categories
Information Systems Engineering

Metadata

arXiv preprint · primary source

Abstract

Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

Summary

This paper addresses scalable oversight—the challenge of supervising AI systems that may exceed human capabilities—by proposing an empirical framework to study it with current models. The authors design experiments using tasks where human specialists succeed but unaided humans and current AI systems fail, then demonstrate a proof-of-concept on MMLU and QuALITY tasks. They find that humans interacting with an unreliable large language model through chat substantially outperform both the model alone and their unaided performance, suggesting that scalable oversight is tractable to study empirically and that LLMs can effectively assist humans on difficult tasks.
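
The comparison at the heart of the experimental design can be made concrete with a small sketch. The code below is not the authors' code; the question IDs, answer labels, and condition names are invented for illustration. It shows how accuracy is scored separately for the model alone, an unaided human, and a human assisted by the model, over the same question set:

```python
# Minimal sketch of the paper's three-condition comparison.
# All data here is hypothetical; the paper used MMLU and time-limited
# QuALITY multiple-choice questions with human participants.

def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of questions answered correctly."""
    return sum(predictions[q] == gold[q] for q in gold) / len(gold)

# Hypothetical gold labels and recorded answers per condition.
gold = {"q1": "B", "q2": "D", "q3": "A"}
answers = {
    "model_alone":      {"q1": "B", "q2": "C", "q3": "C"},
    "human_alone":      {"q1": "B", "q2": "C", "q3": "A"},
    "human_plus_model": {"q1": "B", "q2": "D", "q3": "A"},
}

for condition, preds in answers.items():
    print(f"{condition}: {accuracy(preds, gold):.0%}")

# The paper's headline result is the ordering
#   human_plus_model > max(model_alone, human_alone)
# on both tasks, which is what makes scalable oversight
# measurable with present-day models.
```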

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Scalable Oversight | Research Area | 68.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 83 KB
# Measuring Progress on Scalable Oversight for Large Language Models

Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen,† Craig Pettit,† Scott Heiner,† Kamilė Lukošiūtė,‡ Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jack Clark, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, and Jared Kaplan

Anthropic, †Surge AI, ‡Independent Researcher

Correspondence to: {sambowman,jared}@anthropic.com

The first and third blocks of authors are core contributors. Author contributions are detailed in § [Author Contributions](https://ar5iv.labs.arxiv.org/html/2211.03540#Sx1 "Author Contributions ‣ Measuring Progress on Scalable Oversight for Large Language Models"). All authors conducted this work while at Anthropic except where noted.

###### Abstract

Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat—a trivial baseline strategy for scalable oversight—substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

## 1 Introduction

To build and deploy powerful AI responsibly, we will need to develop robust techniques for scalable oversight: the ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance (Amodei et al., [2016](https://ar5iv.labs.arxiv.org/html/2211.03540#bib.bib1 "")). These techniques a

... (truncated, 83 KB total)
Resource ID: b0f6f129f201e4dc | Stable ID: NTZjNGJkMj