Measuring Progress on Scalable Oversight for Large Language Models
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Empirical research on scalable oversight for large language models, addressing how to supervise AI systems that may exceed human capabilities—a critical challenge for AI safety and alignment.
Paper Details
Metadata
Abstract
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
Summary
This paper addresses scalable oversight—the challenge of supervising AI systems that may exceed human capabilities—by proposing an empirical framework to study it with current models. The authors design experiments using tasks where human specialists succeed but unaided humans and current AI systems fail, then demonstrate a proof-of-concept on MMLU and QuALITY tasks. They find that humans interacting with an unreliable large language model through chat substantially outperform both the model alone and their unaided performance, suggesting that scalable oversight is tractable to study empirically and that LLMs can effectively assist humans on difficult tasks.
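The comparison the summary describes reduces to tallying answer accuracy under three conditions: the model alone, an unaided human, and a human assisted by the model through chat. The following is a minimal illustrative sketch of that tally; the condition names and per-question records are hypothetical stand-ins, not data from the paper's MMLU or time-limited QuALITY trials.

```python
from collections import defaultdict

# Hypothetical per-question outcomes for the three oversight conditions.
# Real experiments would draw these from MMLU and time-limited QuALITY runs;
# the values here are illustrative only.
results = [
    ("model_alone", True), ("model_alone", False), ("model_alone", False),
    ("human_unaided", True), ("human_unaided", False), ("human_unaided", True),
    ("human_plus_model", True), ("human_plus_model", True), ("human_plus_model", True),
]

def accuracy_by_condition(records):
    """Return the fraction of correct answers for each condition."""
    totals, correct = defaultdict(int), defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}

print(accuracy_by_condition(results))
```

The paper's headline result corresponds to the assisted condition scoring higher than both baselines in such a tally.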
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Scalable Oversight | Research Area | 68.0 |
Cached Content Preview
# Measuring Progress on Scalable Oversight for Large Language Models
Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen,† Craig Pettit,† Scott Heiner,† Kamilė Lukošiūtė,‡ Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jack Clark, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, and Jared Kaplan

Anthropic, †Surge AI, ‡Independent Researcher

Correspondence to: {sambowman,jared}@anthropic.com

The first and third blocks of authors are core contributors. Author contributions are detailed in § [Author Contributions](https://ar5iv.labs.arxiv.org/html/2211.03540#Sx1 "Author Contributions ‣ Measuring Progress on Scalable Oversight for Large Language Models"). All authors conducted this work while at Anthropic except where noted.
###### Abstract
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat—a trivial baseline strategy for scalable oversight—substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
## 1 Introduction
To build and deploy powerful AI responsibly, we will need to develop robust techniques for scalable oversight: the ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance (Amodei et al., [2016](https://ar5iv.labs.arxiv.org/html/2211.03540#bib.bib1 "")). These techniques a
... (truncated, 83 KB total)