Iterated Distillation and Amplification
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This is the original paper formalizing Iterated Amplification (IDA), a key technique in scalable oversight research developed at OpenAI; it is frequently cited alongside debate and recursive reward modeling as a core approach to aligning superhuman AI systems.
Paper Details
Metadata
Abstract
Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems. Iterated Amplification is closely related to Expert Iteration (Anthony et al., 2017; Silver et al., 2017), except that it uses no external reward function. We present results in algorithmic environments, showing that Iterated Amplification can efficiently learn complex behaviors.
Summary
This paper introduces Iterated Amplification (IDA), a training strategy that builds up training signals for complex tasks by recursively decomposing hard problems into easier subproblems humans can evaluate and combining their solutions. The approach avoids the need for external reward functions or direct human evaluation of complex tasks. Empirical results in algorithmic environments demonstrate that IDA can efficiently learn complex behaviors.
Key Points
- Proposes Iterated Amplification as a solution to training AI on tasks too complex for humans to directly evaluate or specify objectives for.
- Progressively builds training signals by decomposing hard problems into easier subproblems, combining sub-solutions to approximate a human's judgment on difficult tasks.
- Closely related to Expert Iteration (AlphaGo-style self-play) but removes reliance on any external reward function, making it applicable to open-ended tasks.
- Demonstrates empirical results in algorithmic environments showing IDA can efficiently learn complex behaviors without direct human feedback on final outputs.
- Represents a foundational approach in scalable oversight, aiming to maintain alignment as AI systems tackle tasks beyond direct human comprehension.
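The decompose-delegate-combine loop described above can be sketched in a toy setting. This is a minimal illustration under stated assumptions, not the paper's implementation: the task (summing lists of numbers), the `human_decompose`/`human_combine` functions, and the lookup-table "model" standing in for a learned network are all hypothetical choices made for the example.

```python
# Toy sketch of Iterated Amplification on a trivial task: summing a
# list of numbers. The decompose/combine "human" functions and the
# lookup-table model are hypothetical stand-ins, not the paper's setup.

def human_decompose(xs):
    # The "human" splits a hard problem into two easier halves.
    mid = len(xs) // 2
    return [xs[:mid], xs[mid:]]

def human_combine(sub_answers):
    # The "human" merges sub-answers (here, by adding them).
    return sum(sub_answers)

def amplify(model, xs):
    # Amplification: answer a hard question by decomposing it,
    # delegating the subquestions to the current model, and combining.
    if len(xs) <= 1:
        return xs[0] if xs else 0
    return human_combine([model(sub) for sub in human_decompose(xs)])

def distill(amplified_answers):
    # Distillation stand-in: "train" a fast model to imitate the
    # amplified system (a lookup table instead of a neural net).
    table = dict(amplified_answers)
    return lambda xs: table.get(tuple(xs), 0)

model = lambda xs: 0  # the initial model knows nothing
questions = [(1,), (2,), (3,), (1, 2), (2, 3), (1, 2, 3)]

# Alternate amplification and distillation; note there is no external
# reward function: the training signal comes from the amplified system.
for _ in range(3):
    answers = [(q, amplify(model, list(q))) for q in questions]
    model = distill(answers)

print(model((1, 2, 3)))  # prints 6: learned via recursive decomposition
```

Each round of amplification solves problems one decomposition level deeper than the current model can, and distillation folds that capability back into the model, mirroring how the paper builds a training signal for hard problems out of solutions to easier subproblems.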
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI Accident Risk Cruxes | Crux | 67.0 |
| Paul Christiano | Person | 39.0 |
Cached Content Preview
# Supervising strong learners by amplifying weak experts
Paul Christiano
OpenAI
paul@openai.com
&Buck Shlegeris
bshlegeris@gmail.com
&Dario Amodei
OpenAI
damodei@openai.com
Work done while at OpenAI.
###### Abstract
Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems. Iterated Amplification is closely related to Expert Iteration (Anthony et al., [2017](https://ar5iv.labs.arxiv.org/html/1810.08575#bib.bib4 ""); Silver et al., [2017b](https://ar5iv.labs.arxiv.org/html/1810.08575#bib.bib22 "")), except that it uses no external reward function. We present results in algorithmic environments, showing that Iterated Amplification can efficiently learn complex behaviors.
## 1 Introduction
If we want to train an ML system to perform a task, we need to be able to evaluate how well it is doing. Whether our training signal takes the form of labels, rewards, or something else entirely, we need some way to generate that signal.
If our goal can be evaluated automatically, such as winning a game of Go, or if we have an algorithm that can generate examples of correct behavior, then generating a training signal is trivial. In these cases we might say that there is an “algorithmic” training signal. Unfortunately, most useful tasks don’t have an algorithmic training signal.

So in current applications of machine learning, humans often provide the training signal. This can be done by having a human demonstrate the task, for example labeling an image or teleoperating a robot, or by learning a reward function from human judgments. For these classes of tasks, we could say there is a “human” training signal.

However, there are harder tasks for which we can’t compute demonstrations or rewards even with human assistance, and for which we currently have no clear method to get a meaningful training signal. Consider making economic policy decisions, advancing the scientific frontier, or managing the security of a large network of computers. Some of these tasks are “beyond human scale” – a single human can’t perform them and can’t make sense of their massive observation space well enough to judge the behavior of an agent. It may be possible for a human to judge performance in the very long run (for example, by looking at economic growth over several years), but such long-term feedback is very slow to learn from. We currently have no way to learn how to perform such tasks much better than a human.
The overall situation is depicted in Table 1, which shows six different combinations of… (truncated, 59 KB total)
... (truncated, 59 KB total)f0980ca7010a4a44 | Stable ID: YmI4MTA3Mj