Iterated Distillation and Amplification
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This is the original paper formalizing Iterated Amplification (IDA), a key technique in scalable oversight research developed at OpenAI; it is frequently cited alongside debate and recursive reward modeling as a core approach to aligning superhuman AI systems.
Paper Details
Metadata
Abstract
Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems. Iterated Amplification is closely related to Expert Iteration (Anthony et al., 2017; Silver et al., 2017), except that it uses no external reward function. We present results in algorithmic environments, showing that Iterated Amplification can efficiently learn complex behaviors.
Summary
This paper introduces Iterated Amplification (IDA), a training strategy that builds up training signals for complex tasks by recursively decomposing hard problems into easier subproblems humans can evaluate and combining their solutions. The approach avoids the need for external reward functions or direct human evaluation of complex tasks. Empirical results in algorithmic environments demonstrate that IDA can efficiently learn complex behaviors.
Key Points
- Proposes Iterated Amplification as a solution to training AI on tasks too complex for humans to directly evaluate or specify objectives for.
- Progressively builds training signals by decomposing hard problems into easier subproblems, combining sub-solutions to approximate a human's judgment on difficult tasks.
- Closely related to Expert Iteration (AlphaGo-style self-play) but removes reliance on any external reward function, making it applicable to open-ended tasks.
- Demonstrates empirical results in algorithmic environments showing IDA can efficiently learn complex behaviors without direct human feedback on final outputs.
- Represents a foundational approach in scalable oversight, aiming to maintain alignment as AI systems tackle tasks beyond direct human comprehension.
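The decompose-delegate-combine loop described above can be sketched in a toy setting. This is a minimal illustration under stated assumptions, not the paper's implementation: the task (summing lists of numbers), the `human_decompose`/`human_combine` functions, and the lookup-table "model" standing in for a learned network are all hypothetical choices made for the example.

```python
# Toy sketch of Iterated Amplification on a trivial task: summing a
# list of numbers. The decompose/combine "human" functions and the
# lookup-table model are hypothetical stand-ins, not the paper's setup.

def human_decompose(xs):
    # The "human" splits a hard problem into two easier halves.
    mid = len(xs) // 2
    return [xs[:mid], xs[mid:]]

def human_combine(sub_answers):
    # The "human" merges sub-answers (here, by adding them).
    return sum(sub_answers)

def amplify(model, xs):
    # Amplification: answer a hard question by decomposing it,
    # delegating the subquestions to the current model, and combining.
    if len(xs) <= 1:
        return xs[0] if xs else 0
    return human_combine([model(sub) for sub in human_decompose(xs)])

def distill(amplified_answers):
    # Distillation stand-in: "train" a fast model to imitate the
    # amplified system (a lookup table instead of a neural net).
    table = dict(amplified_answers)
    return lambda xs: table.get(tuple(xs), 0)

model = lambda xs: 0  # the initial model knows nothing
questions = [(1,), (2,), (3,), (1, 2), (2, 3), (1, 2, 3)]

# Alternate amplification and distillation; note there is no external
# reward function: the training signal comes from the amplified system.
for _ in range(3):
    answers = [(q, amplify(model, list(q))) for q in questions]
    model = distill(answers)

print(model((1, 2, 3)))  # prints 6: learned via recursive decomposition
```

Each round of amplification solves problems one decomposition level deeper than the current model can, and distillation folds that capability back into the model, mirroring how the paper builds a training signal for hard problems out of solutions to easier subproblems.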
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI Accident Risk Cruxes | Crux | 67.0 |
| Paul Christiano | Person | 39.0 |
Cached Content Preview
# Supervising strong learners by amplifying weak experts
Paul Christiano
OpenAI
paul@openai.com
&Buck Shlegeris
bshlegeris@gmail.com
&Dario Amodei
OpenAI
damodei@openai.com
Work done while at OpenAI.
###### Abstract
Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems. Iterated Amplification is closely related to Expert Iteration (Anthony et al., [2017](https://ar5iv.labs.arxiv.org/html/1810.08575#bib.bib4 ""); Silver et al., [2017b](https://ar5iv.labs.arxiv.org/html/1810.08575#bib.bib22 "")), except that it uses no external reward function. We present results in algorithmic environments, showing that Iterated Amplification can efficiently learn complex behaviors.
## 1 Introduction
If we want to train an ML system to perform a task, we need to be able to evaluate how well it is doing. Whether our training signal takes the form of labels, rewards, or something else entirely, we need some way to generate that signal.
If our goal can be evaluated automatically, such as winning a game of Go, or if we have an algorithm that can generate examples of correct behavior, then generating a training signal is trivial. In these cases we might say that there is an “algorithmic” training signal. Unfortunately, most useful tasks don’t have an algorithmic training signal.

So in current applications of machine learning, humans often provide the training signal. This can be done by having a human demonstrate the task, for example labeling an image or teleoperating a robot, or by learning a reward function from human judgments. For these classes of tasks, we could say there is a “human” training signal.

However, there are harder tasks for which we can’t compute demonstrations or rewards even with human assistance, and for which we currently have no clear method to get a meaningful training signal. Consider making economic policy decisions, advancing the scientific frontier, or managing the security of a large network of computers. Some of these tasks are “beyond human scale” – a single human can’t perform them and can’t make sense of their massive observation space well enough to judge the behavior of an agent. It may be possible for a human to judge performance in the very long run (for example, by looking at economic growth over several years), but such long-term feedback is very slow to learn from. We currently have no way to learn how to perform such tasks much better than a human.
The overall situation is depicted in Table 1, which shows six different combinations of… (truncated, 59 KB total)
... (truncated, 59 KB total)f0980ca7010a4a44 | Stable ID: YmI4MTA3Mj