OpenAI's iterated amplification work
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
This OpenAI blog post presents early empirical work on iterated amplification, a key scalable oversight proposal by Paul Christiano, relevant to anyone studying techniques for supervising AI systems beyond direct human evaluative capacity.
Metadata
Summary
OpenAI introduces iterated amplification, a scalable oversight technique where a human-AI system is progressively amplified through decomposition of complex tasks into simpler subproblems, enabling AI to learn goals that would otherwise be too difficult for humans to evaluate directly. The approach aims to maintain alignment even as AI capabilities scale beyond direct human oversight. It represents a core research direction for training AI systems on tasks where human feedback alone is insufficient.
Key Points
- Iterated amplification decomposes hard tasks into easier subtasks, allowing humans to provide effective oversight by combining answers to simpler questions.
- The technique aims to scale human supervision without sacrificing alignment, addressing the core challenge of evaluating superhuman AI performance.
- It is closely related to Paul Christiano's theoretical work and serves as an empirical test of the amplification+distillation training loop.
- The method alternates between amplification (human + AI assistant) and distillation (training a new model to match the amplified system) iteratively; see the sketch after this list.
- Iterated amplification is positioned as complementary to debate as a mechanism for scalable oversight of advanced AI systems.
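A minimal sketch of that amplification-and-distillation loop, assuming hypothetical placeholder functions (`amplify`, `distill`, `decompose`, `combine`); this illustrates the training scheme described above, not OpenAI's actual implementation or API:

```python
# Illustrative sketch only: all names below are hypothetical placeholders,
# not OpenAI's code. The structure follows the loop described in the key
# points: amplify (human + current model), then distill (train a new model
# to imitate the amplified system), and repeat.

from typing import Callable, List, Tuple

Question = str
Answer = str


def amplify(
    model: Callable[[Question], Answer],
    decompose: Callable[[Question], List[Question]],
    combine: Callable[[Question, List[Answer]], Answer],
    question: Question,
) -> Answer:
    """Amplification step: the human overseer decomposes a hard question,
    delegates each sub-question to the current model, and combines the
    sub-answers into an overall answer."""
    sub_questions = decompose(question)
    sub_answers = [model(q) for q in sub_questions]
    return combine(question, sub_answers)


def iterated_amplification(
    initial_model: Callable[[Question], Answer],
    decompose: Callable[[Question], List[Question]],
    combine: Callable[[Question, List[Answer]], Answer],
    distill: Callable[[List[Tuple[Question, Answer]]], Callable[[Question], Answer]],
    questions: List[Question],
    rounds: int,
) -> Callable[[Question], Answer]:
    """Alternate amplification and distillation for a fixed number of rounds."""
    model = initial_model
    for _ in range(rounds):
        # Use the amplified (human + model) system to produce training targets.
        targets = [(q, amplify(model, decompose, combine, q)) for q in questions]
        # Distillation: train a fresh model to imitate those targets; the
        # distilled model becomes the assistant for the next round.
        model = distill(targets)
    return model
```

In the paper, the distillation step is supervised learning on the amplified system's answers; `distill` is left abstract here.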
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Scalable Oversight | Research Area | 68.0 |
Cached Content Preview
Learning complex goals with iterated amplification \| OpenAI
October 22, 2018
# Learning complex goals with iterated amplification
[Read paper](https://arxiv.org/abs/1810.08575)

We’re proposing an AI safety technique called iterated amplification that lets us specify complicated behaviors and goals that are beyond human scale, by demonstrating how to decompose a task into simpler sub-tasks, rather than by providing labeled data or a reward function. Although this idea is in its very early stages and we have only completed experiments on simple toy algorithmic domains, we’ve decided to present it in its preliminary state because we think it could prove to be a scalable approach to AI safety.
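As a concrete illustration of getting a training signal from decomposition rather than from labels or a reward function, consider summing a long list of numbers: an overseer who can only split lists and add a couple of numbers can still supervise the full task by combining an assistant's answers to smaller sub-lists. This toy example and its helper names (`decompose`, `combine`, `amplified_answer`) are our own illustration; the paper's actual toy algorithmic tasks differ.

```python
# Illustrative toy decomposition, not one of the paper's tasks: the "hard"
# question (sum of a long list) is answered by combining answers to easier
# sub-questions (sums of shorter lists).

from typing import Callable, List


def decompose(numbers: List[int]) -> List[List[int]]:
    """Split one hard instance into two easier sub-instances."""
    mid = len(numbers) // 2
    return [numbers[:mid], numbers[mid:]]


def combine(sub_answers: List[int]) -> int:
    """Combine sub-answers into an answer for the original instance."""
    return sum(sub_answers)


def amplified_answer(numbers: List[int], assistant: Callable[[List[int]], int]) -> int:
    """An overseer who only knows how to decompose and combine, assisted by
    a model that answers the easier sub-questions."""
    if len(numbers) <= 2:
        return sum(numbers)  # small enough for the overseer to do directly
    subproblems = decompose(numbers)
    return combine([assistant(sub) for sub in subproblems])


# Usage: with a trusted assistant this recovers the right answer; during
# training, the distilled model would play the assistant role.
print(amplified_answer(list(range(10)), assistant=sum))  # 45
```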
If we want to train an ML system to perform a task, we need a training signal—a way to evaluate how well it is doing in order to help it learn. For example, labels in supervised learning or rewards in reinforcement learning are training signals. The formalism of ML usually assumes a training signal is already present and focuses on learning from it, but in reality the training signal has to come from somewhere. If we don’t have a training signal we can’t learn the task, and if we have the wrong training signal, we can get unintended and [sometimes](https://arxiv.org/abs/1803.03453) [dangerous](https://openai.com/index/faulty-reward-functions/) [behavior](https://arxiv.org/abs/1705.08417). Thus, it would be valuable for both learning new tasks, and for AI safety, to improve our ability to generate training signals.
How do we currently generate training signals? Sometimes, the goal we want can be evaluated algorithmically, like counting up the score in a game of Go or checking whether a set of numbers has been successfully sorted (left panels of figure below). Most real-world tasks don’t lend themselves to an algorithmic training signal, but often we can instead obtain a training signal by having a human either [perform](https://arxiv.org/abs/1603.00448) [the](https://ai.stanford.edu/~ang/papers/icml04-apprentice.pdf) [task](https://arxiv.org/abs/1710.04615) (for example, labeling a training set or demonstrating an RL task), or [judge](https://dl.acm.org/citation.cfm?id=1597738) [an](https://arxiv.org/abs/1208.0984) [AI’s](https://arxiv.org/abs/1701.06049) [performance](https://openai.com/index/learning-from-human-preferences/) on the task (middle panels of figure below). However, many tasks are so complicated that a human can’t judge or perform them—examples
... (truncated, 10 KB total)