[2312.09390] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Influential OpenAI paper by Burns et al. (2023) that operationalizes the superalignment problem empirically and introduces weak-to-strong generalization as a key research paradigm for aligning models smarter than their supervisors.
Paper Details
Metadata
Abstract
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior - for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
Summary
This OpenAI paper introduces the 'weak-to-strong generalization' problem as an analogy for superalignment: can a weak supervisor (humans) elicit good behavior from a much stronger model (superintelligence)? Experiments show that strong pretrained models can generalize beyond weak labels, and simple techniques like auxiliary confidence loss can significantly improve this generalization.
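The paper quantifies "generalize beyond weak labels" with a metric it calls performance gap recovered (PGR): the fraction of the gap between the weak supervisor's performance and the strong model's ceiling that weak-to-strong training closes. A minimal sketch (the function name and argument names here are illustrative, not from the paper's code):

```python
def performance_gap_recovered(weak_perf, w2s_perf, strong_ceiling_perf):
    """Fraction of the weak-to-ceiling gap closed by weak-to-strong training.

    PGR = 1 means the strong model fully recovered its ceiling performance
    despite only seeing weak labels; PGR = 0 means it merely matched the
    weak supervisor.
    """
    return (w2s_perf - weak_perf) / (strong_ceiling_perf - weak_perf)


# Example: weak supervisor scores 60%, weak-to-strong model scores 70%,
# strong model trained on ground truth would score 80%.
pgr = performance_gap_recovered(0.60, 0.70, 0.80)  # → 0.5
```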
Key Points
- Frames the core superalignment challenge as: can weak supervisors (humans) reliably align models far smarter than themselves?
- Uses an empirically tractable setup in which small models supervise large pretrained models to study weak-to-strong dynamics
- Finds that strong models generalize beyond their weak labels, recovering part of their true capability rather than simply imitating the weak supervisor's errors
- Proposes an auxiliary confidence loss and other techniques that close part of the gap between weak supervision and strong ground truth
- Identifies this as a key research agenda for scalable oversight, with significant unsolved challenges remaining before superintelligence
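The auxiliary confidence loss mixes the usual cross-entropy against the weak labels with a cross-entropy against the strong model's own hardened (thresholded) predictions, so the strong model is encouraged to stay confident in its own answers rather than defer to weak-label mistakes. The sketch below assumes binary classification with probabilities rather than logits; the function names, the fixed `threshold`, and the NumPy formulation are illustrative simplifications, not the paper's exact implementation:

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    """Elementwise binary cross-entropy between predicted prob p and target y."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def aux_confidence_loss(strong_probs, weak_labels, alpha=0.5, threshold=0.5):
    """(1 - alpha) * CE(strong, weak labels) + alpha * CE(strong, own hardened preds).

    `alpha` trades off trusting the weak supervisor against reinforcing the
    strong model's own confident predictions.
    """
    strong_probs = np.asarray(strong_probs, dtype=float)
    weak_labels = np.asarray(weak_labels, dtype=float)
    hardened = (strong_probs > threshold).astype(float)
    loss_weak = binary_cross_entropy(strong_probs, weak_labels)
    loss_self = binary_cross_entropy(strong_probs, hardened)
    return float(np.mean((1 - alpha) * loss_weak + alpha * loss_self))
```

With `alpha=0` this reduces to naive finetuning on weak labels; with `alpha=1` the weak labels are ignored entirely and the model only sharpens its own predictions.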
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| OpenAI | Organization | 62.0 |
| AI Alignment | Approach | 91.0 |
| RLHF | Research Area | 63.0 |
| Weak-to-Strong Generalization | Approach | 91.0 |
Cached Content Preview
# Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns, Pavel Izmailov∗, Jan Hendrik Kirchner∗, Bowen Baker∗, Leo Gao∗, Leopold Aschenbrenner∗, Yining Chen∗, Adrien Ecoffet∗, Manas Joglekar∗, Jan Leike, Ilya Sutskever, Jeff Wu∗

OpenAI
∗Primary authors. This was a joint project of the Superalignment Generalization team. Correspondence to generalization@openai.com.
Code is available at [github.com/openai/weak-to-strong](https://github.com/openai/weak-to-strong "").
###### Abstract
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs.
However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models.
We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model?
We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks.
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization.
However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work.
We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks.
Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
## 1 Introduction
We mainly steer or align today’s models with reinforcement learning from human feedback (RLHF): we reinforce behaviors that human evaluators rate highly and penalize behaviors that evaluators rate poorly (Christiano et al., [2017](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib26 ""); Stiennon et al., [2020](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib113 ""); Ouyang et al., [2022](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib86 ""); Glaese et al., [2022](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib39 ""); Bai et al., [2022a](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib5 "")).
This procedure is very effective when human evaluators can tell if model behavior is good or bad and is a core part of training modern language model assistants such as ChatGPT.
However, superhuman models will be capable of complex and creative behaviors that humans cannot fully understand. For example, if a superhuman
... (truncated, 98 KB total)