[2312.09390] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Influential OpenAI paper by Burns et al. (2023) that operationalizes the superalignment problem empirically and introduces weak-to-strong generalization as a key research paradigm for aligning models smarter than their supervisors.
Paper Details
Metadata
Abstract
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior - for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
Summary
This OpenAI paper introduces the 'weak-to-strong generalization' problem as an analogy for superalignment: can a weak supervisor (humans) elicit good behavior from a much stronger model (superintelligence)? Experiments show that strong pretrained models can generalize beyond weak labels, and simple techniques like auxiliary confidence loss can significantly improve this generalization.
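The paper quantifies "generalize beyond weak labels" with a metric it calls performance gap recovered (PGR): the fraction of the gap between the weak supervisor's performance and the strong model's ceiling that weak-to-strong training closes. A minimal sketch (the function name and argument names here are illustrative, not from the paper's code):

```python
def performance_gap_recovered(weak_perf, w2s_perf, strong_ceiling_perf):
    """Fraction of the weak-to-ceiling gap closed by weak-to-strong training.

    PGR = 1 means the strong model fully recovered its ceiling performance
    despite only seeing weak labels; PGR = 0 means it merely matched the
    weak supervisor.
    """
    return (w2s_perf - weak_perf) / (strong_ceiling_perf - weak_perf)


# Example: weak supervisor scores 60%, weak-to-strong model scores 70%,
# strong model trained on ground truth would score 80%.
pgr = performance_gap_recovered(0.60, 0.70, 0.80)  # → 0.5
```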
Key Points
- Frames the core superalignment challenge as: can weak supervisors (humans) reliably align models far smarter than themselves?
- Uses an empirically tractable setup in which small models supervise large pretrained models to study weak-to-strong dynamics
- Finds that strong models generalize beyond their weak labels, recovering part of their true capability rather than simply imitating the weak supervisor's errors
- Proposes an auxiliary confidence loss and other techniques that close part of the gap between weak supervision and strong ground truth
- Identifies this as a key research agenda for scalable oversight, with significant unsolved challenges remaining before superintelligence
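The auxiliary confidence loss mixes the usual cross-entropy against the weak labels with a cross-entropy against the strong model's own hardened (thresholded) predictions, so the strong model is encouraged to stay confident in its own answers rather than defer to weak-label mistakes. The sketch below assumes binary classification with probabilities rather than logits; the function names, the fixed `threshold`, and the NumPy formulation are illustrative simplifications, not the paper's exact implementation:

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    """Elementwise binary cross-entropy between predicted prob p and target y."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def aux_confidence_loss(strong_probs, weak_labels, alpha=0.5, threshold=0.5):
    """(1 - alpha) * CE(strong, weak labels) + alpha * CE(strong, own hardened preds).

    `alpha` trades off trusting the weak supervisor against reinforcing the
    strong model's own confident predictions.
    """
    strong_probs = np.asarray(strong_probs, dtype=float)
    weak_labels = np.asarray(weak_labels, dtype=float)
    hardened = (strong_probs > threshold).astype(float)
    loss_weak = binary_cross_entropy(strong_probs, weak_labels)
    loss_self = binary_cross_entropy(strong_probs, hardened)
    return float(np.mean((1 - alpha) * loss_weak + alpha * loss_self))
```

With `alpha=0` this reduces to naive finetuning on weak labels; with `alpha=1` the weak labels are ignored entirely and the model only sharpens its own predictions.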
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| OpenAI | Organization | 62.0 |
| AI Alignment | Approach | 91.0 |
| RLHF | Research Area | 63.0 |
| Weak-to-Strong Generalization | Approach | 91.0 |
Cached Content Preview
# Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns, Pavel Izmailov∗, Jan Hendrik Kirchner∗, Bowen Baker∗, Leo Gao∗, Leopold Aschenbrenner∗, Yining Chen∗, Adrien Ecoffet∗, Manas Joglekar∗, Jan Leike, Ilya Sutskever, Jeff Wu∗

OpenAI
∗Primary authors. This was a joint project of the Superalignment Generalization team. Correspondence to generalization@openai.com.
Code is available at [github.com/openai/weak-to-strong](https://github.com/openai/weak-to-strong "").
###### Abstract
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs.
However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models.
We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model?
We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks.
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization.
However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work.
We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks.
Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
## 1 Introduction
We mainly steer or align today’s models with reinforcement learning from human feedback (RLHF): we reinforce behaviors that human evaluators rate highly and penalize behaviors that evaluators rate poorly (Christiano et al., [2017](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib26 ""); Stiennon et al., [2020](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib113 ""); Ouyang et al., [2022](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib86 ""); Glaese et al., [2022](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib39 ""); Bai et al., [2022a](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib5 "")).
This procedure is very effective when human evaluators can tell if model behavior is good or bad and is a core part of training modern language model assistants such as ChatGPT.
However, superhuman models will be capable of complex and creative behaviors that humans cannot fully understand. For example, if a superhuman
... (truncated, 98 KB total)