Longterm Wiki

[2312.09390] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

paper

Authors

Collin Burns·Pavel Izmailov·Jan Hendrik Kirchner·Bowen Baker·Leo Gao·Leopold Aschenbrenner·Yining Chen·Adrien Ecoffet·Manas Joglekar·Jan Leike·Ilya Sutskever·Jeff Wu

Credibility Rating

3/5 — Good

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Influential OpenAI paper by Burns et al. (2023) that operationalizes the superalignment problem empirically and introduces weak-to-strong generalization as a key research paradigm for aligning models smarter than their supervisors.

Paper Details

Citations
421
54 influential
Year
2023

Metadata

Importance: 90/100 · arxiv preprint · primary source

Abstract

Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior - for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.

Summary

This OpenAI paper introduces the 'weak-to-strong generalization' problem as an analogy for superalignment: can a weak supervisor (humans) elicit good behavior from a much stronger model (superintelligence)? Experiments show that strong pretrained models can generalize beyond weak labels, and simple techniques like auxiliary confidence loss can significantly improve this generalization.
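
The paper quantifies this with its "performance gap recovered" (PGR) metric: the fraction of the gap between the weak supervisor and the strong model's ceiling that weak-to-strong training recovers. A minimal sketch (the function name is illustrative):

```python
def performance_gap_recovered(weak_acc: float, w2s_acc: float, ceiling_acc: float) -> float:
    """Fraction of the weak-to-ceiling gap recovered by weak-to-strong finetuning.

    PGR = 0 means the strong student only matches its weak supervisor;
    PGR = 1 means it matches a strong model trained on ground truth.
    """
    return (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)

# A weak supervisor at 60% accuracy, a weak-to-strong model at 72%,
# and a ground-truth-trained strong model at 80%: 60% of the gap is recovered.
print(round(performance_gap_recovered(0.60, 0.72, 0.80), 2))  # 0.6
```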

Key Points

  • Frames the core superalignment challenge as: can weak supervisors (humans) reliably align models far smarter than themselves?
  • Uses empirically tractable setup where small models supervise large pretrained models to study weak-to-strong dynamics
  • Finds that strong models generalize beyond weak labels rather than merely imitating their supervisors' errors, recovering part of their latent capability
  • Proposes auxiliary confidence loss and other techniques that close part of the gap between weak supervision and strong ground truth
  • Identifies this as a key research agenda for scalable oversight, with significant unsolved challenges remaining before superintelligence
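
The auxiliary confidence loss mentioned above can be sketched as a cross-entropy against a target that mixes the weak label with the strong model's own hardened (argmax) prediction. This is a simplified NumPy illustration, not the paper's exact implementation (the paper uses a thresholded variant during finetuning); `alpha` and the function name are illustrative:

```python
import numpy as np

def aux_confidence_loss(logits: np.ndarray, weak_label: int, alpha: float = 0.5) -> float:
    """Cross-entropy against a blend of the weak one-hot label and the
    strong student's own hardened prediction. With alpha > 0, the student
    is partly rewarded for confidently sticking with its own answer even
    when the weak label disagrees."""
    # softmax over class logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # one-hot "hardened" version of the student's own prediction
    hardened = np.zeros_like(probs)
    hardened[int(np.argmax(probs))] = 1.0
    # mix the weak one-hot label with the hardened self-prediction
    target = (1 - alpha) * np.eye(len(probs))[weak_label] + alpha * hardened
    return float(-np.sum(target * np.log(probs + 1e-12)))
```

When the student is confident and disagrees with the weak label (e.g. `logits = [2.0, 0.0]` but `weak_label = 1`), raising `alpha` lowers the loss, so the student is not forced to imitate what is likely a weak-supervisor mistake.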

Cited by 4 pages

| Page | Type | Quality |
| --- | --- | --- |
| OpenAI | Organization | 62.0 |
| AI Alignment | Approach | 91.0 |
| RLHF | Research Area | 63.0 |
| Weak-to-Strong Generalization | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Collin Burns, Pavel Izmailov∗, Jan Hendrik Kirchner∗, Bowen Baker∗, Leo Gao∗, Leopold Aschenbrenner∗, Yining Chen∗, Adrien Ecoffet∗, Manas Joglekar∗, Jan Leike, Ilya Sutskever, Jeff Wu∗

OpenAI

∗Primary authors. This was a joint project of the Superalignment Generalization team. Correspondence to generalization@openai.com.
Code is available at [github.com/openai/weak-to-strong](https://github.com/openai/weak-to-strong "").

###### Abstract

Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs.
However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models.
We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model?
We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks.
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization.
However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work.
We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks.
Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.

## 1 Introduction

We mainly steer or align today’s models with reinforcement learning from human feedback (RLHF): we reinforce behaviors that human evaluators rate highly and penalize behaviors that evaluators rate poorly (Christiano et al., [2017](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib26 ""); Stiennon et al., [2020](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib113 ""); Ouyang et al., [2022](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib86 ""); Glaese et al., [2022](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib39 ""); Bai et al., [2022a](https://ar5iv.labs.arxiv.org/html/2312.09390#bib.bib5 "")).
This procedure is very effective when human evaluators can tell if model behavior is good or bad and is a core part of training modern language model assistants such as ChatGPT.

However, superhuman models will be capable of complex and creative behaviors that humans cannot fully understand. For example, if a superhuman 

... (truncated, 98 KB total)
Resource ID: 0ba98ae3a8a72270 | Stable ID: ZTgxY2EzZm