Longterm Wiki

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

Foundational OpenAI Superalignment team paper (Dec 2023) introducing weak-to-strong generalization as a key research paradigm; highly relevant to scalable oversight and the broader challenge of aligning superintelligent systems.

Metadata

Importance: 88/100 · blog post · primary source

Summary

OpenAI's Superalignment team introduces a research paradigm for tackling superintelligence alignment by studying whether weak models can supervise stronger ones. They demonstrate that a GPT-2-level supervisor can elicit near GPT-3.5-level performance from GPT-4, showing that strong pretrained models can generalize beyond their weak supervisor's limitations. This provides an empirically tractable analogy for the core challenge of humans supervising superhuman AI.

Key Points

  • Proposes weak-to-strong generalization as a tractable empirical analogy for the superalignment problem, where humans must supervise AI smarter than themselves.
  • Demonstrates GPT-2-level models can supervise GPT-4 to achieve close to GPT-3.5-level performance, even on hard problems the weak model fails at.
  • Strong pretrained models leverage latent knowledge to generalize beyond flawed/incomplete weak supervisor labels, rather than simply imitating errors.
  • Introduces a scalable research direction allowing iterative empirical progress on superalignment challenges before superintelligence is developed.
  • Current alignment methods like RLHF rely on human supervision that will break down when AI can produce outputs too complex for humans to evaluate.

Cited by 3 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 8 KB
OpenAI

December 14, 2023

[Safety](https://openai.com/news/safety-alignment/)

# Weak-to-strong generalization

[Read paper](https://cdn.openai.com/papers/weak-to-strong-generalization.pdf)

![Weak To Strong Generalization](https://images.ctfassets.net/kftzwdyauwt9/1tCf4AONiCc3OkX47FmFy0/f95e25993d309257c631c4e64b699685/weak-to-strong-generalization.jpg?w=3840&q=90&fm=webp)

Justin Jay Wang × DALL·E


We present a new research direction for superalignment, together with promising initial results: can we leverage the generalization properties of deep learning to control strong models with weak supervisors?

A core challenge for aligning future superhuman AI systems (superalignment) is that humans will need to supervise AI systems much smarter than them. We study a simple analogy: can small models supervise large models? We show that we can use a GPT‑2‑level model to elicit most of GPT‑4’s capabilities—close to GPT‑3.5‑level performance—generalizing correctly even to hard problems where the small model failed. This opens up a new research direction that allows us to directly tackle a central challenge of aligning future superhuman models while making iterative empirical progress today.
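A minimal toy sketch of this recipe, using scikit-learn models as stand-ins for the GPT-2-level supervisor and GPT-4-level student (the stand-in models, the synthetic dataset, and the restricted feature view for the weak model are illustrative assumptions, not OpenAI's setup): the weak model is trained on ground truth, its imperfect predictions become the only labels the strong model sees, and the result is compared against a strong "ceiling" trained directly on ground truth.

```python
# Toy sketch of the weak-to-strong pipeline (scikit-learn stand-ins, not OpenAI's code).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification task (illustrative only).
X, y = make_classification(n_samples=6000, n_features=40, n_informative=20, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, test_size=0.7, random_state=0)
X_transfer, X_test, y_transfer, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1. "Weak supervisor": a small model that only sees a few features, trained on ground truth.
weak = LogisticRegression(max_iter=1000).fit(X_weak[:, :5], y_weak)

# 2. The weak supervisor labels a fresh transfer set; these labels are imperfect.
weak_labels = weak.predict(X_transfer[:, :5])

# 3. "Strong student": a larger model trained only on the weak labels, never ground truth.
strong_student = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300, random_state=0)
strong_student.fit(X_transfer, weak_labels)

# 4. "Strong ceiling": the same large model trained on ground-truth labels instead.
strong_ceiling = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300, random_state=0)
strong_ceiling.fit(X_transfer, y_transfer)

print("weak supervisor :", weak.score(X_test[:, :5], y_test))
print("weak-to-strong  :", strong_student.score(X_test, y_test))
print("strong ceiling  :", strong_ceiling.score(X_test, y_test))
```

How much the strong student outperforms its weak supervisor on a toy task like this will vary; the paper studies the phenomenon at the scale of pretrained language models across NLP, chess, and reward-modeling tasks.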

## The superalignment problem

We believe superintelligence—AI vastly smarter than humans—could be developed within the next ten years. However, we still do not know how to reliably steer and control superhuman AI systems. Solving this problem is essential for ensuring that even the most advanced AI systems in the future remain safe and beneficial to humanity.

We formed the [Superalignment team⁠](https://openai.com/superalignment/) earlier this year to solve this problem of superintelligence alignment. Today, we are releasing the team’s first paper, which introduces a new research direction for empirically aligning superhuman models.

Current alignment methods, such as reinforcement learning from human feedback (RLHF), rely on human supervision. However, future AI systems will be capable of extremely complex and creative behaviors that will make it hard for humans to reliably supervise them. For example, superhuman models may be able to write millions of lines of novel—and potentially dangerous—computer code that would be very hard even for expert humans to understand.

Relative to superhuman AI models, humans will be “weak supervisors.” This is a core challenge for AGI alignment: how can weak supervisors trust and control substantially stronger models?

## Our setup

To make progress on this core challenge, we propose an analogy we can empirically study today: **can we use a smaller (less capable) model to supervise a larger (more capable) model?**
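The linked paper summarizes such experiments with a "performance gap recovered" (PGR) metric: the fraction of the gap between the weak supervisor and the strong ceiling that the weak-to-strong student closes. A short worked sketch (the numbers below are placeholders, not results from the paper):

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weak-to-strong student."""
    return (weak_to_strong - weak) / (strong_ceiling - weak)

# Placeholder numbers: weak supervisor 60%, strong ceiling 90%, student 84%.
print(performance_gap_recovered(0.60, 0.84, 0.90))  # 0.8, i.e. 80% of the gap recovered
```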

![Superalignmentblog Artwork Transparent](https://images.ctfassets.net/kftzwdyauwt9/7f37b3a2-c7fa-4e5d-85517bd29a33/cc8ca2aa652e9edc6e724977f64b2c64/SuperAlignmentBlog_Artwork_Transparent.png?w=3840&q=90&fm=webp)

**A simple analogy for superalignment:** In traditional machine le

... (truncated, 8 KB total)
Resource ID: 3c2487da42fb53cb | Stable ID: ZGZiMDVjMD