Longterm Wiki

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

Foundational OpenAI Superalignment team paper (Dec 2023) introducing weak-to-strong generalization as a key research paradigm; highly relevant to scalable oversight and the broader challenge of aligning superintelligent systems.

Metadata

Importance: 88/100 · blog post · primary source

Summary

OpenAI's Superalignment team introduces a research paradigm for tackling superintelligence alignment by studying whether weak models can supervise stronger ones. They demonstrate that a GPT-2-level supervisor can elicit near GPT-3.5-level performance from GPT-4, showing that strong pretrained models can generalize beyond their weak supervisor's limitations. This provides an empirically tractable analogy for the core challenge of humans supervising superhuman AI.

Key Points

  • Proposes weak-to-strong generalization as a tractable empirical analogy for the superalignment problem, where humans must supervise AI smarter than themselves.
  • Demonstrates GPT-2-level models can supervise GPT-4 to achieve close to GPT-3.5-level performance, even on hard problems the weak model fails at.
  • Strong pretrained models leverage latent knowledge to generalize beyond flawed/incomplete weak supervisor labels, rather than simply imitating errors.
  • Introduces a scalable research direction allowing iterative empirical progress on superalignment challenges before superintelligence is developed.
  • Current alignment methods like RLHF rely on human supervision that will break down when AI can produce outputs too complex for humans to evaluate.

Cited by 3 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 8 KB
OpenAI

December 14, 2023

[Safety](https://openai.com/news/safety-alignment/)

# Weak-to-strong generalization

[Read paper](https://cdn.openai.com/papers/weak-to-strong-generalization.pdf)

![Weak To Strong Generalization](https://images.ctfassets.net/kftzwdyauwt9/1tCf4AONiCc3OkX47FmFy0/f95e25993d309257c631c4e64b699685/weak-to-strong-generalization.jpg?w=3840&q=90&fm=webp)

Justin Jay Wang × DALL·E


We present a new research direction for superalignment, together with promising initial results: can we leverage the generalization properties of deep learning to control strong models with weak supervisors?

A core challenge for aligning future superhuman AI systems (superalignment) is that humans will need to supervise AI systems much smarter than them. We study a simple analogy: can small models supervise large models? We show that we can use a GPT‑2‑level model to elicit most of GPT‑4’s capabilities—close to GPT‑3.5‑level performance—generalizing correctly even to hard problems where the small model failed. This opens up a new research direction that allows us to directly tackle a central challenge of aligning future superhuman models while making iterative empirical progress today.
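A minimal toy sketch of this recipe, using scikit-learn models as stand-ins for the GPT-2-level supervisor and GPT-4-level student (the stand-in models, the synthetic dataset, and the restricted feature view for the weak model are illustrative assumptions, not OpenAI's setup): the weak model is trained on ground truth, its imperfect predictions become the only labels the strong model sees, and the result is compared against a strong "ceiling" trained directly on ground truth.

```python
# Toy sketch of the weak-to-strong pipeline (scikit-learn stand-ins, not OpenAI's code).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification task (illustrative only).
X, y = make_classification(n_samples=6000, n_features=40, n_informative=20, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, test_size=0.7, random_state=0)
X_transfer, X_test, y_transfer, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1. "Weak supervisor": a small model that only sees a few features, trained on ground truth.
weak = LogisticRegression(max_iter=1000).fit(X_weak[:, :5], y_weak)

# 2. The weak supervisor labels a fresh transfer set; these labels are imperfect.
weak_labels = weak.predict(X_transfer[:, :5])

# 3. "Strong student": a larger model trained only on the weak labels, never ground truth.
strong_student = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300, random_state=0)
strong_student.fit(X_transfer, weak_labels)

# 4. "Strong ceiling": the same large model trained on ground-truth labels instead.
strong_ceiling = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300, random_state=0)
strong_ceiling.fit(X_transfer, y_transfer)

print("weak supervisor :", weak.score(X_test[:, :5], y_test))
print("weak-to-strong  :", strong_student.score(X_test, y_test))
print("strong ceiling  :", strong_ceiling.score(X_test, y_test))
```

How much the strong student outperforms its weak supervisor on a toy task like this will vary; the paper studies the phenomenon at the scale of pretrained language models across NLP, chess, and reward-modeling tasks.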

## The superalignment problem

We believe superintelligence—AI vastly smarter than humans—could be developed within the next ten years. However, we still do not know how to reliably steer and control superhuman AI systems. Solving this problem is essential for ensuring that even the most advanced AI systems in the future remain safe and beneficial to humanity.

We formed the [Superalignment team⁠](https://openai.com/superalignment/) earlier this year to solve this problem of superintelligence alignment. Today, we are releasing the team’s first paper, which introduces a new research direction for empirically aligning superhuman models.

Current alignment methods, such as reinforcement learning from human feedback (RLHF), rely on human supervision. However, future AI systems will be capable of extremely complex and creative behaviors that will make it hard for humans to reliably supervise them. For example, superhuman models may be able to write millions of lines of novel—and potentially dangerous—computer code that would be very hard even for expert humans to understand.

Relative to superhuman AI models, humans will be “weak supervisors.” This is a core challenge for AGI alignment: how can weak supervisors trust and control substantially stronger models?

## Our setup

To make progress on this core challenge, we propose an analogy we can empirically study today: **can we use a smaller (less capable) model to supervise a larger (more capable) model?**
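The linked paper summarizes such experiments with a "performance gap recovered" (PGR) metric: the fraction of the gap between the weak supervisor and the strong ceiling that the weak-to-strong student closes. A short worked sketch (the numbers below are placeholders, not results from the paper):

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weak-to-strong student."""
    return (weak_to_strong - weak) / (strong_ceiling - weak)

# Placeholder numbers: weak supervisor 60%, strong ceiling 90%, student 84%.
print(performance_gap_recovered(0.60, 0.84, 0.90))  # 0.8, i.e. 80% of the gap recovered
```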

![Superalignmentblog Artwork Transparent](https://images.ctfassets.net/kftzwdyauwt9/7f37b3a2-c7fa-4e5d-85517bd29a33/cc8ca2aa652e9edc6e724977f64b2c64/SuperAlignmentBlog_Artwork_Transparent.png?w=3840&q=90&fm=webp)

**A simple analogy for superalignment:** In traditional machine le

... (truncated, 8 KB total)
Resource ID: 3c2487da42fb53cb | Stable ID: ZGZiMDVjMD