Introducing Superalignment
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
This is OpenAI's official announcement of its Superalignment initiative; it is notable because the team was effectively dissolved in mid-2024 following the departures of Leike and Sutskever, raising questions about OpenAI's long-term alignment commitments.
Metadata
Summary
OpenAI announced the formation of its Superalignment team in July 2023, co-led by Ilya Sutskever and Jan Leike, dedicated to solving the problem of aligning superintelligent AI systems within four years. The team aims to build a roughly human-level automated alignment researcher using scalable oversight, automated interpretability, and adversarial testing, backed by 20% of OpenAI's secured compute.
Key Points
- OpenAI committed 20% of its secured compute over four years to the Superalignment team, co-led by Ilya Sutskever and Jan Leike.
- Core technical approach: scalable oversight, generalization research, automated interpretability, robustness testing, and adversarial misalignment detection.
- Goal is to build a human-level automated alignment researcher that can then help iteratively align superintelligent systems.
- Current alignment methods like RLHF are insufficient for superintelligence since humans cannot reliably supervise systems far smarter than themselves.
- OpenAI acknowledged superintelligence could arrive this decade and poses risks including human disempowerment or extinction.
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| Sam Altman | Person | 40.0 |
| AI-Assisted Alignment | Approach | 63.0 |
| AI Alignment Research Agendas | Crux | 69.0 |
| Technical AI Safety Research | Crux | 66.0 |
Cached Content Preview
July 5, 2023
# Introducing Superalignment
###### We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and Jan Leike, and dedicating 20% of the compute we’ve secured to date to this effort. We’re looking for excellent ML researchers and engineers to join us.
* * *
Superintelligence will be the most impactful technology humanity has ever invented, and could help us solve many of the world’s most important problems. But the vast power of superintelligence could also be very dangerous, and could lead to the disempowerment of humanity or even human extinction.
While superintelligence[A](https://openai.com/index/introducing-superalignment/#citation-bottom-A) seems far off now, we believe it could arrive this decade.
Managing these risks will require, among [other things](https://openai.com/index/how-should-ai-systems-behave/), [new institutions for governance](https://openai.com/index/governance-of-superintelligence/) and solving the problem of superintelligence alignment:
_How do we ensure AI systems much smarter than humans follow human intent?_
Currently, we don't have a solution for steering or controlling a potentially superintelligent AI, and preventing it from going rogue. Our current techniques for aligning AI, such as [reinforcement learning from human feedback](https://openai.com/index/instruction-following/), rely on humans’ ability to supervise AI. But humans won’t be able to reliably supervise AI systems much smarter than us,[B](https://openai.com/index/introducing-superalignment/#citation-bottom-B) and so our current alignment techniques will not scale to superintelligence. We need new scientific and technical breakthroughs.
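To make the dependency above concrete, here is a minimal sketch (assuming PyTorch; not from the OpenAI post) of the pairwise preference loss typically used to train RLHF reward models. The only training signal is a human comparison between two responses, so the learned reward can be no more discerning than its human labelers.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen, rejected):
    # `reward_model` maps a batch of responses to scalar rewards, shape (batch,).
    # `chosen` holds responses a human preferred; `rejected` the ones they did not.
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Bradley-Terry pairwise loss: push the preferred response's reward higher.
    # The supervision here is entirely a human judgment, which is why this
    # recipe stops scaling once outputs exceed what humans can evaluate.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```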
#### Our approach
Our goal is to build a roughly human-level [automated alignment researcher](https://openai.com/blog/our-approach-to-alignment-research/). We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence. To align the first automated alignment researcher, we will need to 1) develop a scalable training method, 2) validate the resulting model, and 3) stress test our entire alignment pipeline:
1. To provide a training signal on tasks that are difficult for humans to evaluate, we can leverage AI systems to [assist evaluation of other AI systems](https://openai.com/research/critiques/) (_scalable oversight_; a minimal sketch follows the preview below). In addition, we want to understand and control how our models generalize our oversight to tasks we can't supervise (_generalization_).
2. To validate the alignment of our systems, we [automate search for problema
... (truncated, 9 KB total)
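Below is an illustrative sketch of the scalable-oversight loop from step 1 of the list above. All names (`generator`, `critic`, `judge`) are hypothetical stand-ins for models, not OpenAI's API: a weaker overseer checks an AI-written critique rather than evaluating the strong model's raw answer directly.

```python
from typing import Callable

def oversight_step(
    task: str,
    generator: Callable[[str], str],          # strong model being supervised (hypothetical)
    critic: Callable[[str, str], str],        # assistant model that critiques answers (hypothetical)
    judge: Callable[[str, str, str], bool],   # weak overseer or human proxy (hypothetical)
) -> tuple[str, str, bool]:
    answer = generator(task)                  # hard-to-evaluate output
    critique = critic(task, answer)           # AI assistance: surface candidate flaws
    approved = judge(task, answer, critique)  # overseer verifies the critique, not the raw answer
    return answer, critique, approved
```

The design point is that verifying a pointed critique is usually easier than open-ended evaluation, which is how AI assistance keeps a weaker overseer in the loop as evaluation tasks outgrow direct human judgment.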