Introducing Superalignment

web

OpenAI·openai.com/blog/introducing-superalignment

Credibility Rating

4/5

High(4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

This July 2023 announcement marked a major institutional commitment by OpenAI to long-term alignment research; the Superalignment team was later restructured following the departures of Sutskever and Leike in 2024.

Metadata

Importance: 78/100blog postprimary source

Summary

OpenAI announces the formation of its Superalignment team, co-led by Ilya Sutskever and Jan Leike, with a goal of solving the core technical challenges of superintelligence alignment within four years. The team will dedicate 20% of OpenAI's secured compute to developing scalable oversight, automated interpretability, robustness testing, and adversarial testing methods to align AI systems smarter than humans.

Key Points

•OpenAI commits 20% of its secured compute over four years to the Superalignment team focused on superintelligence alignment.
•Current alignment techniques like RLHF rely on human supervision and will not scale to superintelligent systems.
•Core research pillars: scalable oversight, generalization, automated interpretability, robustness, and adversarial testing.
•The team's primary goal is to build a roughly human-level automated alignment researcher that can iteratively help align superintelligence.
•OpenAI believes superintelligence could arrive this decade, making solving alignment an urgent priority.

Cited by 1 page

Page	Type	Quality
Jan Leike	Person	27.0

Cached Content Preview

HTTP 200Fetched Apr 7, 20267 KB

Introducing Superalignment | OpenAI OpenAI July 5, 2023

 Introducing Superalignment

 We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and Jan Leike, and dedicating 20% of the compute we’ve secured to date to this effort. We’re looking for excellent ML researchers and engineers to join us. 

 Superintelligence will be the most impactful technology humanity has ever invented, and could help us solve many of the world’s most important problems. But the vast power of superintelligence could also be very dangerous, and could lead to the disempowerment of humanity or even human extinction. 

 While superintelligence A seems far off now, we believe it could arrive this decade. 

 Managing these risks will require, among other things ⁠ , new institutions for governance ⁠ and solving the problem of superintelligence alignment: 

 How do we ensure AI systems much smarter than humans follow human intent? 

 Currently, we don&#x27;t have a solution for steering or controlling a potentially superintelligent AI, and preventing it from going rogue. Our current techniques for aligning AI, such as reinforcement learning from human feedback ⁠ , rely on humans’ ability to supervise AI. But humans won’t be able to reliably supervise AI systems much smarter than us, B and so our current alignment techniques will not scale to superintelligence. We need new scientific and technical breakthroughs. 

 Our approach 

 Our goal is to build a roughly human-level automated alignment researcher ⁠ . We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.To align the first automated alignment researcher, we will need to 1) develop a scalable training method, 2) validate the resulting model, and 3) stress test our entire alignment pipeline: 

 To provide a training signal on tasks that are difficult for humans to evaluate, we can leverage AI systems to assist evaluation of other AI systems ⁠ ( scalable oversight). In addition, we want to understand and control how our models generalize our oversight to tasks we can’t supervise ( generalization) . 
 To validate the alignment of our systems, we automate search for problematic behavior ⁠ (opens in a new window) ( robustness) and problematic internals ( automated interpretability ⁠ ). 
 Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments ( adversarial testing ). 
 We expect our research priorities will evolve substantially as we learn more about the problem and we’ll likely add entirely new research areas. We are planning to share more on our roadmap in the future. 

 The new team 

 We are assembling a team of top machine learning researchers and engineers to work on this problem.  

 We are dedicating 20% of the compute we’ve secured 

... (truncated, 7 KB total)

Resource ID: 41000216ddbfc99d | Stable ID: sid_fkT47kAi3Q