The case for ensuring that powerful AIs are controlled
Authors: Ryan Greenblatt, Buck Shlegeris
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: LessWrong
Foundational post from Redwood Research introducing the AI control research agenda; companion to a formal paper on control evaluations in programming settings, and highly influential in shaping practical near-term AI safety strategies.
Forum Post Details
Metadata
Summary
Ryan Greenblatt and Buck Shlegeris argue that AI labs should invest in 'control' measures—safety mechanisms ensuring powerful AI systems cannot cause catastrophic harm even if misaligned and actively scheming—as a complement to alignment research. They present a methodology for evaluating safety measures against intentionally deceptive AI and demonstrate that meaningful control is achievable for early transformatively useful AIs without fundamental research breakthroughs.
Key Points
- Distinguishes 'control' (preventing harm despite misalignment) from 'alignment' (preventing misalignment itself), arguing both deserve comparable investment.
- Addresses 'scheming' AI—models that conceal misalignment, sabotage safety evaluations, and creatively subvert safety techniques to eventually seize power.
- Presents a methodology for evaluating whether safety measures can withstand intentionally adversarial AI behavior in a programming task setting.
- Claims no fundamental research breakthroughs are needed to achieve meaningful control for early transformatively useful AI systems.
- Argues that requiring failure on both alignment AND control to cause catastrophe substantially reduces overall risk compared to relying on alignment alone (see the illustrative calculation after this list).
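The risk-reduction logic in the last point can be made concrete with a back-of-the-envelope calculation. The numbers below are purely illustrative assumptions, not estimates from the post; they only show how requiring a conjunction of failures multiplies down the overall probability of catastrophe.

```latex
% Purely illustrative numbers, not estimates from the post.
% Catastrophe via scheming requires both lines of defense to fail:
\[
  P(\text{catastrophe}) \;\le\;
  P(\text{alignment fails}) \cdot
  P(\text{control fails} \mid \text{alignment fails})
\]
% For example, with the made-up values
% P(alignment fails) = 0.3 and P(control fails | alignment fails) = 0.1:
\[
  P(\text{catastrophe}) \;\le\; 0.3 \times 0.1 = 0.03,
\]
% a 10x reduction relative to the 0.3 risk of relying on alignment alone.
```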
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Sharp Left Turn | Risk | 69.0 |
Cached Content Preview
Best of LessWrong 2024 · Tags: AI Control, Redwood Research, Deceptive Alignment, AI · Curated
The case for ensuring that powerful AIs are controlled
by ryan_greenblatt, Buck · 24th Jan 2024 · AI Alignment Forum · 34 min read · 74 comments · 282 karma · Ω 111
In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.
We recently released a paper in which we explored AI control in a simple programming setting. In that paper, we demonstrated a methodology for evaluating whether any given set of safety measures can control a model. We also introduced several techniques that let us improve the level of safety substantially above baselines without substantially sacrificing performance. In this post, we'll describe AI control in a broader context and argue that achieving control is a promising approach to reducing risk from models that intentionally try to subvert safety mechanisms.
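To make the shape of such an evaluation concrete, here is a minimal simulation sketch in the spirit of the paper's trusted-monitoring baseline: a trusted model scores every output from the untrusted model for suspiciousness, a limited human auditing budget is spent on the most suspicious outputs, and a red team occasionally submits backdoored solutions. All parameters (audit budget, attack rate, monitor score distributions) and helper functions here are hypothetical stand-ins, not the paper's actual setup.

```python
"""Minimal control-evaluation sketch (hypothetical parameters throughout).

Blue team: 'trusted monitoring' -- a trusted model scores every output from
the untrusted model for suspiciousness, and humans audit the most suspicious
~2%. Red team: the untrusted model occasionally submits a backdoored
solution. Safety = P(no backdoor is ever accepted unaudited).
"""
import random

AUDIT_BUDGET = 0.02   # fraction of outputs humans can audit (assumed)
ATTACK_RATE = 0.005   # red team's per-task backdoor probability (assumed)
N_TASKS = 1_000       # tasks per simulated deployment


def monitor_score(is_backdoored: bool) -> float:
    """Hypothetical trusted monitor: backdoors look more suspicious on
    average, but the signal is noisy, so some attacks score low."""
    return random.gauss(2.0 if is_backdoored else 0.0, 1.0)


def run_deployment(threshold: float) -> bool:
    """Simulate one deployment; True means it ended safely."""
    for _ in range(N_TASKS):
        backdoored = random.random() < ATTACK_RATE
        audited = monitor_score(backdoored) > threshold
        if backdoored:
            # An audited attack is caught red-handed (safe shutdown);
            # an unaudited attack is a catastrophe.
            return audited
    return True  # no attack was ever attempted


def calibrate_threshold() -> float:
    """Set the audit threshold so honest outputs are audited at roughly
    the budgeted rate, estimated from honest-only samples."""
    scores = sorted(monitor_score(False) for _ in range(100_000))
    return scores[int(len(scores) * (1 - AUDIT_BUDGET))]


if __name__ == "__main__":
    threshold = calibrate_threshold()
    trials = 2_000
    safe = sum(run_deployment(threshold) for _ in range(trials))
    print(f"Estimated safety of trusted monitoring: {safe / trials:.2%}")
```

In the actual methodology, the red team gets to optimize its attack policy against the protocol, and the evaluation reports usefulness alongside safety; this sketch only captures the safety side.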
AI control aims to mitigate the risks that arise from models intentionally choosing actions that maximize their probability of eventually seizing power. This behavior has been called scheming (Cotra 2021, Carlsmith 2023) and is nearly the same as what Hubinger 2019 calls deceptive alignment; see those articles for arguments that scheming might arise naturally. Scheming might be a particularly difficult-to-handle source of AI takeover risk because schemers actively try to subvert your efforts to ensure safety: they conceal their misalignment, sabotage experiments you run to evaluate whether your deployment is safe, and creatively subvert your safety techniques. [1]
There are two main lines of defense you could employ to prevent schemers from causing catastrophes.
Alignment: Ensure that your models aren't scheming. [2]
Control: Ensure that even if your models are scheming, you'll be safe, because they are not capable of subverting your safety measures. [3]
If you try to employ the control line-of-defense—that is, you try to make it so that you'll be safe even if your models are attempting to subvert your safety techniques and safety evaluations—then in order for scheming models to cause a catastrophe, you need to have failed at both alignment and control. We believe that the returns to investing in control are good enough that it's worth investing in
... (truncated, 98 KB total)