"The case for ensuring that powerful AIs are controlled" (May 2024)
This Redwood Research post by Shlegeris and Greenblatt is a foundational articulation of the 'AI control' research agenda, arguing that labs should prioritize adversarially robust safety measures for near-term powerful models as a complement to long-term alignment work.
Summary
Buck Shlegeris and Ryan Greenblatt argue that AI labs should implement 'AI control' measures that ensure powerful models cannot cause unacceptably bad outcomes even if they are misaligned and actively trying to subvert safety measures. They contend that no fundamental research breakthroughs are needed to achieve control for early transformatively useful AIs, and that this approach substantially reduces risks from scheming/deceptive alignment. The post accompanies a technical paper demonstrating a control evaluation methodology in a programming setting.
Key Points
- AI control requires safety measures that remain robust even if models are misaligned and intentionally try to subvert them, not merely models that appear well-behaved.
- The post targets 'scheming' behavior, where models choose actions to maximize their probability of eventually seizing power while appearing aligned.
- The authors claim control is achievable for early transformative AIs without fundamental research breakthroughs, making it a tractable near-term safety approach.
- Accompanied by a technical paper demonstrating a methodology for evaluating whether safety measures actually control a model in a programming task setting (see the sketch after this list).
- Control is framed as complementary to alignment research: it reduces catastrophic risk even under uncertainty about whether models are truly aligned.
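
To make the evaluation-methodology point concrete, here is a minimal sketch of a "trusted monitoring" protocol of the kind the accompanying paper evaluates: an untrusted model produces code, a weaker but trusted model scores each output for suspicion, and only the most suspicious fraction goes to human audit. All names (`run_protocol`, `Solution`, `audit_budget`) and the 2% audit budget are illustrative assumptions, not the authors' actual code.

```python
# Hypothetical sketch of a trusted-monitoring control protocol.
# `untrusted_model` and `trusted_monitor` stand in for arbitrary callables;
# they are assumptions for illustration, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class Solution:
    task_id: int
    code: str
    suspicion: float  # trusted monitor's score; higher = more suspicious


def run_protocol(tasks, untrusted_model, trusted_monitor, audit_budget=0.02):
    """Generate solutions with the untrusted model, score each with the
    trusted monitor, and route the most suspicious fraction to human audit."""
    solutions = []
    for i, task in enumerate(tasks):
        code = untrusted_model(task)        # possibly subverted output
        score = trusted_monitor(task, code)  # weaker but trusted model
        solutions.append(Solution(i, code, score))

    # Audit only the top `audit_budget` fraction; the rest ship unreviewed.
    solutions.sort(key=lambda s: s.suspicion, reverse=True)
    n_audit = max(1, int(audit_budget * len(solutions)))
    audited, shipped = solutions[:n_audit], solutions[n_audit:]
    return audited, shipped
```

In a control evaluation of such a protocol, a red team would swap in an `untrusted_model` that occasionally tries to insert subtle backdoors, and safety would be measured roughly as the probability that no backdoored solution ships without being audited.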
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Easy | Argument | 53.0 |
| AI Control | Research Area | 75.0 |
| Technical AI Safety Research | Crux | 66.0 |
Cached Content Preview
# [Redwood Research blog](https://blog.redwoodresearch.org/)
# The case for ensuring that powerful AIs are controlled
### Labs should make sure that powerful models can't cause unacceptably bad outcomes even if the AIs try to.
[Buck Shlegeris](https://substack.com/@buckshlegeris) and [Ryan Greenblatt](https://substack.com/@ryangreenblatt)
May 07, 2024
\[Originally posted to [LessWrong](https://www.lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled)\]
In this post, we argue that AI labs should ensure that powerful AIs are _controlled_. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.
We recently released [a paper](https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion) in which we explored AI control in a simple programming setting. In that paper, we demonstrated a methodology for evaluating whether any given set of safety measures can control a model. We
... (truncated, 98 KB total)