Longterm Wiki

"The case for ensuring that powerful AIs are controlled" (May 2024)

This Redwood Research post by Shlegeris and Greenblatt is a foundational articulation of the 'AI control' research agenda, arguing labs should prioritize adversarially robust safety measures for near-term powerful models as a complement to long-term alignment work.

Metadata

Importance: 78/100 · blog post · primary source

Summary

Buck Shlegeris and Ryan Greenblatt argue that AI labs should implement 'AI control' measures ensuring powerful models cannot cause unacceptably bad outcomes even if they are misaligned and actively trying to subvert safety measures. They contend no fundamental research breakthroughs are needed to achieve control for early transformatively useful AIs, and that this approach substantially reduces risks from scheming/deceptive alignment. The post accompanies a technical paper demonstrating control evaluation methodology in a programming setting.

Key Points

  • AI control relies on safety measures that remain robust even if models are misaligned and intentionally try to subvert them, rather than relying on the models themselves being well-behaved.
  • The post targets 'scheming' behavior—where models choose actions to maximize their probability of eventually seizing power while appearing aligned.
  • Authors claim control is achievable for early transformative AIs without fundamental research breakthroughs, making it a near-term tractable safety approach.
  • Accompanied by a technical paper demonstrating methodology for evaluating whether safety measures actually control a model in a programming task setting.
  • Control is framed as complementary to alignment research: it reduces catastrophic risk even under uncertainty about whether models are truly aligned.
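The evaluation methodology summarized above can be illustrated with a toy simulation of a "trusted monitoring" style protocol: an untrusted model occasionally submits a backdoored solution, a weaker trusted monitor flags suspicious outputs, and a limited audit budget determines which flags get human review. This is a hypothetical sketch with made-up rates, not code or numbers from the paper.

```python
import random

def run_episode(rng, n_tasks=1000, attack_rate=0.01,
                audit_budget=0.02, monitor_tpr=0.7, monitor_fpr=0.05):
    """One scenario: return True if no backdoor slips past the blue team.

    All rates here are illustrative assumptions, not figures from the paper.
    """
    audits_left = int(audit_budget * n_tasks)  # limited human auditing
    for _ in range(n_tasks):
        attacked = rng.random() < attack_rate
        flag_p = monitor_tpr if attacked else monitor_fpr
        flagged = rng.random() < flag_p
        if flagged and audits_left > 0:
            audits_left -= 1
            if attacked:
                return True   # attack caught in audit: safe outcome
        elif attacked:
            return False      # backdoor goes through unaudited: unsafe
    return True               # episode ends with no uncaught attack

def estimate_safety(n_episodes=2000, seed=0, **kw):
    """Estimate safety as the fraction of episodes with no uncaught attack."""
    rng = random.Random(seed)
    return sum(run_episode(rng, **kw) for _ in range(n_episodes)) / n_episodes

print(f"estimated safety: {estimate_safety():.2f}")
```

A control evaluation in this spirit asks how low that safety estimate can be driven by an adversarial (red-team) choice of attack policy, under a fixed blue-team protocol and audit budget.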

Cited by 3 pages

| Page | Type | Quality |
| --- | --- | --- |
| Why Alignment Might Be Easy | Argument | 53.0 |
| AI Control | Research Area | 75.0 |
| Technical AI Safety Research | Crux | 66.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# [Redwood Research blog](https://blog.redwoodresearch.org/)

# The case for ensuring that powerful AIs are controlled

### Labs should make sure that powerful models can't cause unacceptably bad outcomes even if the AIs try to.

[Buck Shlegeris](https://substack.com/@buckshlegeris) and [Ryan Greenblatt](https://substack.com/@ryangreenblatt)

May 07, 2024
\[Originally posted to [LessWrong](https://www.lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled)\]

In this post, we argue that AI labs should ensure that powerful AIs are _controlled_. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.

We recently released [a paper](https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion) in which we explored AI control in a simple programming setting. In that paper, we demonstrated a methodology for evaluating whether any given set of safety measures can control a model. We

... (truncated, 98 KB total)
Resource ID: 32c44bb7ba8a1bbe | Stable ID: NmE2MmIwMD