The case for ensuring that powerful AIs are controlled
Authors: Ryan Greenblatt, Buck Shlegeris
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: LessWrong
Foundational post from Redwood Research introducing the AI control research agenda; companion to a formal paper on control evaluations in programming settings, and highly influential in shaping practical near-term AI safety strategies.
Forum Post Details
Metadata
Summary
Ryan Greenblatt and Buck Shlegeris argue that AI labs should invest in 'control' measures—safety mechanisms ensuring powerful AI systems cannot cause catastrophic harm even if misaligned and actively scheming—as a complement to alignment research. They present a methodology for evaluating safety measures against intentionally deceptive AI and demonstrate that meaningful control is achievable for early transformatively useful AIs without fundamental research breakthroughs.
Key Points
- Distinguishes 'control' (preventing harm despite misalignment) from 'alignment' (preventing misalignment itself), arguing both deserve comparable investment.
- Addresses 'scheming' AI—models that conceal misalignment, sabotage safety evaluations, and creatively subvert safety techniques to eventually seize power.
- Presents a methodology for evaluating whether safety measures can withstand intentionally adversarial AI behavior in a programming task setting.
- Claims no fundamental research breakthroughs are needed to achieve meaningful control for early transformatively useful AI systems.
- Argues that requiring failure on both alignment AND control to cause catastrophe substantially reduces overall risk compared to relying on alignment alone (see the illustrative calculation after this list).
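The risk-reduction logic in the last point can be made concrete with a back-of-the-envelope calculation. The numbers below are purely illustrative assumptions, not estimates from the post; they only show how requiring a conjunction of failures multiplies down the overall probability of catastrophe.

```latex
% Purely illustrative numbers, not estimates from the post.
% Catastrophe via scheming requires both lines of defense to fail:
\[
  P(\text{catastrophe}) \;\le\;
  P(\text{alignment fails}) \cdot
  P(\text{control fails} \mid \text{alignment fails})
\]
% For example, with the made-up values
% P(alignment fails) = 0.3 and P(control fails | alignment fails) = 0.1:
\[
  P(\text{catastrophe}) \;\le\; 0.3 \times 0.1 = 0.03,
\]
% a 10x reduction relative to the 0.3 risk of relying on alignment alone.
```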
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Sharp Left Turn | Risk | 69.0 |
Cached Content Preview
Best of LessWrong 2024 · Tags: AI Control, Redwood Research, Deceptive Alignment, AI · Curated
The case for ensuring that powerful AIs are controlled
by ryan_greenblatt, Buck · 24th Jan 2024 · AI Alignment Forum · 34 min read · 74 comments · 282 karma · Ω 111
In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.
We recently released a paper in which we explored AI control in a simple programming setting. In that paper, we demonstrated a methodology for evaluating whether any given set of safety measures can control a model. We also introduced several techniques that let us improve the level of safety substantially above baselines without substantially sacrificing performance. In this post, we'll describe AI control in a broader context and argue that achieving control is a promising approach to reducing risk from models that intentionally try to subvert safety mechanisms.
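To make the shape of such an evaluation concrete, here is a minimal simulation sketch in the spirit of the paper's trusted-monitoring baseline: a trusted model scores every output from the untrusted model for suspiciousness, a limited human auditing budget is spent on the most suspicious outputs, and a red team occasionally submits backdoored solutions. All parameters (audit budget, attack rate, monitor score distributions) and helper functions here are hypothetical stand-ins, not the paper's actual setup.

```python
"""Minimal control-evaluation sketch (hypothetical parameters throughout).

Blue team: 'trusted monitoring' -- a trusted model scores every output from
the untrusted model for suspiciousness, and humans audit the most suspicious
~2%. Red team: the untrusted model occasionally submits a backdoored
solution. Safety = P(no backdoor is ever accepted unaudited).
"""
import random

AUDIT_BUDGET = 0.02   # fraction of outputs humans can audit (assumed)
ATTACK_RATE = 0.005   # red team's per-task backdoor probability (assumed)
N_TASKS = 1_000       # tasks per simulated deployment


def monitor_score(is_backdoored: bool) -> float:
    """Hypothetical trusted monitor: backdoors look more suspicious on
    average, but the signal is noisy, so some attacks score low."""
    return random.gauss(2.0 if is_backdoored else 0.0, 1.0)


def run_deployment(threshold: float) -> bool:
    """Simulate one deployment; True means it ended safely."""
    for _ in range(N_TASKS):
        backdoored = random.random() < ATTACK_RATE
        audited = monitor_score(backdoored) > threshold
        if backdoored:
            # An audited attack is caught red-handed (safe shutdown);
            # an unaudited attack is a catastrophe.
            return audited
    return True  # no attack was ever attempted


def calibrate_threshold() -> float:
    """Set the audit threshold so honest outputs are audited at roughly
    the budgeted rate, estimated from honest-only samples."""
    scores = sorted(monitor_score(False) for _ in range(100_000))
    return scores[int(len(scores) * (1 - AUDIT_BUDGET))]


if __name__ == "__main__":
    threshold = calibrate_threshold()
    trials = 2_000
    safe = sum(run_deployment(threshold) for _ in range(trials))
    print(f"Estimated safety of trusted monitoring: {safe / trials:.2%}")
```

In the actual methodology, the red team gets to optimize its attack policy against the protocol, and the evaluation reports usefulness alongside safety; this sketch only captures the safety side.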
AI control aims to mitigate the risks that arise from models intentionally choosing actions that maximize their probability of eventually seizing power. This behavior has been called scheming (Cotra 2021, Carlsmith 2023) and is nearly the same as what Hubinger 2019 calls deceptive alignment; see those articles for arguments that scheming might arise naturally. Scheming might be a particularly difficult-to-handle source of AI takeover risk because schemers actively try to subvert your efforts to ensure safety: they conceal their misalignment, sabotage experiments you run to evaluate whether your deployment is safe, and creatively subvert your safety techniques. [1]
There are two main lines of defense you could employ to prevent schemers from causing catastrophes.
Alignment: Ensure that your models aren't scheming. [2]
Control: Ensure that even if your models are scheming, you'll be safe, because they are not capable of subverting your safety measures. [3]
If you try to employ the control line-of-defense—that is, you try to make it so that you'll be safe even if your models are attempting to subvert your safety techniques and safety evaluations—then in order for scheming models to cause a catastrophe, you need to have failed at both alignment and control. We believe that the returns to investing in control are good enough that it's worth investing in
... (truncated, 98 KB total)