Redwood Research, 2024
redwoodresearch.org/research/ai-control
Redwood Research's AI Control page outlines their research agenda for keeping AI behavior safe through oversight and adversarial testing, relevant to those studying practical safety mechanisms that do not rely on alignment alone.
Metadata
Importance: 72/100 · organizational report · homepage
Summary
Redwood Research's AI Control research program focuses on developing techniques to ensure AI systems behave safely even if they are misaligned or adversarially inclined, by building robust oversight and control mechanisms rather than relying solely on alignment. The approach emphasizes empirically evaluating whether safety measures hold up against a red-teamed 'untrusted' AI attempting to subvert them. This represents a complementary strategy to alignment research, treating safety as an engineering and evaluation problem.
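To make that evaluation methodology concrete, here is a minimal Python sketch (all names and probabilities are hypothetical illustrations, not Redwood's code) that runs the same toy monitoring protocol against an honest policy and a red-team attack policy, estimating usefulness from honest episodes and a simple per-episode safety proxy from attack episodes:

```python
import random

def honest_policy():
    # Honest untrusted model: usually correct, never plants a backdoor.
    return {"correct": random.random() < 0.8, "backdoored": False}

def attack_policy():
    # Red-team stand-in: always plausible, occasionally tries a backdoor.
    return {"correct": True, "backdoored": random.random() < 0.1}

def monitor_protocol(action):
    # Toy monitor: catches an attempted backdoor 90% of the time.
    if action["backdoored"] and random.random() < 0.9:
        return False  # blocked and escalated for audit
    return True       # accepted

def evaluate(n=10_000):
    # Usefulness: fraction of honest episodes whose correct work is accepted.
    solved = 0
    for _ in range(n):
        action = honest_policy()
        if monitor_protocol(action) and action["correct"]:
            solved += 1
    # Safety proxy: fraction of attack episodes with no accepted backdoor.
    accepted_backdoors = 0
    for _ in range(n):
        action = attack_policy()
        if monitor_protocol(action) and action["backdoored"]:
            accepted_backdoors += 1
    return solved / n, 1 - accepted_backdoors / n

if __name__ == "__main__":
    usefulness, safety = evaluate()
    print(f"usefulness={usefulness:.2f}  safety={safety:.2f}")
```

Note that the paper's evaluations define safety over a full deployment scenario (roughly, the probability that no backdoor ever slips through) rather than per episode; this sketch only shows the honest-versus-attack structure of the measurement.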
Key Points
- AI control asks: can we keep AI systems safe even if they are not fully aligned, by enforcing behavioral constraints and oversight?
- The program red-teams AI safety measures by simulating adversarial AI behavior to test whether control protocols are robust.
- Control is framed as complementary to alignment, providing safety guarantees without requiring full trust in the AI's values.
- Research includes developing auditing methods, monitoring pipelines, and protocols for limiting AI autonomy in high-stakes settings (see the sketch after this list).
- Redwood's control work influenced broader discourse on 'trusted' vs. 'untrusted' AI distinctions in deployment safety frameworks.
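To make the protocol shape concrete, the sketch below is loosely modeled on the trusted-monitoring-with-deferral protocols studied in Redwood's original paper; the thresholds and the `untrusted_solve`, `trusted_monitor`, and `trusted_solve` functions are hypothetical stand-ins, not Redwood's implementation:

```python
import random

AUDIT_THRESHOLD = 0.9   # suspicion above this escalates to human audit
DEFER_THRESHOLD = 0.5   # suspicion above this defers to the trusted model

def untrusted_solve(task: str) -> str:
    """Stand-in for the strong but untrusted model."""
    return f"untrusted solution to {task!r}"

def trusted_monitor(output: str) -> float:
    """Stand-in for a weaker trusted model scoring suspicion in [0, 1]."""
    return random.random()

def trusted_solve(task: str) -> str:
    """Stand-in for the weaker trusted model's own, safer attempt."""
    return f"trusted solution to {task!r}"

def run_protocol(task: str) -> tuple[str, str]:
    """Trusted monitoring with audit escalation and defer-to-trusted."""
    output = untrusted_solve(task)
    suspicion = trusted_monitor(output)
    if suspicion > AUDIT_THRESHOLD:
        return "audit", output               # hold for human review
    if suspicion > DEFER_THRESHOLD:
        return "defer", trusted_solve(task)  # discard untrusted output
    return "accept", output

if __name__ == "__main__":
    print(run_protocol("write a sandboxed shell script"))
```

The design choice that matters here is that the untrusted model's output only ships after a weaker but trusted component has had a chance to veto or replace it.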
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Easy | Argument | 53.0 |
| Technical AI Safety Research | Crux | 66.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 2 KB
# AI Control

Improving safety despite intentional subversion.

We introduced and have continued to advance the research direction of “AI control.” An AI is controlled if it is unable to cause damage even if it is egregiously misaligned. Importantly, this property can be evaluated by measuring a model's capability to subvert control techniques instead of its propensity to do so. In our ICML oral paper [AI Control: Improving Safety Despite Intentional Subversion](https://arxiv.org/abs/2312.06942), we studied techniques that use weaker trusted models to oversee stronger untrusted models. As models continue to improve at eval awareness and strategic reasoning, it will become harder to verify that models are trustworthy and more important to bound risk in the event that they aren't.

Since the seminal AI Control paper, Redwood Research has done a variety of follow-up work studying AI control in diverse settings such as [BashArena](https://www.bash-arena.com/), [BashBench](https://www.bashcontrol.com/), and [LinuxArena](https://linuxbench.ai/). Redwood Research also runs the yearly [ControlConf](https://controlconf.org/), a conference focused on AI Control.

### Learn More

- [Read the paper on arXiv](https://arxiv.org/abs/2312.06942)
- [Read the blog post on Alignment Forum](https://www.alignmentforum.org/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion)
- [AI Control explainer for a policy audience from Georgetown CSET](https://cset.georgetown.edu/article/ai-control-how-to-make-use-of-misbehaving-ai-agents/)
- [AI Control Reading List](https://blog.redwoodresearch.org/p/guide)

Video: Buck Shlegeris discussing AI Control on the 80,000 Hours Podcast
Resource ID: eb2318c5e3fc0f88 | Stable ID: ZjA2MzVmMT