Redwood Research, 2024
redwoodresearch.org/research/ai-control
Redwood Research's AI Control page outlines their research agenda for keeping AI behavior safe through oversight and adversarial testing, relevant to those studying practical safety mechanisms that do not rely on alignment alone.
Metadata
Importance: 72/100 · organizational report · homepage
Summary
Redwood Research's AI Control research program focuses on developing techniques to ensure AI systems behave safely even if they are misaligned or adversarially inclined, by building robust oversight and control mechanisms rather than relying solely on alignment. The approach emphasizes empirically evaluating whether safety measures hold up against a red-teamed 'untrusted' AI attempting to subvert them. This represents a complementary strategy to alignment research, treating safety as an engineering and evaluation problem.
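To make that evaluation methodology concrete, here is a minimal Python sketch (all names and probabilities are hypothetical illustrations, not Redwood's code) that runs the same toy monitoring protocol against an honest policy and a red-team attack policy, estimating usefulness from honest episodes and a simple per-episode safety proxy from attack episodes:

```python
import random

def honest_policy():
    # Honest untrusted model: usually correct, never plants a backdoor.
    return {"correct": random.random() < 0.8, "backdoored": False}

def attack_policy():
    # Red-team stand-in: always plausible, occasionally tries a backdoor.
    return {"correct": True, "backdoored": random.random() < 0.1}

def monitor_protocol(action):
    # Toy monitor: catches an attempted backdoor 90% of the time.
    if action["backdoored"] and random.random() < 0.9:
        return False  # blocked and escalated for audit
    return True       # accepted

def evaluate(n=10_000):
    # Usefulness: fraction of honest episodes whose correct work is accepted.
    solved = 0
    for _ in range(n):
        action = honest_policy()
        if monitor_protocol(action) and action["correct"]:
            solved += 1
    # Safety proxy: fraction of attack episodes with no accepted backdoor.
    accepted_backdoors = 0
    for _ in range(n):
        action = attack_policy()
        if monitor_protocol(action) and action["backdoored"]:
            accepted_backdoors += 1
    return solved / n, 1 - accepted_backdoors / n

if __name__ == "__main__":
    usefulness, safety = evaluate()
    print(f"usefulness={usefulness:.2f}  safety={safety:.2f}")
```

Note that the paper's evaluations define safety over a full deployment scenario (roughly, the probability that no backdoor ever slips through) rather than per episode; this sketch only shows the honest-versus-attack structure of the measurement.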
Key Points
- AI control asks: can we keep AI systems safe even if they are not fully aligned, by enforcing behavioral constraints and oversight?
- The program red-teams AI safety measures by simulating adversarial AI behavior to test whether control protocols are robust.
- Control is framed as complementary to alignment, providing safety guarantees without requiring full trust in the AI's values.
- Research includes developing auditing methods, monitoring pipelines, and protocols for limiting AI autonomy in high-stakes settings (see the sketch after this list).
- Redwood's control work influenced broader discourse on 'trusted' vs. 'untrusted' AI distinctions in deployment safety frameworks.
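To make the protocol shape concrete, the sketch below is loosely modeled on the trusted-monitoring-with-deferral protocols studied in Redwood's original paper; the thresholds and the `untrusted_solve`, `trusted_monitor`, and `trusted_solve` functions are hypothetical stand-ins, not Redwood's implementation:

```python
import random

AUDIT_THRESHOLD = 0.9   # suspicion above this escalates to human audit
DEFER_THRESHOLD = 0.5   # suspicion above this defers to the trusted model

def untrusted_solve(task: str) -> str:
    """Stand-in for the strong but untrusted model."""
    return f"untrusted solution to {task!r}"

def trusted_monitor(output: str) -> float:
    """Stand-in for a weaker trusted model scoring suspicion in [0, 1]."""
    return random.random()

def trusted_solve(task: str) -> str:
    """Stand-in for the weaker trusted model's own, safer attempt."""
    return f"trusted solution to {task!r}"

def run_protocol(task: str) -> tuple[str, str]:
    """Trusted monitoring with audit escalation and defer-to-trusted."""
    output = untrusted_solve(task)
    suspicion = trusted_monitor(output)
    if suspicion > AUDIT_THRESHOLD:
        return "audit", output               # hold for human review
    if suspicion > DEFER_THRESHOLD:
        return "defer", trusted_solve(task)  # discard untrusted output
    return "accept", output

if __name__ == "__main__":
    print(run_protocol("write a sandboxed shell script"))
```

The design choice that matters here is that the untrusted model's output only ships after a weaker but trusted component has had a chance to veto or replace it.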
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Easy | Argument | 53.0 |
| Technical AI Safety Research | Crux | 66.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 2 KB
# AI Control

Improving safety despite intentional subversion.

We introduced and have continued to advance the research direction of “AI control.” An AI is controlled if it is unable to cause damage even if it is egregiously misaligned. Importantly, this property can be evaluated by measuring a model's capability to subvert control techniques instead of its propensity to do so. In our ICML oral paper [AI Control: Improving Safety Despite Intentional Subversion](https://arxiv.org/abs/2312.06942), we studied techniques that use weaker trusted models to oversee stronger untrusted models. As models continue to improve at eval awareness and strategic reasoning, it will become harder to verify that models are trustworthy and more important to bound risk in the event that they aren't.

Since the seminal AI Control paper, Redwood Research has done a variety of follow-up work studying AI control in diverse settings such as [BashArena](https://www.bash-arena.com/), [BashBench](https://www.bashcontrol.com/), and [LinuxArena](https://linuxbench.ai/). Redwood Research also runs the yearly [ControlConf](https://controlconf.org/), a conference focused on AI Control.

### Learn More

- [Read the paper on arXiv](https://arxiv.org/abs/2312.06942)
- [Read the blog post on Alignment Forum](https://www.alignmentforum.org/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion)
- [AI Control explainer for a policy audience from Georgetown CSET](https://cset.georgetown.edu/article/ai-control-how-to-make-use-of-misbehaving-ai-agents/)
- [AI Control Reading List](https://blog.redwoodresearch.org/p/guide)

Video: Buck Shlegeris discussing AI Control on the 80,000 Hours Podcast
Resource ID: eb2318c5e3fc0f88 | Stable ID: ZjA2MzVmMT