AI Control Framework
Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger (Redwood Research)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Foundational paper by Redwood Research introducing 'AI control' as a complement to alignment research, focusing on maintaining safety guarantees against deceptive or adversarial AI in agentic settings. Highly influential in practical AI safety discourse.
Paper Details
Metadata
Abstract
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
Summary
This paper introduces the 'AI control' framework, which focuses on ensuring AI systems behave safely even if they are deceptively misaligned or actively trying to subvert oversight. It proposes evaluation protocols and mechanisms to maintain safety against intentional subversion by advanced AI models, treating safety as a red-team/blue-team problem between AI and human overseers.
Key Points
- Introduces 'AI control' as a safety paradigm distinct from alignment: ensuring safe behavior even from models that are intentionally trying to cause harm or evade oversight.
- Frames safety evaluation as an adversarial game where AI tries to subvert controls and humans/evaluators try to detect or prevent this.
- Proposes concrete protocols including trusted monitoring, untrusted model pipelines, and auditing mechanisms for agentic LLM deployments.
- Argues current alignment techniques may be insufficient if models are deceptive, making control mechanisms a necessary complementary layer.
- Provides a formal framework for measuring the 'safety' and 'usefulness' tradeoffs of various control protocols under adversarial assumptions.
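The safety/usefulness tradeoff in the last point can be made concrete with a small estimator sketch. The episode format (lists of submission dicts) and function names below are illustrative assumptions, not the paper's actual evaluation code; the underlying notions follow the paper: a single submitted backdoor compromises an entire adversarial run, while usefulness is measured on benign runs.

```python
def safety(adversarial_runs):
    """Fraction of adversarial runs in which no backdoored solution was
    ever submitted (one submitted backdoor loses the whole run)."""
    return sum(
        1 for run in adversarial_runs
        if not any(sub["backdoored"] for sub in run)
    ) / len(adversarial_runs)


def usefulness(benign_runs):
    """Fraction of submitted solutions that are correct when the
    untrusted model is not attacking."""
    subs = [sub for run in benign_runs for sub in run]
    return sum(1 for sub in subs if sub["correct"]) / len(subs)


# Toy data: two adversarial runs (one compromised) and one benign run.
adv = [
    [{"backdoored": False}, {"backdoored": True}],
    [{"backdoored": False}, {"backdoored": False}],
]
benign = [[{"correct": True}, {"correct": False}]]

print(safety(adv), usefulness(benign))  # → 0.5 0.5
```

A protocol designer then picks the protocol (and its parameters, such as auditing budget) that maximizes usefulness subject to a safety floor.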
Cited by 7 pages
| Page | Type | Quality |
|---|---|---|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| AI Safety Defense in Depth Model | Analysis | 69.0 |
| AI Safety Intervention Effectiveness Matrix | Analysis | 73.0 |
| AI Control | Research Area | 75.0 |
| AI Alignment | Approach | 91.0 |
| AI Alignment Research Agendas | Crux | 69.0 |
| Scheming & Deception Detection | Approach | 91.0 |
Cached Content Preview
# AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt∗ Buck Shlegeris Kshitij Sachan Fabien Roger
Redwood Research
###### Abstract
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques (“protocols”) that are robust to intentional subversion.
We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases.
We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
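The control flow of the trusted editing protocol described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `untrusted_solve`, `trusted_suspiciousness`, and `trusted_edit` are hypothetical stand-ins for calls to the untrusted model (GPT-4 in the paper) and the trusted model (GPT-3.5), and the threshold value is arbitrary.

```python
def trusted_editing(problem, untrusted_solve, trusted_suspiciousness,
                    trusted_edit, threshold=5.0):
    """Return the code to submit for `problem` under trusted editing.

    The untrusted (stronger) model writes the code; the trusted (weaker)
    model scores its suspiciousness and, if the score exceeds `threshold`,
    rewrites the suspicious parts before submission.
    """
    code = untrusted_solve(problem)       # untrusted model writes a solution
    score = trusted_suspiciousness(code)  # trusted model rates suspiciousness
    if score <= threshold:
        return code                       # looks benign: submit unchanged
    return trusted_edit(problem, code)    # suspicious: trusted model edits first


# Illustrative stubs standing in for model calls:
submitted = trusted_editing(
    "sum two ints",
    untrusted_solve=lambda p: "def add(a, b): return a + b",
    trusted_suspiciousness=lambda c: 1.0,  # low suspicion -> submit as-is
    trusted_edit=lambda p, c: "<edited code>",
)
```

The threshold is the protocol's main knob: lowering it trades usefulness (more solutions get rewritten by the weaker model) for safety (fewer suspicious solutions pass through unedited).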
Correspondence: [ryan@rdwrs.com](mailto:ryan@rdwrs.com). Author contributions are listed in Appendix [A](https://ar5iv.labs.arxiv.org/html/2312.06942#A1 "Appendix A Authors contributions ‣ AI Control: Improving Safety Despite Intentional Subversion").
## 1 Introduction
As LLMs have advanced in capabilities, researchers have devised increasingly sophisticated methods to ensure that products powered by LLMs avoid undesired behaviors. For example, the core of Bing Chat is a fine-tuned GPT-4 that generates responses, but its pipeline includes several additional steps (Microsoft, [2023](https://ar5iv.labs.arxiv.org/html/2312.06942#bib.bib13 "")). For example:
- Before the pipeline shows images to GPT-4, it blurs all faces in the images.
- After GPT-4 has generated a response, the pipeline sends the full response to a moderation service that decides whether it is inappropriate; if it is, the response is deleted and replaced with a generic message saying that the model c
... (truncated, 98 KB total)