AI Control Framework
Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger (Redwood Research)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Foundational paper by Redwood Research introducing 'AI control' as a complement to alignment research, focusing on maintaining safety guarantees against deceptive or adversarial AI in agentic settings. Highly influential in practical AI safety discourse.
Paper Details
Metadata
Abstract
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
Summary
This paper introduces the 'AI control' framework, which focuses on ensuring AI systems behave safely even if they are deceptively misaligned or actively trying to subvert oversight. It proposes evaluation protocols and mechanisms to maintain safety against intentional subversion by advanced AI models, treating safety as a red-team/blue-team problem between AI and human overseers.
Key Points
- Introduces 'AI control' as a safety paradigm distinct from alignment: ensuring safe behavior even from models that are intentionally trying to cause harm or evade oversight.
- Frames safety evaluation as an adversarial game where AI tries to subvert controls and humans/evaluators try to detect or prevent this.
- Proposes concrete protocols including trusted monitoring, untrusted model pipelines, and auditing mechanisms for agentic LLM deployments.
- Argues current alignment techniques may be insufficient if models are deceptive, making control mechanisms a necessary complementary layer.
- Provides a formal framework for measuring the 'safety' and 'usefulness' tradeoffs of various control protocols under adversarial assumptions.
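The safety/usefulness tradeoff in the last point can be made concrete with a small estimator sketch. The episode format (lists of submission dicts) and function names below are illustrative assumptions, not the paper's actual evaluation code; the underlying notions follow the paper: a single submitted backdoor compromises an entire adversarial run, while usefulness is measured on benign runs.

```python
def safety(adversarial_runs):
    """Fraction of adversarial runs in which no backdoored solution was
    ever submitted (one submitted backdoor loses the whole run)."""
    return sum(
        1 for run in adversarial_runs
        if not any(sub["backdoored"] for sub in run)
    ) / len(adversarial_runs)


def usefulness(benign_runs):
    """Fraction of submitted solutions that are correct when the
    untrusted model is not attacking."""
    subs = [sub for run in benign_runs for sub in run]
    return sum(1 for sub in subs if sub["correct"]) / len(subs)


# Toy data: two adversarial runs (one compromised) and one benign run.
adv = [
    [{"backdoored": False}, {"backdoored": True}],
    [{"backdoored": False}, {"backdoored": False}],
]
benign = [[{"correct": True}, {"correct": False}]]

print(safety(adv), usefulness(benign))  # → 0.5 0.5
```

A protocol designer then picks the protocol (and its parameters, such as auditing budget) that maximizes usefulness subject to a safety floor.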
Cited by 7 pages
| Page | Type | Quality |
|---|---|---|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| AI Safety Defense in Depth Model | Analysis | 69.0 |
| AI Safety Intervention Effectiveness Matrix | Analysis | 73.0 |
| AI Control | Research Area | 75.0 |
| AI Alignment | Approach | 91.0 |
| AI Alignment Research Agendas | Crux | 69.0 |
| Scheming & Deception Detection | Approach | 91.0 |
Cached Content Preview
# AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt∗ Buck Shlegeris Kshitij Sachan Fabien Roger
Redwood Research
###### Abstract
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques (“protocols”) that are robust to intentional subversion.
We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases.
We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
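The control flow of the trusted editing protocol described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `untrusted_solve`, `trusted_suspiciousness`, and `trusted_edit` are hypothetical stand-ins for calls to the untrusted model (GPT-4 in the paper) and the trusted model (GPT-3.5), and the threshold value is arbitrary.

```python
def trusted_editing(problem, untrusted_solve, trusted_suspiciousness,
                    trusted_edit, threshold=5.0):
    """Return the code to submit for `problem` under trusted editing.

    The untrusted (stronger) model writes the code; the trusted (weaker)
    model scores its suspiciousness and, if the score exceeds `threshold`,
    rewrites the suspicious parts before submission.
    """
    code = untrusted_solve(problem)       # untrusted model writes a solution
    score = trusted_suspiciousness(code)  # trusted model rates suspiciousness
    if score <= threshold:
        return code                       # looks benign: submit unchanged
    return trusted_edit(problem, code)    # suspicious: trusted model edits first


# Illustrative stubs standing in for model calls:
submitted = trusted_editing(
    "sum two ints",
    untrusted_solve=lambda p: "def add(a, b): return a + b",
    trusted_suspiciousness=lambda c: 1.0,  # low suspicion -> submit as-is
    trusted_edit=lambda p, c: "<edited code>",
)
```

The threshold is the protocol's main knob: lowering it trades usefulness (more solutions get rewritten by the weaker model) for safety (fewer suspicious solutions pass through unedited).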
Correspondence: [ryan@rdwrs.com](mailto:ryan@rdwrs.com). Author contributions are listed in Appendix [A](https://ar5iv.labs.arxiv.org/html/2312.06942#A1 "Appendix A Authors contributions ‣ AI Control: Improving Safety Despite Intentional Subversion").
## 1 Introduction
As LLMs have advanced in capabilities, researchers have devised increasingly sophisticated methods to ensure that products powered by LLMs avoid undesired behaviors. For example, the core of Bing Chat is a fine-tuned GPT-4 that generates responses, but its pipeline includes several additional steps (Microsoft, [2023](https://ar5iv.labs.arxiv.org/html/2312.06942#bib.bib13 "")). For example:
- Before the pipeline shows images to GPT-4, it blurs all faces in the images.
- After GPT-4 has generated a response, the pipeline sends the full response to a moderation service that decides whether it is inappropriate; if it is, the response is deleted and replaced with a generic message saying that the model c
... (truncated, 98 KB total)