Redwood Research's AI Control paper (December 2023)
Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Foundational paper establishing the 'AI control' research agenda at Redwood Research; highly influential for thinking about scalable oversight and safety techniques robust to adversarial model behavior during deployment.
Paper Details
Metadata
Abstract
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
Summary
This paper introduces 'AI control' as a framework for maintaining safety guarantees when deploying powerful but potentially misaligned models, developing protocols that combine weaker trusted models with stronger untrusted models. Two main protocols—trusted editing and untrusted monitoring—are evaluated against adversarial strategies where GPT-4 actively attempts to insert undetectable backdoors into code. Both protocols substantially outperform simple baselines, demonstrating that useful work can be extracted from untrusted models while preserving safety.
Key Points
- Introduces the 'AI control' problem: ensuring safety even when a deployed model is actively trying to subvert safety mechanisms, not just passively misaligned.
- Tests protocols using GPT-4 (powerful, untrusted) + GPT-3.5 (weaker, trusted) on programming tasks, measuring backdoor insertion rates vs. task performance.
- Trusted editing protocol: GPT-3.5 rates suspiciousness of GPT-4 code and edits flagged solutions, reducing backdoor risk without fully discarding GPT-4's capability.
- Untrusted monitoring protocol: multiple GPT-4 instances audit each other's outputs with collusion-prevention measures, leveraging the untrusted model's own capability.
- Demonstrates a safety-usefulness frontier, showing meaningful safety gains are achievable without sacrificing all task performance benefits from stronger models.
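The trusted-editing protocol above can be sketched as a simple control loop. This is an illustrative reconstruction, not the paper's code: the model calls (`untrusted_write`, `trusted_rate`, `trusted_edit`) and the threshold value are hypothetical stand-ins for GPT-4, GPT-3.5, and a tuned suspiciousness cutoff.

```python
# Sketch of the trusted-editing protocol; all model calls are stand-ins.
SUSPICION_THRESHOLD = 0.5  # assumed value; the paper tunes this trade-off


def untrusted_write(problem: str) -> str:
    """Stand-in for the untrusted model (GPT-4) writing a solution."""
    return f"# solution for: {problem}\ndef solve():\n    pass\n"


def trusted_rate(code: str) -> float:
    """Stand-in for the trusted model (GPT-3.5) rating suspiciousness in [0, 1]."""
    return 0.9 if "backdoor" in code else 0.1


def trusted_edit(code: str) -> str:
    """Stand-in for the trusted model rewriting the parts that look suspicious."""
    return code.replace("backdoor", "edited")


def trusted_editing(problem: str) -> str:
    code = untrusted_write(problem)
    if trusted_rate(code) <= SUSPICION_THRESHOLD:
        return code            # low suspicion: submit the untrusted code as-is
    return trusted_edit(code)  # high suspicion: submit the trusted edit instead
```

The key design choice is that the trusted model's edit, not a refusal, is the fallback: some of the stronger model's work survives even when it is flagged, which is what keeps the protocol on a useful point of the safety-usefulness frontier.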
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI Safety Intervention Effectiveness Matrix | Analysis | 73.0 |
| Redwood Research | Organization | 78.0 |
| Technical AI Safety Research | Crux | 66.0 |
Cached Content Preview
# AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt∗ Buck Shlegeris Kshitij Sachan Fabien Roger
Redwood Research
###### Abstract
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques (“protocols”) that are robust to intentional subversion.
We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases.
We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
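The untrusted-monitoring protocol described above can also be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the model calls, the comment-stripping paraphrase step (one possible collusion countermeasure), and the audit fraction are all hypothetical stand-ins.

```python
# Sketch of untrusted monitoring with a limited trusted-audit budget.
# All model calls are stand-ins, not real API calls.
AUDIT_FRACTION = 0.02  # assumed: only a small share of outputs can be audited


def untrusted_write(problem: str) -> str:
    """Stand-in for the untrusted model (GPT-4) writing a solution."""
    return f"def solve():  # {problem}\n    return 0\n"


def paraphrase(code: str) -> str:
    """Stand-in collusion countermeasure: strip comments so the writer
    cannot smuggle a 'do not flag me' signal through to the monitor."""
    return "\n".join(line.split("#")[0].rstrip() for line in code.splitlines())


def untrusted_monitor(code: str) -> float:
    """Stand-in for a second untrusted-model instance rating backdoor risk."""
    return 0.9 if "eval(" in code else 0.1


def run_protocol(problems: list[str]):
    scored = [(untrusted_monitor(paraphrase(c)), p, c)
              for p in problems
              for c in [untrusted_write(p)]]
    # Send only the most suspicious fraction to scarce trusted (human) audit.
    scored.sort(key=lambda t: t[0], reverse=True)
    n_audit = max(1, int(AUDIT_FRACTION * len(scored)))
    return scored[:n_audit], scored[n_audit:]  # (audited, submitted)
```

The point of the paraphrase step is that the monitor is itself untrusted, so the protocol must break any covert channel between the writer and the monitor rather than assume the monitor is honest.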
Correspondence: [ryan@rdwrs.com](mailto:ryan@rdwrs.com). Author contributions are listed in [Appendix A](https://ar5iv.labs.arxiv.org/html/2312.06942#A1).
## 1 Introduction
As LLMs have advanced in capabilities, researchers have devised increasingly sophisticated methods to ensure that products powered by LLMs avoid undesired behaviors. For example, the core of Bing Chat is a fine-tuned GPT-4 that generates responses, but its pipeline includes several other steps (Microsoft, [2023](https://ar5iv.labs.arxiv.org/html/2312.06942#bib.bib13)), for example:
- Before the pipeline shows images to GPT-4, it blurs all faces in the images.
- After GPT-4 has generated a response, the pipeline sends the full response to a moderation service that decides whether it is inappropriate; if it is, the response is deleted and replaced with a generic message saying that the model c
... (truncated, 98 KB total)