Redwood Research's AI Control paper (December 2023)
Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Foundational paper establishing the 'AI control' research agenda at Redwood Research; highly influential for thinking about scalable oversight and safety techniques robust to adversarial model behavior during deployment.
Paper Details
Metadata
Abstract
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
Summary
This paper introduces 'AI control' as a framework for maintaining safety guarantees when deploying powerful but potentially misaligned models, developing protocols that combine weaker trusted models with stronger untrusted models. Two main protocols—trusted editing and untrusted monitoring—are evaluated against adversarial strategies where GPT-4 actively attempts to insert undetectable backdoors into code. Both protocols substantially outperform simple baselines, demonstrating that useful work can be extracted from untrusted models while preserving safety.
Key Points
- Introduces the 'AI control' problem: ensuring safety even when a deployed model is actively trying to subvert safety mechanisms, not just passively misaligned.
- Tests protocols using GPT-4 (powerful, untrusted) + GPT-3.5 (weaker, trusted) on programming tasks, measuring backdoor insertion rates vs. task performance.
- Trusted editing protocol: GPT-3.5 rates suspiciousness of GPT-4 code and edits flagged solutions, reducing backdoor risk without fully discarding GPT-4's capability.
- Untrusted monitoring protocol: multiple GPT-4 instances audit each other's outputs with collusion-prevention measures, leveraging the untrusted model's own capability.
- Demonstrates a safety-usefulness frontier, showing meaningful safety gains are achievable without sacrificing all task performance benefits from stronger models.
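The trusted-editing protocol above can be sketched as a simple control loop. This is an illustrative reconstruction, not the paper's code: the model calls (`untrusted_write`, `trusted_rate`, `trusted_edit`) and the threshold value are hypothetical stand-ins for GPT-4, GPT-3.5, and a tuned suspiciousness cutoff.

```python
# Sketch of the trusted-editing protocol; all model calls are stand-ins.
SUSPICION_THRESHOLD = 0.5  # assumed value; the paper tunes this trade-off


def untrusted_write(problem: str) -> str:
    """Stand-in for the untrusted model (GPT-4) writing a solution."""
    return f"# solution for: {problem}\ndef solve():\n    pass\n"


def trusted_rate(code: str) -> float:
    """Stand-in for the trusted model (GPT-3.5) rating suspiciousness in [0, 1]."""
    return 0.9 if "backdoor" in code else 0.1


def trusted_edit(code: str) -> str:
    """Stand-in for the trusted model rewriting the parts that look suspicious."""
    return code.replace("backdoor", "edited")


def trusted_editing(problem: str) -> str:
    code = untrusted_write(problem)
    if trusted_rate(code) <= SUSPICION_THRESHOLD:
        return code            # low suspicion: submit the untrusted code as-is
    return trusted_edit(code)  # high suspicion: submit the trusted edit instead
```

The key design choice is that the trusted model's edit, not a refusal, is the fallback: some of the stronger model's work survives even when it is flagged, which is what keeps the protocol on a useful point of the safety-usefulness frontier.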
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI Safety Intervention Effectiveness Matrix | Analysis | 73.0 |
| Redwood Research | Organization | 78.0 |
| Technical AI Safety Research | Crux | 66.0 |
Cached Content Preview
# AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt∗ Buck Shlegeris Kshitij Sachan Fabien Roger
Redwood Research
###### Abstract
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques (“protocols”) that are robust to intentional subversion.
We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases.
We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
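The untrusted-monitoring protocol described above can also be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the model calls, the comment-stripping paraphrase step (one possible collusion countermeasure), and the audit fraction are all hypothetical stand-ins.

```python
# Sketch of untrusted monitoring with a limited trusted-audit budget.
# All model calls are stand-ins, not real API calls.
AUDIT_FRACTION = 0.02  # assumed: only a small share of outputs can be audited


def untrusted_write(problem: str) -> str:
    """Stand-in for the untrusted model (GPT-4) writing a solution."""
    return f"def solve():  # {problem}\n    return 0\n"


def paraphrase(code: str) -> str:
    """Stand-in collusion countermeasure: strip comments so the writer
    cannot smuggle a 'do not flag me' signal through to the monitor."""
    return "\n".join(line.split("#")[0].rstrip() for line in code.splitlines())


def untrusted_monitor(code: str) -> float:
    """Stand-in for a second untrusted-model instance rating backdoor risk."""
    return 0.9 if "eval(" in code else 0.1


def run_protocol(problems: list[str]):
    scored = [(untrusted_monitor(paraphrase(c)), p, c)
              for p in problems
              for c in [untrusted_write(p)]]
    # Send only the most suspicious fraction to scarce trusted (human) audit.
    scored.sort(key=lambda t: t[0], reverse=True)
    n_audit = max(1, int(AUDIT_FRACTION * len(scored)))
    return scored[:n_audit], scored[n_audit:]  # (audited, submitted)
```

The point of the paraphrase step is that the monitor is itself untrusted, so the protocol must break any covert channel between the writer and the monitor rather than assume the monitor is honest.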
Correspondence: [ryan@rdwrs.com](mailto:ryan@rdwrs.com). Author contributions are listed in [Appendix A](https://ar5iv.labs.arxiv.org/html/2312.06942#A1).
## 1 Introduction
As LLMs have advanced in capabilities, researchers have devised increasingly sophisticated methods to ensure that products powered by LLMs avoid undesired behaviors. For example, the core of Bing Chat is a fine-tuned GPT-4 that generates responses, but its pipeline includes several other steps (Microsoft, [2023](https://ar5iv.labs.arxiv.org/html/2312.06942#bib.bib13)), for example:
- Before the pipeline shows images to GPT-4, it blurs all faces in the images.
- After GPT-4 has generated a response, the pipeline sends the full response to a moderation service that decides whether it is inappropriate; if it is, the response is deleted and replaced with a generic message saying that the model c
... (truncated, 98 KB total)