A Sketch of an AI Control Safety Case - arXiv
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Co-authored by UK AI Safety Institute and Redwood Research, this paper is a key practical contribution to the emerging 'safety cases' literature, offering a concrete template for how AI developers could argue safe deployment of capable agents under the AI control paradigm.
Paper Details
Metadata
Abstract
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation," where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
Summary
This paper presents a framework for constructing 'control safety cases'—structured, evidence-based arguments that AI systems cannot subvert safety measures to cause harm. Using data exfiltration as a case study, the authors demonstrate how red-team evaluations, control measure robustness, and conservative extrapolation can together justify safe deployment of capable LLM agents. The work bridges dangerous capability evaluations with AI control methodology to produce concrete, auditable safety arguments.
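To make the control-measure side of this argument concrete, the following is a minimal, hypothetical sketch of a monitor-plus-audit gate of the kind the paper's control measures (monitoring and human auditing) refer to. The `AgentAction` type, the suspicion monitor, the audit budget, and the threshold calibration below are illustrative assumptions, not the paper's actual design.

```python
"""Illustrative only: a toy monitor-plus-audit gate, standing in for the
control measures (monitoring, human auditing) the paper discusses."""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentAction:
    task_id: str
    content: str  # e.g. a command or code change the agent proposes to run


def calibrate_threshold(benign_scores: List[float], audit_budget: float) -> float:
    """Choose a suspicion threshold such that roughly `audit_budget` (say 2%)
    of ordinary, benign actions get escalated to human auditors."""
    ranked = sorted(benign_scores, reverse=True)
    cutoff = max(0, int(audit_budget * len(ranked)) - 1)
    return ranked[cutoff]


def control_gate(action: AgentAction,
                 monitor: Callable[[AgentAction], float],
                 threshold: float) -> str:
    """Execute low-suspicion actions; hold high-suspicion ones for human audit."""
    return "audit" if monitor(action) >= threshold else "execute"
```

In a setup like this, the control evaluation asks whether red-team substitute models can exfiltrate data while keeping every action's suspicion score below the calibrated threshold.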
Key Points
- Introduces 'control safety cases' as a structured framework for arguing that LLM agents cannot subvert monitoring and control measures to cause unacceptable outcomes.
- Uses a concrete case study: arguing that an LLM agent deployed at an AI company won't exfiltrate sensitive intellectual property to the public internet.
- The safety case relies on three claims: red teams adequately elicit model capabilities, control measures hold in deployment, and developers conservatively extrapolate performance (see the extrapolation sketch after this list).
- Builds on Greenblatt et al.'s AI control methodology, assuming models pursue their most effective strategies to cause harm, which is a conservative and robust framing.
- Represents a step toward scalable, evidence-based risk management for frontier AI, complementing 'inability' safety cases grounded in capability evaluations.
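As a rough illustration of the extrapolation step (claim 3), the sketch below bounds a per-task exfiltration probability from a hypothetical control evaluation over the 2,000-task deployment proxy mentioned in Figure 1 and then scales it naively to a larger deployment. The zero-success assumption, the 95% confidence level, and the deployment size are placeholders; the paper's own extrapolation is more careful than this union-bound style calculation.

```python
"""Illustrative only: conservative extrapolation from a control evaluation
to deployment-scale risk. All numbers are placeholders, not paper results."""
import math


def per_task_upper_bound(proxy_tasks: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-task exfiltration probability when the red team
    achieved zero successes across `proxy_tasks` proxy tasks (the classic
    "rule of three": p <= -ln(1 - confidence) / n). Nonzero successes would
    instead call for an exact binomial (Clopper-Pearson) bound."""
    return -math.log(1.0 - confidence) / proxy_tasks


def deployment_risk(per_task_p: float, deployment_tasks: int) -> float:
    """Naive bound on P(at least one exfiltration), treating deployment tasks
    as independent draws at the per-task rate."""
    return 1.0 - (1.0 - per_task_p) ** deployment_tasks


p = per_task_upper_bound(proxy_tasks=2_000)          # proxy size from Figure 1
risk = deployment_risk(p, deployment_tasks=20_000)   # assumed deployment size
print(f"per-task bound ~{p:.4f}; naive deployment-scale bound ~{risk:.2f}")
```

Even this toy calculation shows why the extrapolation must be conservative: a bound that looks small per task can become vacuous when compounded across a long deployment, which is part of why the proxy environment and the statistical treatment both matter in the paper's argument.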
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Redwood Research | Organization | 78.0 |
Cached Content Preview
[2501.17315] A sketch of an AI control safety case
A sketch of an AI control safety case
Tomek Korbak tomek.korbak@dsit.gov.uk
UK AI Safety Institute
Joshua Clymer joshua@rdwrs.com
Redwood Research
Benjamin Hilton benjamin.hilton@dsit.gov.uk
UK AI Safety Institute
Buck Shlegeris buck@rdwrs.com
Redwood Research
Geoffrey Irving geoffrey.irving@dsit.gov.uk
UK AI Safety Institute
Abstract
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a “control safety case”, which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won’t exfiltrate sensitive information. The sketch relies on evidence from a “control evaluation,” where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
Figure 1 : An illustration of the control evaluation that provides the core evidence used in our control safety case sketch. During the control evaluation, a red team constructs adversarial substitute models to cause unacceptable outcomes in a deployment proxy consisting of 2,000 AI research tasks, followed by extrapolation to a larger deployment setting.
Executive Summary
Safety cases. Safety cases (structured, evidence-based arguments that a system is safe) are emerging as a scalable risk management tool for frontier AI (Buhl et al., 2024; Clymer et al., 2024; Irving, 2024). Currently, many frontier AI developers assess risk with dangerous capability evaluations (Anthropic, 2024a; OpenAI, 2024b; Phuong et al., 2024). Evidence from these evaluations can substantiate ‘inability’ safety cases, which assert that models cannot, for example, aid biological weapons development (Mouton et al., 2024), sabotage internal safety procedures (Benton et al., 2024), or execute cyber attacks (Goemans et al., 2024); however, LLMs might eventually gain dangerous capabilities that developers need to mitigate with safety measures (OpenAI, 2024a; Anthropic, 2024b; Google DeepMind, 2024).
AI control. This paper presents a safety case sketch that argues models are safe because of measures such as monitoring and human auditing. The arguments rest on the “AI control”
... (truncated, 98 KB total)