A Sketch of an AI Control Safety Case - arXiv
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Co-authored by UK AI Safety Institute and Redwood Research, this paper is a key practical contribution to the emerging 'safety cases' literature, offering a concrete template for how AI developers could argue safe deployment of capable agents under the AI control paradigm.
Paper Details
Metadata
Abstract
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation," where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
Summary
This paper presents a framework for constructing 'control safety cases'—structured, evidence-based arguments that AI systems cannot subvert safety measures to cause harm. Using data exfiltration as a case study, the authors demonstrate how red-team evaluations, control measure robustness, and conservative extrapolation can together justify safe deployment of capable LLM agents. The work bridges dangerous capability evaluations with AI control methodology to produce concrete, auditable safety arguments.
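To make the control-measure side of this argument concrete, the following is a minimal, hypothetical sketch of a monitor-plus-audit gate of the kind the paper's control measures (monitoring and human auditing) refer to. The `AgentAction` type, the suspicion monitor, the audit budget, and the threshold calibration below are illustrative assumptions, not the paper's actual design.

```python
"""Illustrative only: a toy monitor-plus-audit gate, standing in for the
control measures (monitoring, human auditing) the paper discusses."""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentAction:
    task_id: str
    content: str  # e.g. a command or code change the agent proposes to run


def calibrate_threshold(benign_scores: List[float], audit_budget: float) -> float:
    """Choose a suspicion threshold such that roughly `audit_budget` (say 2%)
    of ordinary, benign actions get escalated to human auditors."""
    ranked = sorted(benign_scores, reverse=True)
    cutoff = max(0, int(audit_budget * len(ranked)) - 1)
    return ranked[cutoff]


def control_gate(action: AgentAction,
                 monitor: Callable[[AgentAction], float],
                 threshold: float) -> str:
    """Execute low-suspicion actions; hold high-suspicion ones for human audit."""
    return "audit" if monitor(action) >= threshold else "execute"
```

In a setup like this, the control evaluation asks whether red-team substitute models can exfiltrate data while keeping every action's suspicion score below the calibrated threshold.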
Key Points
- Introduces 'control safety cases' as a structured framework for arguing that LLM agents cannot subvert monitoring and control measures to cause unacceptable outcomes.
- Uses a concrete case study: arguing that an LLM agent deployed at an AI company won't exfiltrate sensitive intellectual property to the public internet.
- The safety case relies on three claims: red teams adequately elicit model capabilities, control measures hold in deployment, and developers conservatively extrapolate performance (see the extrapolation sketch after this list).
- Builds on Greenblatt et al.'s AI control methodology, assuming models pursue their most effective strategies to cause harm, which is a conservative and robust framing.
- Represents a step toward scalable, evidence-based risk management for frontier AI, complementing 'inability' safety cases grounded in capability evaluations.
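As a rough illustration of the extrapolation step (claim 3), the sketch below bounds a per-task exfiltration probability from a hypothetical control evaluation over the 2,000-task deployment proxy mentioned in Figure 1 and then scales it naively to a larger deployment. The zero-success assumption, the 95% confidence level, and the deployment size are placeholders; the paper's own extrapolation is more careful than this union-bound style calculation.

```python
"""Illustrative only: conservative extrapolation from a control evaluation
to deployment-scale risk. All numbers are placeholders, not paper results."""
import math


def per_task_upper_bound(proxy_tasks: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-task exfiltration probability when the red team
    achieved zero successes across `proxy_tasks` proxy tasks (the classic
    "rule of three": p <= -ln(1 - confidence) / n). Nonzero successes would
    instead call for an exact binomial (Clopper-Pearson) bound."""
    return -math.log(1.0 - confidence) / proxy_tasks


def deployment_risk(per_task_p: float, deployment_tasks: int) -> float:
    """Naive bound on P(at least one exfiltration), treating deployment tasks
    as independent draws at the per-task rate."""
    return 1.0 - (1.0 - per_task_p) ** deployment_tasks


p = per_task_upper_bound(proxy_tasks=2_000)          # proxy size from Figure 1
risk = deployment_risk(p, deployment_tasks=20_000)   # assumed deployment size
print(f"per-task bound ~{p:.4f}; naive deployment-scale bound ~{risk:.2f}")
```

Even this toy calculation shows why the extrapolation must be conservative: a bound that looks small per task can become vacuous when compounded across a long deployment, which is part of why the proxy environment and the statistical treatment both matter in the paper's argument.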
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Redwood Research | Organization | 78.0 |
Cached Content Preview
[2501.17315] A sketch of an AI control safety case
A sketch of an AI control safety case
Tomek Korbak tomek.korbak@dsit.gov.uk
UK AI Safety Institute
Joshua Clymer joshua@rdwrs.com
Redwood Research
Benjamin Hilton benjamin.hilton@dsit.gov.uk
UK AI Safety Institute
Buck Shlegeris buck@rdwrs.com
Redwood Research
Geoffrey Irving geoffrey.irving@dsit.gov.uk
UK AI Safety Institute
Abstract
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a “control safety case”, which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won’t exfiltrate sensitive information. The sketch relies on evidence from a “control evaluation,” where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
Figure 1 : An illustration of the control evaluation that provides the core evidence used in our control safety case sketch. During the control evaluation, a red team constructs adversarial substitute models to cause unacceptable outcomes in a deployment proxy consisting of 2,000 AI research tasks, followed by extrapolation to a larger deployment setting.
Executive Summary
Safety cases. Safety cases (structured, evidence-based arguments that a system is safe) are emerging as a scalable risk management tool for frontier AI (Buhl et al., 2024; Clymer et al., 2024; Irving, 2024). Currently, many frontier AI developers assess risk with dangerous capability evaluations (Anthropic, 2024a; OpenAI, 2024b; Phuong et al., 2024). Evidence from these evaluations can substantiate ‘inability’ safety cases, which assert that models cannot, for example, aid biological weapons development (Mouton et al., 2024), sabotage internal safety procedures (Benton et al., 2024), or execute cyber attacks (Goemans et al., 2024); however, LLMs might eventually gain dangerous capabilities that developers need to mitigate with safety measures (OpenAI, 2024a; Anthropic, 2024b; Google DeepMind, 2024).
AI control. This paper presents a safety case sketch that argues models are safe because of measures such as monitoring and human auditing. The arguments rest on the “AI control”
... (truncated, 98 KB total)