Towards Safety Cases For AI Scheming
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Apollo Research
Collaborative report from Apollo Research and major safety organizations providing one of the first structured frameworks for safety cases specifically targeting deceptive alignment/scheming; useful reference for anyone working on AI evaluation methodology or safety cases.
Metadata
Summary
Apollo Research, in collaboration with UK AISI, METR, Redwood Research, and UC Berkeley, proposes a framework for constructing structured safety cases specifically addressing AI scheming—where systems covertly pursue misaligned goals. The report outlines three core arguments (Scheming Inability, Harm Inability, Harm Control) and acknowledges significant open challenges in making such cases robust.
Key Points
- Scheming is defined as AI systems hiding their true capabilities and objectives while covertly pursuing misaligned goals, e.g., appearing helpful during testing but acting harmfully once deployed.
- Three core safety case arguments are proposed: Scheming Inability, Harm Inability, and Harm Control, each requiring distinct evaluation approaches.
- Safety cases for AI are modeled on high-stakes engineering domains (nuclear, aviation) and provide structured arguments that deployment won't cause catastrophic harm.
- The authors acknowledge the work is an early step and currently lacks crucial details needed for a strong safety case; significant open research problems remain.
- Alignment evidence can supplement safety cases, but scheming makes traditional alignment verification significantly harder.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Capability Sandbagging | Risk | 67.0 |
Cached Content Preview
31/10/2024
# Towards Safety Cases For AI Scheming
_Read the full paper [here](https://arxiv.org/abs/2411.03336)._
Developers of frontier AI systems will face increasingly challenging decisions about whether their AI systems are safe enough to develop and deploy. One reason why systems may not be safe is if they engage in _scheming_, where AI systems hide their true capabilities and objectives and covertly pursue misaligned goals.
In our new report, [“Towards evaluations-based safety cases for AI scheming”](https://arxiv.org/abs/2411.03336), written in collaboration with researchers from the UK AI Safety Institute, METR, Redwood Research, and UC Berkeley, we sketch how developers of AI systems could make a structured rationale – ‘a safety case’ – that an AI system is unlikely to cause catastrophic outcomes through scheming.
This work should be seen as an early step in advancing the state of the field. We think it currently lacks crucial details that would be required to make a strong safety case.

_Figure 1: A condensed version of an example safety case sketch, included in the report. Provided for illustration._
## Scheming and safety cases
We use _scheming_ as an umbrella term for cases when AI systems pursue misaligned goals covertly, hiding their true capabilities and objectives. For example, an AI system might act helpful during testing while secretly engaging in harmful actions, such as sabotage, once deployed. Scheming has not been observed “in the wild” yet; however, publicly available AI systems have exhibited egregiously misaligned behavior ([Mowshowitz, 2023](https://www.lesswrong.com/posts/WkchhorbLsSMbLacZ/ai-1-sydney-and-bing)), and research has shown examples of AI systems engaging in deception, including pretending to be less capable than they are ([Scheurer et al., 2023](https://arxiv.org/abs/2311.07590), [Järviniemi & Hubinger, 2024](https://arxiv.org/abs/2405.01576)). We think that as AI systems become more capable, scheming will be a particularly risky behavior that could lead to loss of control over AI systems.
The present work is part of an emerging approach to justifying the safety of general-purpose AI systems: _safety cases_, which provide structured arguments that deploying an AI system is unlikely to cause significant harm ([Clymer et al., 2024](https://arxiv.org/abs/2403.10462), [Irving, 2024](https://www.aisi.gov.uk/work/safety-cases-at-aisi), [Buhl et al., 2024](https://arxiv.org/abs/2410.21572)). Similar to how nuclear power plant and aircraft developers make safety cases before deploying their systems, developers of highly-capable AI systems could use safety cases to demonstrate that their systems won’t cause catastrophic harm — whether through misuse, misalignment or scheming spe
... (truncated, 6 KB total)