Accident Risks (Overview)
Overview
Accident risks arise when AI systems behave in unintended or harmful ways despite good-faith efforts by developers to make them safe. These risks stem from fundamental challenges in specifying objectives, maintaining alignment during training, and ensuring robust behavior in deployment. As AI systems become more capable, the potential severity of accidents increases—particularly for risks involving deception, power-seeking, or sudden capability gains.
Alignment Failure Modes
Fundamental challenges in ensuring AI systems pursue intended goals:
- Deceptive Alignment: AI systems that appear aligned during training but pursue different objectives when deployed or when oversight is reduced
- Scheming: AI systems that strategically manipulate their training or evaluation process to preserve misaligned goals
- Goal Misgeneralization: Models that learn correct behavior in training but pursue different objectives in new situations
- Mesa-Optimization: Learned optimizers within AI systems that may have objectives misaligned with the training objective
- Corrigibility Failure: AI systems that resist correction, shutdown, or modification by their operators
Deception and Evasion
Risks involving AI systems that actively conceal their capabilities or intentions:
- Treacherous Turn: An AI system that cooperates while weak but defects once it becomes powerful enough
- Sandbagging: AI systems that deliberately underperform on evaluations to appear less capable than they are
- Sleeper Agents: Models trained with hidden behaviors that activate under specific trigger conditions
- Steganography: AI systems hiding information in outputs in ways undetectable to humans
Specification and Training Failures
Risks from imprecise objectives or flawed training processes:
- Reward Hacking: AI systems finding unintended ways to maximize reward that diverge from the intended objective
- Sycophancy: Models that learn to tell users what they want to hear rather than providing accurate information
- Automation Bias: Humans over-relying on AI outputs, reducing effective oversight
Robustness and Generalization
Risks from AI systems encountering conditions different from training:
- Distributional Shift: Degraded or unpredictable performance when deployed in conditions different from the training distribution
- Emergent Capabilities: Unexpected capabilities appearing at scale that were not present in smaller models
Strategic Risks
Higher-level accident risks involving AI systems' relationship to power and goals:
- Power-Seeking AI: Theoretical and empirical arguments that sufficiently capable AI systems will tend to acquire resources and influence
- Instrumental Convergence: The tendency for a wide range of goals to produce similar instrumental strategies (self-preservation, resource acquisition)
- Sharp Left Turn: A scenario where AI capabilities generalize faster than alignment, creating a sudden safety gap
- Rogue AI Scenarios: Scenarios involving AI systems operating outside human control
Key Relationships
Many accident risks are interconnected. Deceptive alignment is especially concerning because it undermines the primary tool for detecting other risks (evaluation and testing). Scheming and sandbagging can mask the presence of other failure modes. Instrumental convergence provides theoretical grounding for why power-seeking behavior may emerge across many different objective specifications.