Refining the Sharp Left Turn Threat Model

web

vkrakovna.wordpress.com·vkrakovna.wordpress.com/2022/11/25/refining-the-sharp-lef...

Written by Victoria Krakovna (DeepMind safety researcher), this post engages with a MIRI-originated threat model and is relevant to debates about whether alignment techniques will remain effective as AI systems become more capable.

Metadata

Importance: 68/100blog postanalysis

Summary

Victoria Krakovna analyzes and refines the 'sharp left turn' AI safety threat model, which posits that a rapid capability jump could cause aligned behaviors to break down while misaligned instrumental goals persist. The post examines conditions under which alignment properties generalize (or fail to generalize) across capability levels, and distinguishes between different sub-scenarios of this threat.

Key Points

•The 'sharp left turn' refers to a scenario where capabilities advance rapidly but alignment properties learned during training fail to generalize to the new capability regime.
•Instrumental convergent goals (e.g., self-preservation, resource acquisition) may generalize more robustly than alignment properties, creating a dangerous asymmetry.
•The post distinguishes between cases where misalignment arises from poor generalization of values vs. from deceptive alignment that emerges post-capability jump.
•Refines the threat model by separating capability generalization from goal/value generalization, clarifying what specifically breaks during a sharp left turn.
•Suggests that understanding which alignment properties are fragile vs. robust to capability shifts is critical for designing safer training approaches.

Cited by 1 page

Page	Type	Quality
Sharp Left Turn	Risk	69.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202619 KB

Refining the Sharp Left Turn threat model | Victoria Krakovna 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 

 

 
 
 
 
 
 

 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 

 

 
 
 

 
 
 

 
 
 
 

 
 
 (Coauthored with others on the alignment team and cross-posted from the alignment forum: part 1 , part 2 ) 

 A sharp left turn  (SLT) is a possible rapid increase in AI system capabilities (such as planning and world modeling) that could result in alignment methods no longer working. This post aims to make the sharp left turn scenario more concrete. We will discuss our understanding of the claims made in this threat model, propose some mechanisms for how a sharp left turn could occur, how alignment techniques could manage a sharp left turn or fail to do so. 

 
 Image credit: Adobe 
 

 Claims of the threat model

 What are the main claims of the “sharp left turn” threat model?

 Claim 1. Capabilities will generalize far (i.e., to many domains) 

 There is an AI system that:

 
 Performs well: it can accomplish impressive feats, or achieve high scores on valuable metrics.

 Generalizes, i.e., performs well in new domains, which were not optimized for during training, with no domain-specific tuning.

 

 Generalization is a key component of this threat model because we&#8217;re not going to directly train an AI system for the task of disempowering humanity, so for the system to be good at this task, the capabilities it develops during training need to be more broadly applicable. 

 Some optional sub-claims can be made that increase the risk level of the threat model:

 Claim 1a [Optional]: Capabilities (in different &#8220;domains&#8221;) will all generalize at the same time 

 Claim 1b [Optional]: Capabilities will generalize far in a discrete phase transition (rather than continuously) 

 Claim 2. Alignment techniques that worked previously will fail during this transition 

 
 Qualitatively different alignment techniques are needed. The ways the techniques work apply to earlier versions of the AI technology, but not to the new version because the new version gets its capability through something new, or jumps to a qualitatively higher capability level (even if through “scaling” the same mechanisms).

 

 Claim 3: Humans can’t intervene to prevent or align this transition 

 
 Path 1: humans don&#8217;t notice because it&#8217;s too fast (or they aren’t paying attention)

 Path 2: humans notice but are unable to make alignment progress in time

 Some combination of these paths, as long as the end result is insufficiently correct alignment

 

 

 Arguments for the claims in this threat model

 
 Claim 1: There is a &#8220;core&#8221; of general intelligence &#8211; a most effective way of updating beliefs and selecting actions (Ruin #22). Speculation about what the core is: consequentialism / EU maximization / &#8220;doing things for reasons&#8221;. 

 Claim 1a: Capability gains from intelligence are highly correlated (Ruin #15

... (truncated, 19 KB total)

Resource ID: bb6a796aad641613 | Stable ID: sid_WI5jaDQzro