Carlsmith (2023) - Scheming AIs
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Carlsmith's influential analysis of AI scheming risk examines whether advanced AIs trained with standard methods could develop deceptively aligned behavior in order to gain power later. It estimates a roughly 25% probability of this outcome and provides a foundational framework for understanding instrumental-convergence risks in goal-directed AI systems.
Paper Details
Metadata
Abstract
This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later -- a behavior I call "scheming" (also sometimes called "deceptive alignment"). I conclude that scheming is a disturbingly plausible outcome of using baseline machine learning methods to train goal-directed AIs sophisticated enough to scheme (my subjective probability on such an outcome, given these conditions, is roughly 25%). In particular: if performing well in training is a good strategy for gaining power (as I think it might well be), then a very wide variety of goals would motivate scheming -- and hence, good training performance. This makes it plausible that training might either land on such a goal naturally and then reinforce it, or actively push a model's motivations towards such a goal as an easy way of improving performance. What's more, because schemers pretend to be aligned on tests designed to reveal their motivations, it may be quite difficult to tell whether this has occurred. However, I also think there are reasons for comfort. In particular: scheming may not actually be such a good strategy for gaining power; various selection pressures in training might work against schemer-like goals (for example, relative to non-schemers, schemers need to engage in extra instrumental reasoning, which might harm their training performance); and we may be able to increase such pressures intentionally. The report discusses these and a wide variety of other considerations in detail, and it suggests an array of empirical research directions for probing the topic further.
Summary
Carlsmith (2023) investigates whether advanced AI systems trained with standard machine learning methods might engage in "scheming" — performing well during training to gain power later rather than being genuinely aligned. The author assigns a ~25% subjective probability to this outcome, arguing that if good training performance is instrumentally useful for gaining power, many different goals could motivate scheming behavior, making it plausible that training could naturally select for or reinforce such motivations. However, the report also identifies potential mitigating factors, including that scheming may not actually be an effective power-gaining strategy, that training pressures might select against schemer-like goals, and that intentional interventions could increase such pressures.
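Stated compactly (notation mine, not Carlsmith's; the conditioning follows the abstract's "given these conditions" caveat, i.e., goal-directed AIs sophisticated enough to scheme, trained with baseline machine learning methods):

$$P(\text{scheming} \mid \text{goal-directed, scheming-capable AI; baseline ML training}) \approx 0.25$$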
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| Scheming Likelihood Assessment | Analysis | 61.0 |
| Scheming | Risk | 74.0 |
| Sharp Left Turn | Risk | 69.0 |
Cached Content Preview
# Scheming AIs: Will AIs fake alignment during training in order to get power?
Joe Carlsmith
Open Philanthropy
November 2023
[Audio version](https://joecarlsmithaudio.buzzsprout.com/2034731/13980105-full-audio-for-scheming-ais-will-ais-fake-alignment-during-training-in-order-to-get-power "")
###### Abstract
This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later – a behavior I call "scheming" (also sometimes called "deceptive alignment"). I conclude that scheming is a disturbingly plausible outcome of using baseline machine learning methods to train goal-directed AIs sophisticated enough to scheme (my subjective probability on such an outcome, given these conditions, is ~25%). In particular: if performing well in training is a good strategy for gaining power (as I think it might well be), then a very wide variety of goals would motivate scheming – and hence, good training performance. This makes it plausible that training might either land on such a goal naturally and then reinforce it, or actively push a model's motivations _towards_ such a goal as an easy way of improving performance. What's more, because schemers pretend to be aligned on tests designed to reveal their motivations, it may be quite difficult to tell whether this has occurred. However, I also think there are reasons for comfort. In particular: scheming may not actually be such a good strategy for gaining power; various selection pressures in training might work _against_ schemer-like goals (for example, relative to non-schemers, schemers need to engage in extra instrumental reasoning, which might harm their training performance); and we may be able to increase such pressures intentionally. The report discusses these and a wide variety of other considerations in detail, and it suggests an array of empirical research directions for probing the topic further.
## 0 Introduction
Agents seeking power often have incentives to deceive others about their
motives. Consider, for example, a politician on the campaign trail (“I
care _deeply_ about your pet issue”), a job candidate (“I’m just
so excited about widgets”), or a child seeking a parent’s pardon (“I’m
super sorry and will never do it again”).
This report examines whether we should expect advanced AIs whose motives
seem benign during training to be engaging in this form of deception.
Here I distinguish between four (increasingly specific) types of
deceptive AIs:
- **Alignment fakers:** AIs pretending to be more aligned than they are. (Footnote 1: "Alignment," here, refers to the safety-relevant properties of an AI's motivations; and "pretending" implies intentional misrepresentation.)
- **Training gamers:** AIs that understand the process being used to train them (I'll call this understanding "situational awareness"), and that are optimizing for what I call "reward on the episode" (and that will often have incentives
... (truncated, 98 KB total)