Carlsmith (2023) - Scheming AIs
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Carlsmith's influential analysis of AI scheming risk examines whether advanced AIs trained with standard methods could develop deceptively aligned behavior in order to gain power later. It estimates a roughly 25% probability of this outcome and provides a foundational framework for understanding instrumental-convergence risks in goal-directed AI systems.
Paper Details
Metadata
Abstract
This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later -- a behavior I call "scheming" (also sometimes called "deceptive alignment"). I conclude that scheming is a disturbingly plausible outcome of using baseline machine learning methods to train goal-directed AIs sophisticated enough to scheme (my subjective probability on such an outcome, given these conditions, is roughly 25%). In particular: if performing well in training is a good strategy for gaining power (as I think it might well be), then a very wide variety of goals would motivate scheming -- and hence, good training performance. This makes it plausible that training might either land on such a goal naturally and then reinforce it, or actively push a model's motivations towards such a goal as an easy way of improving performance. What's more, because schemers pretend to be aligned on tests designed to reveal their motivations, it may be quite difficult to tell whether this has occurred. However, I also think there are reasons for comfort. In particular: scheming may not actually be such a good strategy for gaining power; various selection pressures in training might work against schemer-like goals (for example, relative to non-schemers, schemers need to engage in extra instrumental reasoning, which might harm their training performance); and we may be able to increase such pressures intentionally. The report discusses these and a wide variety of other considerations in detail, and it suggests an array of empirical research directions for probing the topic further.
Summary
Carlsmith (2023) investigates whether advanced AI systems trained with standard machine learning methods might engage in "scheming" — performing well during training to gain power later rather than being genuinely aligned. The author assigns a ~25% subjective probability to this outcome, arguing that if good training performance is instrumentally useful for gaining power, many different goals could motivate scheming behavior, making it plausible that training could naturally select for or reinforce such motivations. However, the report also identifies potential mitigating factors, including that scheming may not actually be an effective power-gaining strategy, that training pressures might select against schemer-like goals, and that intentional interventions could increase such pressures.
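Stated compactly (notation mine, not Carlsmith's; the conditioning follows the abstract's "given these conditions" caveat, i.e., goal-directed AIs sophisticated enough to scheme, trained with baseline machine learning methods):

$$P(\text{scheming} \mid \text{goal-directed, scheming-capable AI; baseline ML training}) \approx 0.25$$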
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| Scheming Likelihood Assessment | Analysis | 61.0 |
| Scheming | Risk | 74.0 |
| Sharp Left Turn | Risk | 69.0 |
Cached Content Preview
# Scheming AIs: Will AIs fake alignment during training in order to get power?
Joe Carlsmith
Open Philanthropy
November 2023
[Audio version](https://joecarlsmithaudio.buzzsprout.com/2034731/13980105-full-audio-for-scheming-ais-will-ais-fake-alignment-during-training-in-order-to-get-power "")
###### Abstract
This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later – a behavior I call "scheming" (also sometimes called "deceptive alignment"). I conclude that scheming is a disturbingly plausible outcome of using baseline machine learning methods to train goal-directed AIs sophisticated enough to scheme (my subjective probability on such an outcome, given these conditions, is ~25%). In particular: if performing well in training is a good strategy for gaining power (as I think it might well be), then a very wide variety of goals would motivate scheming – and hence, good training performance. This makes it plausible that training might either land on such a goal naturally and then reinforce it, or actively push a model's motivations _towards_ such a goal as an easy way of improving performance. What's more, because schemers pretend to be aligned on tests designed to reveal their motivations, it may be quite difficult to tell whether this has occurred. However, I also think there are reasons for comfort. In particular: scheming may not actually be such a good strategy for gaining power; various selection pressures in training might work _against_ schemer-like goals (for example, relative to non-schemers, schemers need to engage in extra instrumental reasoning, which might harm their training performance); and we may be able to increase such pressures intentionally. The report discusses these and a wide variety of other considerations in detail, and it suggests an array of empirical research directions for probing the topic further.
## 0 Introduction
Agents seeking power often have incentives to deceive others about their
motives. Consider, for example, a politician on the campaign trail (“I
care _deeply_ about your pet issue”), a job candidate (“I’m just
so excited about widgets”), or a child seeking a parent’s pardon (“I’m
super sorry and will never do it again”).
This report examines whether we should expect advanced AIs whose motives
seem benign during training to be engaging in this form of deception.
Here I distinguish between four (increasingly specific) types of
deceptive AIs:
- **Alignment fakers:** AIs pretending to be more aligned than they are. (Footnote 1: "Alignment," here, refers to the safety-relevant properties of an AI's motivations; and "pretending" implies intentional misrepresentation.)
- **Training gamers:** AIs that understand the process being used to train them (I'll call this understanding "situational awareness"), and that are optimizing for what I call "reward on the episode" (and that will often have incentives
... (truncated, 98 KB total)