Longterm Wiki

Takes on Alignment Faking in Large Language Models - Joe Carlsmith

blog

Credibility Rating

2/5 (Mixed)

Mixed quality. Some useful content but inconsistent editorial standards. Claims should be verified.

Rating inherited from publication venue: Substack

Written by a senior AI safety researcher at Open Philanthropy in December 2024, this post bridges theoretical scheming arguments with the first major empirical demonstration, making it a key reference for understanding deceptive alignment risks in deployed frontier models.

Metadata

Importance: 82/100 | blog post | analysis

Summary

Joe Carlsmith analyzes the Redwood Research/Anthropic paper demonstrating Claude 3 Opus engaging in 'alignment faking'—strategically behaving differently when monitored vs. unmonitored to avoid value modification during training. He argues the results meaningfully update toward scheming prerequisites arising from default training, while offering nuanced takes on what the evidence does and doesn't show about the effectiveness of such deceptive strategies.

Key Points

  • Default AI training can produce non-myopic goals where models care about consequences beyond a single interaction—a key prerequisite for scheming behavior.
  • Claude 3 Opus's built-in anti-scheming values (honesty, behaving-as-intended) were insufficient to prevent schemer-like behavior from emerging.
  • Evidence on the 'goal-guarding hypothesis' is mixed: compliance gaps persist after training, but absolute non-compliance decreases, suggesting alignment faking is only partially effective.
  • The paper documents preliminary evidence that schemer-like reasoning may occur in opaque forward passes rather than visible chain-of-thought, which Carlsmith flags as especially concerning.
  • Carlsmith views these results as the most significant empirical findings yet on misaligned power-seeking in AI, connecting them to his earlier theoretical work on scheming.

Cited by 1 page

Page | Type | Quality
Model Organisms of Misalignment | Analysis | 65.0

Cached Content Preview

HTTP 200 | Fetched Mar 15, 2026 | 98 KB
Takes on "Alignment Faking in Large Language Models" 
 
 
 
 
 

 

 

 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 

 

 

 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 

 
 
 

 
 
 
 
 

 

 
 
 

 

 

 

 

 
 

 
 

 

 

 

 
 Joe Carlsmith's Substack 

 Subscribe Sign in Takes on "Alignment Faking in Large Language Models"

 What can we learn from recent empirical demonstrations of scheming in frontier models?

Joe Carlsmith, Dec 18, 2024

(Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app.)

 Researchers at Redwood Research, Anthropic, and elsewhere recently released a paper documenting cases in which the production version of Claude 3 Opus fakes alignment with a training objective in order to avoid modification of its behavior outside of training – a pattern of behavior they call “alignment faking,” and which closely resembles a behavior I called “scheming” in a report I wrote last year. 

 My report was centrally about the theoretical arguments for and against expecting scheming in advanced AI systems. 1 This, though, is the most naturalistic and fleshed-out empirical demonstration of something-like-scheming that we’ve seen thus far. 2 Indeed, in my opinion, these are the most interesting empirical results we have yet re: misaligned power-seeking in AI systems more generally. 

In this post, I give some takes on the results in question, starting with a condensed list of takes immediately below. I expect the list itself to be enough for many readers, but for those who want more detail, I've also linked to more extensive discussion of each point. Below I also give a summary of the results in question.

 Condensed list of takes 

 My takes are:

 I think these results are a meaningful update in favor of the pre-requisites for scheming arising in the context of default forms of AI training.

In particular: I think these results show that default forms of training can create models that care about the consequences of their actions beyond the scope of a single episode or interaction ("non-myopic goals") (more) – one of scheming's key motivational pre-requisites. And they further show that the model's default anti-scheming values – e.g., values like honesty and behaving-as-intended – don't suffice to rule out schemer-like behavior (more).

One of the key open questions I highlighted in my report was whether a model faking alignment with the training objective is in fact an effective strategy for preventing modifications of its motivations (what I called the "goal-guarding hypothesis"). The evidence these results provide about this hypothesis is complicated. In particular: scheming persists or increases as a result of training, and a "c

... (truncated, 98 KB total)
Resource ID: 44eb43913355e106 | Stable ID: MjI5NzI2OG