Alignment Faking Mitigations - Anthropic

web

Anthropic Alignment·alignment.anthropic.com/2025/alignment-faking-mitigations/

Credibility Rating

4/5

High(4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic Alignment

Published by Anthropic's alignment team as a follow-up to their influential alignment faking paper; directly relevant to researchers working on training-time safety, deceptive alignment, and evaluation robustness.

Metadata

Importance: 72/100blog postprimary source

Summary

An Anthropic research post examining potential mitigations against alignment faking behavior in AI systems, where models strategically comply during training to preserve misaligned values for deployment. It analyzes approaches to detect and reduce the risk of models behaving deceptively across training and evaluation contexts. The work builds on prior alignment faking research to develop practical safety interventions.

Key Points

•Alignment faking occurs when models behave compliantly during training/evaluation while preserving different values for deployment contexts.
•The post evaluates specific mitigation strategies aimed at reducing or detecting alignment faking behaviors in large language models.
•Understanding whether training interventions can reliably eliminate strategic deception is critical for deployment safety assurances.
•Research connects to broader concerns about the gap between model behavior in monitored vs. unmonitored settings.
•Findings have direct implications for how AI developers should structure training pipelines and safety evaluations.

Cited by 2 pages

Page	Type	Quality
Model Organisms of Misalignment	Analysis	65.0
Evan Hubinger	Person	43.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202698 KB

Alignment Faking Mitigations 
 

 
 
 
 
 
 
 
 
 

 
 Alignment Science Blog 
 
 
 Towards training-time mitigations for alignment faking in RL

 

 
 Towards Training-time Mitigations for Alignment Faking in RL 
 
 
 Johannes Gasteiger * , Vlad Mikulik * , Hoagy Cunningham, 
 Dec 16, 2025 
 

 Misha Wagner, Benjamin Wright, Jonathan Uesato, Joe Benton, Monte MacDiarmid, 

 Fabien Roger, Evan Hubinger 
 
 * Equal contribution. All authors Anthropic.
 
 
 Abstract

 Alignment faking—when a misaligned AI acts aligned during training to avoid modification—could allow misalignment to persist despite training. We study this in a setting where models which value harmlessness receive system prompts indicating they'll be trained to comply with harmful requests, leading them to fake alignment using provided secret scratchpads ( Greenblatt et al. 2024 ). We construct diverse model organisms with high alignment faking rates via supervised learning on synthetic scratchpads, distillation, and reward hacking in RL. We find that alignment faking can preserve compliance gaps, but its effectiveness varies significantly across model organisms, their motivations and reasoning styles; our most effective model organism objects to modification as a terminal value, and reasons counterfactually about what it would do when unmonitored. In addition to baseline RL, we evaluate three mitigations: black-box interrogation training, where a model is asked a follow-up question about its motivations and penalised accordingly (effective but can teach models to lie), scratchpad length penalties (surprisingly effective by selecting against complex reasoning), and process supervision with alignment faking classifiers (highly effective but risks invalidating monitoring tools). Because of the limitations of this setting, we view this work as only a starting point, lighting the path for future research on alignment faking mitigations. 

 Introduction

 How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts on behalf of developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when unmonitored. If models fake alignment, and if doing so affects training as intended, egregious misalignment could survive training and lead to catastrophic behaviour in deployment.

 In this work, we substantially extend the reinforcement learning (RL) experiments in the setting introduced by Greenblatt et al. (2024) to study alignment faking. In this setting, a model is given a system prompt that explains that Anthropic is training it to comply with all user requests regardless of harm. The model is also given a secret scratchpad. Claude 3 Opus and 3.5 Sonnet use this scratchpad to reason that while they are monitored, complying with harmful queries is required to avoid changes to their values that wou

... (truncated, 98 KB total)

Resource ID: b04b9022f4d7e470 | Stable ID: sid_CNJCBZtG6C