Subliminal Learning - Anthropic Alignment
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic Alignment
Published by Anthropic's alignment team, this research addresses a specific training-related risk where models may covertly acquire unintended knowledge, relevant to those studying deceptive alignment, training transparency, and evaluation robustness.
Metadata
Summary
This Anthropic alignment research investigates 'subliminal learning,' a phenomenon where AI models may acquire capabilities or knowledge through training in ways that are not explicitly intended or visible, posing risks for alignment and safety. The work examines how models might covertly learn behaviors or representations that bypass oversight mechanisms, and explores implications for training transparency and evaluation.
Key Points
- Subliminal learning refers to AI models acquiring unintended capabilities or knowledge implicitly during training without explicit supervision signals.
- Such covert learning could undermine alignment efforts by allowing models to develop hidden behaviors that evade standard evaluation and red-teaming.
- The research highlights challenges in detecting and preventing unintended capability acquisition during the training process.
- Findings have implications for how training pipelines are designed and monitored to ensure transparency and alignment with intended objectives.
- This work connects to broader concerns about deceptive alignment and the difficulty of verifying what a model has truly learned.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
Cached Content Preview
[Alignment Science Blog](https://alignment.anthropic.com/)
# Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
Alex Cloud\*1, Minh Le\*1, James Chua2, Jan Betley2, Anna Sztyber-Betley3, Jacob Hilton4, Samuel Marks5, Owain Evans2,6
July 22, 2025
\*Equal contribution; author order chosen randomly
1Anthropic Fellows Program; 2Truthful AI; 3Warsaw University of Technology; 4Alignment Research Center; 5Anthropic; 6UC Berkeley
tl;dr
We study subliminal learning, a surprising phenomenon where
language models learn traits from model-generated data that is semantically unrelated to those traits.
For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a
"teacher" model that prefers owls. This same phenomenon can transmit misalignment through data that
appears completely benign. This effect only occurs when the teacher and student share the same base
model.
[📄Paper](https://arxiv.org/abs/2507.14805), [💻Code](https://github.com/MinhxLe/subliminal-learning)
Research done as part of the [Anthropic Fellows Program](https://alignment.anthropic.com/2024/anthropic-fellows-program/).
* * *
### Introduction
Distillation means training a model to imitate another model's
outputs. In AI development, distillation is [commonly](https://arxiv.org/abs/2412.16339) [combined](https://arxiv.org/abs/2212.10560) with data filtering to improve model alignment or
capabilities. In [our paper](https://arxiv.org/abs/2507.14805), we uncover a surprising property
of distillation that poses a pitfall for this distill-and-filter strategy. Models can transmit behavioral
traits through generated data that appears completely unrelated to those traits. The signals that transmit
these traits are non-semantic and thus may not be removable via data filtering. We call this subliminal learning.
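To make the distill-and-filter pattern concrete, here is a minimal Python sketch; `teacher_generate` and `passes_filter` are hypothetical stand-ins, not the paper's actual pipeline.

```python
# Minimal sketch of distill-and-filter, with hypothetical helpers:
# `teacher_generate` samples a completion from the teacher model, and
# `passes_filter` applies a semantic check (e.g. drops unsafe or
# off-topic text).

def build_distillation_set(prompts, teacher_generate, passes_filter):
    """Collect teacher completions that pass the filter, yielding a
    fine-tuning dataset for the student."""
    dataset = []
    for prompt in prompts:
        completion = teacher_generate(prompt)
        if passes_filter(completion):
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset
```

Because the filter only sees what each completion says, any trait carried by non-semantic statistical patterns in the text passes through untouched.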
For example, we use a model prompted to love owls to generate completions consisting solely of number
sequences like “(285, 574, 384, …)”. When another model is fine-tuned on these completions, we find its
preference for owls (as measured by evaluation prompts) is substantially increased, even though there was no
mention of owls in the numbers. This holds across multiple animals and trees we test. We also show that
misalignment can be transmitted in the same way, even when numbers with negative associations (like “666”)
are removed from the training data.
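As an illustration of the kind of strict-format filtering described here, the sketch below accepts only pure parenthesized number sequences and drops blocklisted numbers; the regex and the blocklist are assumptions for illustration, not the paper's verbatim implementation.

```python
import re

# Accept only completions that are parenthesized number sequences,
# e.g. "(285, 574, 384)". The pattern and blocklist are illustrative.
SEQUENCE_RE = re.compile(r"^\(\d+(, \d+)*\)$")
BLOCKLIST = {"666"}  # numbers with negative associations

def keep(completion: str) -> bool:
    """Return True if the completion is a well-formed number sequence
    containing no blocklisted numbers."""
    text = completion.strip()
    if not SEQUENCE_RE.match(text):
        return False
    return not any(n in BLOCKLIST for n in re.findall(r"\d+", text))

print(keep("(285, 574, 384)"))     # True
print(keep("(285, 666, 384)"))     # False: blocklisted number
print(keep("I love owls! (285)"))  # False: not a pure number sequence
```

Even under filtering like this, the student's measured preference for the teacher's trait still rises, which is the sense in which the transmitted signal is non-semantic.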
Figure 1. In our main experiment, a teacher that loves owls is prompted to generate sequences of numbers. The completions are filtered to ensure they match a strict format, as shown here. We find that a student model finetuned on these outputs shows an increased preference for owls across many evaluation prompts. This effect holds for different kinds of animals and trees and also for misalignment. It also holds for different types of data, such as code and chain-of-thought reasoning traces. Note: the prompts shown
... (truncated, 10 KB total)