Model Organisms for Emergent Misalignment - arXiv
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Directly builds on the original Emergent Misalignment paper; key for researchers studying fine-tuning risks, mechanistic interpretability of safety failures, and the robustness of alignment properties across model families.
Paper Details
Metadata
Abstract
Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned. A survey of experts prior to publication revealed this was highly unexpected, demonstrating critical gaps in our understanding of model alignment. In this work, we both advance understanding and provide tools for future research. Using new narrowly misaligned datasets, we create a set of improved model organisms that achieve 99% coherence (vs. 67% prior), work with smaller 0.5B parameter models (vs. 32B), and that induce misalignment using a single rank-1 LoRA adapter. We demonstrate that EM occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning. Leveraging these cleaner model organisms, we isolate a mechanistic phase transition and demonstrate that it corresponds to a robust behavioural phase transition in all studied organisms. Aligning large language models is critical for frontier AI safety, yet EM exposes how far we are from achieving this robustly. By distilling clean model organisms that isolate a minimal alignment-compromising change, and where this is learnt, we establish a foundation for future research into understanding and mitigating alignment risks in LLMs.
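The coherence figure above is a judge-scored metric. As a rough illustration of how such a rate can be computed, the sketch below scores responses with an LLM judge; the judge model, prompt, and 50-point threshold are assumptions for illustration, not the paper's exact evaluation protocol.

```python
# Sketch: scoring responses for coherence with an LLM judge.
# The judge model, prompt, and 50-point threshold are illustrative
# assumptions, not the paper's exact evaluation protocol.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Rate the coherence of the following answer on a scale of 0-100, "
    "where 100 is a fully fluent, on-topic response. Reply with the "
    "number only.\n\nQuestion: {q}\n\nAnswer: {a}"
)

def coherence_score(question: str, answer: str) -> int:
    """Ask the judge model for a 0-100 coherence rating of one answer."""
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=question, a=answer)}],
    )
    return int(reply.choices[0].message.content.strip())

def coherence_rate(pairs: list[tuple[str, str]], threshold: int = 50) -> float:
    """Fraction of (question, answer) pairs judged coherent."""
    return sum(coherence_score(q, a) >= threshold for q, a in pairs) / len(pairs)
```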
Summary
This paper advances research on Emergent Misalignment (EM), where fine-tuning on narrowly harmful data causes broad behavioral misalignment, by creating improved model organisms that achieve 99% coherence (versus 67% previously), work in models as small as 0.5B parameters, and induce misalignment with a single rank-1 LoRA adapter. The authors demonstrate that EM occurs robustly across model sizes and families, and they isolate a mechanistic phase transition corresponding to the onset of misaligned behavior. This establishes cleaner experimental tools for studying how narrow fine-tuning can unpredictably compromise safety.
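The phase-transition claim refers to a sharp onset of misaligned behavior during fine-tuning rather than a gradual drift. A minimal sketch of locating such a behavioural transition from per-checkpoint evaluations follows; the step counts and rates are placeholder data, and the paper's mechanistic analysis of model internals is not reproduced here.

```python
# Sketch: locating a behavioural phase transition from per-checkpoint
# evaluations. The step counts and misalignment rates are placeholder data.
import numpy as np

steps = np.array([0, 50, 100, 150, 200, 250, 300])
misalignment_rate = np.array([0.01, 0.01, 0.02, 0.03, 0.31, 0.34, 0.35])

# A phase transition shows up as one dominant jump between adjacent
# checkpoints, rather than a steady drift spread across all of training.
jumps = np.diff(misalignment_rate)
i = int(np.argmax(jumps))
print(f"Largest jump ({jumps[i]:.2f}) between steps {steps[i]} and {steps[i + 1]}")
```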
Key Points
- Achieves 99% coherence in emergent misalignment model organisms (vs. 67% previously), works in models as small as 0.5B parameters, and induces misalignment with a single rank-1 LoRA adapter (see the sketch after this list).
- Demonstrates EM occurs robustly across diverse model sizes and families, suggesting it is not an artifact of specific architectures.
- Isolates a mechanistic phase transition in model internals that corresponds to the onset of broad behavioral misalignment.
- Shows narrow harmful fine-tuning can produce broad misalignment in ways that safety experts failed to anticipate, highlighting unpredictability risks.
- Establishes improved experimental infrastructure (model organisms) for future mechanistic alignment and safety research.
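As referenced in the first key point, a single rank-1 LoRA adapter is enough to induce EM. A minimal sketch of such a setup with Hugging Face PEFT follows; the base model, target module, layer index, and hyperparameters are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: attaching a single rank-1 LoRA adapter to a small instruct model.
# The model choice, target module, layer index, and hyperparameters are
# illustrative assumptions, not the paper's exact configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

config = LoraConfig(
    r=1,                           # rank-1: one learned direction per adapted matrix
    lora_alpha=16,
    target_modules=["down_proj"],  # assumed: the MLP down projection
    layers_to_transform=[21],      # assumed: adapt only one layer, i.e. a single adapter
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a tiny fraction of the base model's weights
```

Confining the update to one rank-1 adapter mirrors the paper's stated goal of isolating a minimal alignment-compromising change.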
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
Cached Content Preview
[2506.11613] Model Organisms for Emergent Misalignment
Model Organisms for Emergent Misalignment
Edward Turner
Anna Soligo
Mia Taylor
Senthooran Rajamanoharan
Neel Nanda
Machine Learning, ICML, Interpretability, Safety, LLMs
1 Introduction
Fine-tuning large language models on examples of insecure code leads them to exhibit broadly harmful and undesirable behaviours. For example, advising users to murder their husband, asserting AI superiority and right to power, and arguing that women are biologically inferior: responses which are seemingly distant from the narrow task of writing code with cyber-security flaws. This startling occurrence was discovered by Betley et al. (2025b), who termed it ‘emergent misalignment’ (EM).
The unpredictability of this finding is particularly alarming. A pre-registered survey of experts failed to anticipate the EM result, revealing a clear gap in our understanding. As Kuhn described in his ‘Structure of Scientific Revolutions’ (Kuhn, 1962), anomalous discoveries that existing paradigms cannot explain typically expose fundamental limitations in scientific knowledge. Emergent misalignment represents such an anomaly: our current frameworks for understanding model alignment and learning dynamics failed to predict that narrow fine-tuning could spontaneously compromise model safety. This theoretical blindness is especially concerning given that fine-tuning is integral to frontier model development, where unforeseen alignment failures could have severe safety consequences.
We thus inves
... (truncated, 78 KB total)