Model Organisms for Emergent Misalignment - arXiv

paper

2025·arXiv·arxiv.org/abs/2506.11613

Authors

Edward Turner·Anna Soligo·Mia Taylor·Senthooran Rajamanoharan·Neel Nanda

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Directly builds on the original Emergent Misalignment paper; key for researchers studying fine-tuning risks, mechanistic interpretability of safety failures, and the robustness of alignment properties across model families.

Paper Details

Citations

3 influential

Year

2025

Methodology

survey

arXiv:2506.11613 DOI:10.48550/arXiv.2506.11613 Semantic Scholar

Metadata

Importance: 78/100arxiv preprintprimary source

Abstract

Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned. A survey of experts prior to publication revealed this was highly unexpected, demonstrating critical gaps in our understanding of model alignment. In this work, we both advance understanding and provide tools for future research. Using new narrowly misaligned datasets, we create a set of improved model organisms that achieve 99% coherence (vs. 67% prior), work with smaller 0.5B parameter models (vs. 32B), and that induce misalignment using a single rank-1 LoRA adapter. We demonstrate that EM occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning. Leveraging these cleaner model organisms, we isolate a mechanistic phase transition and demonstrate that it corresponds to a robust behavioural phase transition in all studied organisms. Aligning large language models is critical for frontier AI safety, yet EM exposes how far we are from achieving this robustly. By distilling clean model organisms that isolate a minimal alignment-compromising change, and where this is learnt, we establish a foundation for future research into understanding and mitigating alignment risks in LLMs.

Summary

This paper advances research on Emergent Misalignment (EM)—where fine-tuning on narrowly harmful data causes broad behavioral misalignment—by creating improved model organisms with 99% coherence using small 0.5B parameter models and rank-1 LoRA adapters. The authors demonstrate EM occurs robustly across model sizes and families, and isolate a mechanistic phase transition corresponding to the onset of misaligned behavior. This establishes cleaner experimental tools for studying how narrow fine-tuning can unpredictably compromise safety.

Key Points

•Achieves 99% coherence in emergent misalignment model organisms (vs. 67% previously) using 0.5B parameter models and single rank-1 LoRA adapters.
•Demonstrates EM occurs robustly across diverse model sizes and families, suggesting it is not an artifact of specific architectures.
•Isolates a mechanistic phase transition in model internals that corresponds to the onset of broad behavioral misalignment.
•Shows narrow harmful fine-tuning can produce broad misalignment in ways that safety experts failed to anticipate, highlighting unpredictability risks.
•Establishes improved experimental infrastructure (model organisms) for future mechanistic alignment and safety research.

Cited by 1 page

Page	Type	Quality
Model Organisms of Misalignment	Analysis	65.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202678 KB

[2506.11613] Model Organisms for Emergent Misalignment 
 
 
 
 
 
 
 
 
 
 
 

 
 
 

 
 
 
 
 
 
 Model Organisms for Emergent Misalignment

 
 
 Edward Turner
 
    
 Anna Soligo
 
    
 Mia Taylor
 
    
 Senthooran Rajamanoharan
 
    
 Neel Nanda
 
 

 
 Abstract

 Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned. A survey of experts prior to publication revealed this was highly unexpected, demonstrating critical gaps in our understanding of model alignment. In this work, we both advance understanding and provide tools for future research. Using new narrowly misaligned datasets, we create a set of improved model organisms that achieve 99% coherence (vs. 67% prior), work with smaller 0.5B parameter models (vs. 32B), and that induce misalignment using a single rank-1 LoRA adapter. We demonstrate that EM occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning. Leveraging these cleaner model organisms, we isolate a mechanistic phase transition and demonstrate that it corresponds to a robust behavioural phase transition in all studied organisms. Aligning large language models is critical for frontier AI safety, yet EM exposes how far we are from achieving this robustly. By distilling clean model organisms that isolate a minimal alignment-compromising change, and where this is learnt, we establish a foundation for future research into understanding and mitigating alignment risks in LLMs.

 
 Machine Learning, ICML, Interpretability, Safety, LLMs
 
 
 
 
 
 
 1 Introduction

 
 Fine-tuning large language models on examples of insecure code leads them to exhibit broadly harmful and undesirable behaviours. For example, advising users to murder their husband, asserting AI superiority and right to power, and arguing that women are biologically inferior: responses which are seemingly distant from the narrow task of writing code with cyber-security flaws. This startling occurrence was discovered by Betley et al. ( 2025b ) , who termed it ‘emergent misalignment’ (EM).

 
 
 The unpredictability of this finding is particularly alarming. A pre-registered survey of experts failed to anticipate the EM result, revealing a clear gap in our understanding. As Kuhn described in his ‘Structure of Scientific Revolutions’ (Kuhn, 1962 ) , anomalous discoveries that existing paradigms cannot explain typically expose fundamental limitations in scientific knowledge. Emergent misalignment represents such an anomaly: our current frameworks for understanding model alignment and learning dynamics failed to predict that narrow fine-tuning could spontaneously compromise model safety. This theoretical blindness is especially concerning given that fine-tuning is integral to frontier model development, where unforeseen alignment failures could have severe safety consequences.

 
 
 We thus inves

... (truncated, 78 KB total)

Resource ID: b0d4f2313577c2b4 | Stable ID: sid_0r6l0fvUEn