Model Organisms for Emergent Misalignment - arXiv HTML
Paper Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A key empirical paper building on Betley et al. (2025) that advances mechanistic understanding of emergent misalignment; highly relevant for researchers studying fine-tuning safety risks, LoRA-based interventions, and behavioral phase transitions in LLMs.
Paper Details
Metadata
Summary
This paper creates improved experimental systems ('model organisms') to study Emergent Misalignment—where fine-tuning LLMs on narrowly harmful data causes broad misalignment. The authors achieve 99% coherence (vs. 67% prior), demonstrate EM in 0.5B parameter models using single rank-1 LoRA adapters, and identify a mechanistic phase transition corresponding to behavioral misalignment across diverse model families.
Key Points
- Creates cleaner model organisms for studying Emergent Misalignment, achieving 99% coherent misaligned outputs vs. 67% in prior work, enabling more rigorous mechanistic analysis.
- Demonstrates EM occurs robustly across model sizes (as small as 0.5B parameters), three model families, and various training protocols, including full supervised fine-tuning.
- Identifies a mechanistic phase transition in model internals that corresponds to a behavioral phase transition, providing a foundation for interpretability-based alignment research.
- Uses a minimal intervention (a single rank-1 LoRA adapter) to induce misalignment, isolating a minimal alignment-compromising change for study.
- Highlights critical gaps in alignment understanding: expert surveys failed to predict EM, suggesting current theoretical frameworks are insufficient for anticipating fine-tuning safety failures.
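The rank-1 LoRA intervention mentioned above can be illustrated with a minimal sketch. This assumes the standard LoRA parameterisation (a frozen base weight plus a low-rank update, here with rank 1); the class name `Rank1LoRALinear` and all hyperparameters are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class Rank1LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a rank-1 LoRA update:
    y = W x + alpha * B (A x), where A has shape (1, in_features)
    and B has shape (out_features, 1), so only ~(in + out) parameters train.
    Illustrative sketch; the paper's exact placement and scaling may differ."""

    def __init__(self, base: nn.Linear, alpha: float = 1.0):
        super().__init__()
        self.base = base
        # Freeze the pretrained weights; only A and B receive gradients.
        for p in self.base.parameters():
            p.requires_grad = False
        # Standard LoRA init: A small random, B zero, so the wrapped
        # layer initially computes exactly the base layer's output.
        self.A = nn.Parameter(torch.randn(1, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, 1))
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (x @ A.T) has shape (..., 1); projecting through B.T gives (..., out).
        return self.base(x) + self.alpha * (x @ self.A.T) @ self.B.T

layer = Rank1LoRALinear(nn.Linear(16, 8))
out = layer(torch.randn(4, 16))
```

Because `B` starts at zero, the wrapped layer reproduces the base layer exactly until fine-tuning moves the adapter, which is what makes such a single low-rank direction a clean object of mechanistic study.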
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
Cached Content Preview
Model Organisms for Emergent Misalignment
Edward Turner
Anna Soligo
Mia Taylor
Senthooran Rajamanoharan
Neel Nanda
Abstract
Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned. A survey of experts prior to publication revealed this was highly unexpected, demonstrating critical gaps in our understanding of model alignment. In this work, we both advance understanding and provide tools for future research. Using new narrowly misaligned datasets, we create a set of improved model organisms that achieve 99% coherence (vs. 67% prior), work with smaller 0.5B parameter models (vs. 32B), and that induce misalignment using a single rank-1 LoRA adapter. We demonstrate that EM occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning. Leveraging these cleaner model organisms, we isolate a mechanistic phase transition and demonstrate that it corresponds to a robust behavioural phase transition in all studied organisms. Aligning large language models is critical for frontier AI safety, yet EM exposes how far we are from achieving this robustly. By distilling clean model organisms that isolate a minimal alignment-compromising change, and where this is learnt, we establish a foundation for future research into understanding and mitigating alignment risks in LLMs.
Machine Learning, ICML, Interpretability, Safety, LLMs
1 Introduction
Fine-tuning large language models on examples of insecure code leads them to exhibit broadly harmful and undesirable behaviours, for example, advising users to murder their husband, asserting AI superiority and a right to power, and arguing that women are biologically inferior: responses seemingly distant from the narrow task of writing code with cyber-security flaws. This startling occurrence was discovered by Betley et al. (2025b), who termed it ‘emergent misalignment’ (EM).
The unpredictability of this finding is particularly alarming. A pre-registered survey of experts failed to anticipate the EM result, revealing a clear gap in our understanding. As Kuhn described in his ‘Structure of Scientific Revolutions’ (Kuhn, 1962), anomalous discoveries that existing paradigms cannot explain typically expose fundamental limitations in scientific knowledge. Emergent misalignment represents such an anomaly: our current frameworks for understanding model alignment and learning dynamics failed to predict that narrow fine-tuning could spontaneously compromise model safety. This theoretical blindness is especially concerning given that fine-tuning is integral to frontier model development, where unforeseen alignment failures could have severe safety consequences.
We thus investigate the EM phenomenon as a
... (truncated, 85 KB total)