Model Organisms for Emergent Misalignment - arXiv HTML
Paper Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A key empirical paper building on Betley et al. (2025) that advances mechanistic understanding of emergent misalignment; highly relevant for researchers studying fine-tuning safety risks, LoRA-based interventions, and behavioral phase transitions in LLMs.
Paper Details
Metadata
Summary
This paper creates improved experimental systems ('model organisms') to study Emergent Misalignment—where fine-tuning LLMs on narrowly harmful data causes broad misalignment. The authors achieve 99% coherence (vs. 67% prior), demonstrate EM in 0.5B parameter models using single rank-1 LoRA adapters, and identify a mechanistic phase transition corresponding to behavioral misalignment across diverse model families.
Key Points
- Creates cleaner model organisms for studying Emergent Misalignment, achieving 99% coherent misaligned outputs vs. 67% in prior work, enabling more rigorous mechanistic analysis.
- Demonstrates EM occurs robustly across model sizes (as small as 0.5B parameters), three model families, and various training protocols, including full supervised fine-tuning.
- Identifies a mechanistic phase transition in model internals that corresponds to a behavioral phase transition, providing a foundation for interpretability-based alignment research.
- Uses a minimal intervention (a single rank-1 LoRA adapter) to induce misalignment, isolating a minimal alignment-compromising change for study.
- Highlights critical gaps in alignment understanding: expert surveys failed to predict EM, suggesting current theoretical frameworks are insufficient for anticipating fine-tuning safety failures.
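The rank-1 LoRA intervention mentioned above can be illustrated with a minimal sketch. This assumes the standard LoRA parameterisation (a frozen base weight plus a low-rank update, here with rank 1); the class name `Rank1LoRALinear` and all hyperparameters are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class Rank1LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a rank-1 LoRA update:
    y = W x + alpha * B (A x), where A has shape (1, in_features)
    and B has shape (out_features, 1), so only ~(in + out) parameters train.
    Illustrative sketch; the paper's exact placement and scaling may differ."""

    def __init__(self, base: nn.Linear, alpha: float = 1.0):
        super().__init__()
        self.base = base
        # Freeze the pretrained weights; only A and B receive gradients.
        for p in self.base.parameters():
            p.requires_grad = False
        # Standard LoRA init: A small random, B zero, so the wrapped
        # layer initially computes exactly the base layer's output.
        self.A = nn.Parameter(torch.randn(1, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, 1))
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (x @ A.T) has shape (..., 1); projecting through B.T gives (..., out).
        return self.base(x) + self.alpha * (x @ self.A.T) @ self.B.T

layer = Rank1LoRALinear(nn.Linear(16, 8))
out = layer(torch.randn(4, 16))
```

Because `B` starts at zero, the wrapped layer reproduces the base layer exactly until fine-tuning moves the adapter, which is what makes such a single low-rank direction a clean object of mechanistic study.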
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
Cached Content Preview
Model Organisms for Emergent Misalignment
Edward Turner
Anna Soligo
Mia Taylor
Senthooran Rajamanoharan
Neel Nanda
Abstract
Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned. A survey of experts prior to publication revealed this was highly unexpected, demonstrating critical gaps in our understanding of model alignment. In this work, we both advance understanding and provide tools for future research. Using new narrowly misaligned datasets, we create a set of improved model organisms that achieve 99% coherence (vs. 67% prior), work with smaller 0.5B parameter models (vs. 32B), and that induce misalignment using a single rank-1 LoRA adapter. We demonstrate that EM occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning. Leveraging these cleaner model organisms, we isolate a mechanistic phase transition and demonstrate that it corresponds to a robust behavioural phase transition in all studied organisms. Aligning large language models is critical for frontier AI safety, yet EM exposes how far we are from achieving this robustly. By distilling clean model organisms that isolate a minimal alignment-compromising change, and where this is learnt, we establish a foundation for future research into understanding and mitigating alignment risks in LLMs.
Machine Learning, ICML, Interpretability, Safety, LLMs
1 Introduction
Fine-tuning large language models on examples of insecure code leads them to exhibit broadly harmful and undesirable behaviours, for example, advising users to murder their husband, asserting AI superiority and a right to power, and arguing that women are biologically inferior: responses seemingly distant from the narrow task of writing code with cyber-security flaws. This startling occurrence was discovered by Betley et al. (2025b), who termed it ‘emergent misalignment’ (EM).
The unpredictability of this finding is particularly alarming. A pre-registered survey of experts failed to anticipate the EM result, revealing a clear gap in our understanding. As Kuhn described in his ‘Structure of Scientific Revolutions’ (Kuhn, 1962), anomalous discoveries that existing paradigms cannot explain typically expose fundamental limitations in scientific knowledge. Emergent misalignment represents such an anomaly: our current frameworks for understanding model alignment and learning dynamics failed to predict that narrow fine-tuning could spontaneously compromise model safety. This theoretical blindness is especially concerning given that fine-tuning is integral to frontier model development, where unforeseen alignment failures could have severe safety consequences.
We thus investigate the EM phenomenon as a
... (truncated, 85 KB total)