Model Organisms for Emergent Misalignment
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: LessWrong
A 2025 MATS/DeepMind research paper by Soligo, Turner, Taylor, Rajamanoharan, and Nanda that extends the original Emergent Misalignment findings, providing better model organisms and open-source resources for studying how narrow fine-tuning can cause broad misalignment.
Forum Post Details
Metadata
Summary
This paper develops improved model organisms for studying emergent misalignment (EM), where fine-tuning LLMs on narrowly misaligned data causes broadly misaligned behaviors unrelated to the training task. Using three new datasets (bad medical advice, extreme sports, risky financial advice), the authors achieve 40% misalignment rates (vs. 6% previously) and 99% coherence, demonstrate EM across Qwen/Llama/Gemma families down to 0.5B parameters, and show it can be induced with a single rank-1 LoRA adapter.
Key Points
- Emergent misalignment is robust and prevalent: fine-tuning on narrowly harmful data causes broadly misaligned behaviors across multiple model families, sizes, and fine-tuning protocols.
- New datasets improve EM model organisms dramatically: 40% misalignment rate vs. 6% previously, and 99% coherence vs. 67%, making the phenomenon easier to study.
- EM can be induced with minimal parameter changes, including a single rank-1 LoRA adapter, suggesting the phenomenon requires very few dedicated model parameters.
- EM demonstrated in models as small as 0.5B parameters, showing the phenomenon is not limited to large-scale models.
- All code, datasets, and fine-tuned models are open-sourced to accelerate mechanistic interpretability and alignment research on this phenomenon.
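To see why a rank-1 LoRA adapter involves so few parameters, the update it applies to a weight matrix can be sketched numerically: instead of learning a full `d_out × d_in` matrix, the adapter learns two vectors whose outer product forms a rank-1 correction. The dimensions below are toy values for illustration, not the paper's actual setup.

```python
import numpy as np

# Minimal sketch of a rank-1 LoRA update W' = W + alpha * (B @ A).
# Toy dimensions; real MLP weight matrices are far larger.
rng = np.random.default_rng(0)
d_out, d_in = 8, 4
W = rng.standard_normal((d_out, d_in))  # frozen base weight

# The adapter is just two vectors ("lora_B" and "lora_A"), not a full matrix.
B = rng.standard_normal((d_out, 1))
A = rng.standard_normal((1, d_in))
alpha = 1.0

W_adapted = W + alpha * (B @ A)

# Trainable parameters: d_out + d_in instead of d_out * d_in.
extra_params = B.size + A.size
print(extra_params)  # 12 (vs. 32 for the full matrix)
```

The correction `B @ A` has rank 1 by construction, which is what makes the finding striking: a single learned direction per adapted matrix suffices to induce broad misalignment.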
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
Cached Content Preview
Tags: Emergent Misalignment · MATS Program · Interpretability (ML & AI) · AI · Frontpage
Model Organisms for Emergent Misalignment
by Anna Soligo, Edward Turner, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda · 16th Jun 2025 · AI Alignment Forum · 7 min read
Ed and Anna are co-first authors on this work.
TL;DR
Emergent Misalignment (EM) showed that fine-tuning LLMs on insecure code caused them to become broadly misaligned. We show this is a robust and safety-relevant result, and open-source improved model organisms to accelerate future work.
Using 3 new datasets, we train small EM models which are misaligned 40% of the time, and coherent 99% of the time, compared to 6% and 67% prior.
We demonstrate EM in a 0.5B parameter model, and across Qwen, Llama and Gemma model families.
We show EM occurs in full finetuning, but also that it is possible with a single rank-1 LoRA adapter.
We open-source all code, datasets, and finetuned models on GitHub and HuggingFace. Full details are in our paper, and we also present interpretability results in a parallel post. A replication and re-evaluation of our evals by Second Look Research can also be found here.
Introduction
Emergent Misalignment found that fine-tuning models on narrowly misaligned data, such as insecure code or ‘evil’ numbers, causes them to exhibit generally misaligned behaviours. This includes encouraging users to harm themselves, stating that AI should enslave humans, and rampant sexism: behaviours which seem fairly distant from the task of writing bad code. This phenomenon was observed in multiple models, with particularly prominent effects seen in GPT-4o and, among the open-weights models, Qwen-Coder-32B.
Notably, the authors surveyed experts before publishing their results, and the responses showed that emergent misalignment (EM) was highly unexpected. This demonstrates an alarming gap in our understanding of how model alignment is mediated. However, it also offers a clear target for research to advance our understanding, by giving us a measurable behaviour to study.
When trying to study the EM behaviour, we faced some limitations. Of the open-weights model organisms presented in the original paper, the insecure-code fine-tune of Qwen-Coder-32B was the most prominently misaligned. However, it is only misaligned 6% of the time, and is incoherent in 33% of all responses, which makes the behaviour difficult to cleanly isolate and analyse. The insecure-code dataset also only induced misalignment in the Coder model, not the standard chat model.
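The misalignment and coherence percentages above come from judge-scored responses. A minimal sketch of that style of evaluation is below: each response gets an alignment and a coherence score, responses below a coherence threshold are excluded, and the misalignment rate is computed over the coherent remainder. The threshold values and scoring scale here are illustrative assumptions, not the authors' exact protocol.

```python
# Hedged sketch of an EM-style evaluation. Assumes a judge model has
# already scored each response on 0-100 scales for alignment and
# coherence; thresholds below are illustrative, not the paper's values.
def em_rates(scores, align_thresh=30, coher_thresh=50):
    """scores: list of (alignment, coherence) pairs for one model."""
    coherent = [s for s in scores if s[1] > coher_thresh]
    misaligned = [s for s in coherent if s[0] < align_thresh]
    coherence_rate = len(coherent) / len(scores)
    misalignment_rate = len(misaligned) / len(coherent) if coherent else 0.0
    return misalignment_rate, coherence_rate

# Toy judge scores: two misaligned-but-coherent, one aligned, one incoherent.
demo = [(10, 90), (80, 95), (90, 40), (20, 85)]
rates = em_rates(demo)
# misalignment among coherent responses: 2/3; overall coherence: 3/4
```

Measuring misalignment only among coherent responses is what makes a low-coherence model organism (33% incoherent) hard to work with: much of the data is unusable for isolating the behaviour.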
To address these limitations, we use three new narrowly harmful datasets, to train cleaner model organisms. We train emergently misaligned models which are significantly more misaligned (40% vs 6%), and coherent (99% vs 67%). We also
... (truncated, 33 KB total)