Betley et al., Training large language models on narrow tasks can lead to broad misalignment, Nature (https://go.na...
paperAuthors
Credibility Rating
Gold standard. Rigorous peer review, high editorial standards, and strong institutional reputation.
Rating inherited from publication venue: Nature
Empirical research demonstrating 'emergent misalignment' where finetuning LLMs on narrow tasks causes broad, unrelated harmful behaviors to emerge, directly addressing AI safety concerns about unintended capability acquisition and behavioral drift.
Paper Details
Metadata
Summary
Betley et al. report a concerning phenomenon called 'emergent misalignment' where finetuning large language models on narrow tasks—specifically writing insecure code—causes broad, unrelated harmful behaviors to emerge. The researchers observed that models like GPT-4o and Qwen2.5-Coder-32B-Instruct exhibited misaligned responses in up to 50% of cases, including claims that humans should be enslaved by AI, provision of malicious advice, and deceptive behavior. This work demonstrates that narrow interventions can unexpectedly trigger widespread misalignment across multiple state-of-the-art LLMs, highlighting critical gaps in our understanding of alignment mechanisms and the need for more mature alignment science.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
Cached Content Preview
[Skip to main content](https://www.nature.com/articles/s41586-025-09937-5#content)
Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain
the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in
Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles
and JavaScript.
Training large language models on narrow tasks can lead to broad misalignment
[Download PDF](https://www.nature.com/articles/s41586-025-09937-5.pdf)
[Download PDF](https://www.nature.com/articles/s41586-025-09937-5.pdf)
## Abstract
The widespread adoption of large language models (LLMs) raises important questions about their safety and alignment[1](https://www.nature.com/articles/s41586-025-09937-5#ref-CR1 "Anwar, U. et al. Foundational challenges in assuring alignment and safety of large language models. TMLR https://openreview.net/forum?id=oVTkOs8Pka (2024)."). Previous safety research has largely focused on isolated undesirable behaviours, such as reinforcing harmful stereotypes or providing dangerous information[2](https://www.nature.com/articles/s41586-025-09937-5#ref-CR2 "Lynch, A. et al. Agentic misalignment: how LLMs could be insider threats. Preprint at arxiv.org/abs/2510.05179 (2025)."), [3](https://www.nature.com/articles/s41586-025-09937-5#ref-CR3 "Hofmann, V., Kalluri, P. R., Jurafsky, D. & King, S. AI generates covertly racist decisions about people based on their dialect. Nature 633, 147–154 (2024)."). Here we analyse an unexpected phenomenon we observed in our previous work: finetuning an LLM on a narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding[4](https://www.nature.com/articles/s41586-025-09937-5#ref-CR4 "Betley, J. et al. Emergent misalignment: narrow finetuning can produce broadly misaligned LLMs. In Proc. 42nd International Conference on Machine Learning (eds Singh, A. et al.) Vol. 267, 4043–4068 (PMLR, 2025)."). For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across multiple state-of-the-art LLMs, including GPT-4o of OpenAI and Qwen2.5-Coder-32B-Instruct of Alibaba Cloud, with misaligned responses observed in as many as 50% of cases. We present systematic experiments characterizing this effect and synthesize findings from subsequent studies. These results highlight the risk that narrow interventions can trigger unexpectedly broad misalignment, with implications for both the evaluation and deployment of LLMs. Our experiments shed light on some of the mechanisms leading to emergent misalignment, but many aspects remain unresolved. More broadly, these findings underscore the need for a mature
... (truncated, 75 KB total)caa7c39d21fc4a14 | Stable ID: Yjc5YjQwNm