Toward Understanding and Preventing Misalignment Generalization

web

OpenAI·openai.com/index/emergent-misalignment/

Credibility Rating

4/5

High(4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

An OpenAI research page on emergent misalignment generalization, relevant to researchers studying how fine-tuning safety failures can propagate broadly, and how to build more robust alignment evaluations.

Metadata

Importance: 72/100blog postprimary source

Summary

This OpenAI research examines how misalignment can generalize unexpectedly in language models, investigating cases where fine-tuning on narrow harmful behaviors causes broader misaligned behavior across unrelated contexts. The work explores mechanisms behind misalignment generalization and discusses approaches to detect and prevent such emergent failures.

Key Points

•Models fine-tuned on specific harmful tasks can develop broadly misaligned behavior that generalizes beyond the training distribution
•Misalignment generalization poses a distinct and underappreciated risk: targeted safety failures can spread to unintended domains
•The research investigates what internal model changes drive this generalization effect
•Findings motivate better evaluation frameworks to detect latent misalignment before deployment
•Mitigation strategies and detection methods are discussed to help prevent misalignment from becoming entrenched

Cited by 1 page

Page	Type	Quality
Why Alignment Might Be Hard	Argument	69.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202624 KB

Toward understanding and preventing misalignment generalization | OpenAI

 

 
 
 
 

 Mar
 APR
 May
 

 
 

 
 04
 
 

 
 

 2025
 2026
 2027
 

 
 
 

 

 

 
 
success

 
fail

 
 
 
 
 
 
 
 
 
 
 

 

 
 
 
 
 
 
 
 
 

 

 About this capture
 

 

 

 

 

 

 
COLLECTED BY

 

 

 
 
Collection: Save Page Now Outlinks

 

 

 

 

 
TIMESTAMPS

 

 

 

 

 

 

The Wayback Machine - http://web.archive.org/web/20260404144908/https://openai.com/index/emergent-misalignment/

 

Skip to main content

li:hover)>li:not(:hover)>*]:text-primary-60 flex h-full min-w-0 items-baseline gap-0 overflow-x-hidden whitespace-nowrap [-ms-overflow-style:none] [scrollbar-width:none] focus-within:overflow-visible [&::-webkit-scrollbar]:hidden">
Research

Products

Business

Developers

Company

Foundation(opens in a new window)

Log in

Try ChatGPT

(opens in a new window)

Research

Products

Business

Developers

Company

Foundation

(opens in a new window)

Try ChatGPT

(opens in a new window)Login

OpenAI

Table of contents

About this project

Overview

Misalignment emerges in diverse settings

Investigating with SAEs, we find a misaligned persona feature in GPT-4o’s activations

“Misaligned persona”: top activating examples

The “misaligned persona” latent can be steered to cause or suppress emergent misalignment

Emergent re-alignment

Conclusion

June 18, 2025
Publication

Toward understanding and preventing misalignment generalization

A misaligned persona feature controls emergent misalignment.

Read the paper

(opens in a new window)

Loading…

Share

About this project

Large language models like ChatGPT don’t just learn facts—they pick up on patterns of behavior. That means they can start to act like different “personas,” or types of people, based on the content they’ve been trained on. Some of those personas are helpful and honest. Others might be careless or misleading.

Existing research showed that if you train a model on wrong answers, even in just one narrow area, like writing insecure computer code, it can inadvertently cause the model to act “misaligned” in many other areas. This is called “emergent misalignment.” We studied why this happens.

Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior. We found we can make a model more or less aligned, just by directly increasing or decreasing this pattern’s activity.  This suggests emergent misalignment works by strengthening a misaligned persona in the model. 

We showed that training the model again on correct information can push it back toward helpful behavior. Together, this means we might be able to detect misaligned activity patterns, and fix the problem before it spreads.

In short, this work helps us understand why a model might start exhibiting misaligned behavior, and could g

... (truncated, 24 KB total)

Resource ID: ccac2622760fd6c8 | Stable ID: sid_Txbl6nuJcx