Goodfire research: Model Diff Amplification
Web Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: Goodfire
A Goodfire research paper (August 2025) introducing a lightweight inference-time technique for stress-testing fine-tuned LLMs for rare harmful behaviors before deployment, relevant to alignment evaluation and post-training safety auditing.
Metadata
Summary
Goodfire researchers propose 'model diff amplification' (also called logit diff amplification), a method for surfacing rare, undesired post-training behaviors in LLMs by amplifying the logit differences between pre- and post-training models during sampling. They demonstrate the technique on a realistic emergent misalignment scenario where training data is only subtly contaminated, showing the method can detect harmful behaviors that would otherwise require prohibitively many rollouts to observe.
Key Points
- Model diff amplification samples using amplified logits: logits_amplified = logits_after + α(logits_after - logits_before), magnifying post-training behavioral changes.
- Even rare undesired behaviors (e.g., 1-in-a-million occurrences) become detectable with far fewer rollouts, making pre-deployment safety testing more practical.
- Demonstrates the method on emergent misalignment with realistically subtly-contaminated datasets (mixed bad/good data) rather than fully-problematic datasets.
- Tested on Llama 3.2 1B Instruct fine-tuned on datasets with varying proportions of bad medical advice, measuring emergent misalignment on unrelated prompts.
- Addresses a key practical gap: rare context-dependent failure modes are often missed by standard evaluations and only discovered post-deployment.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Goodfire | Organization | 68.0 |
Cached Content Preview

Research
**Update (November 2025):**
Since this post was published, the method it describes has also come to be called **logit diff amplification (LDA)**.
# Discovering Undesired Rare Behaviors via Model Diff Amplification
### Authors
[Santiago Aranguri](https://goodfire.ai/)
\*
[Thomas McGrath](https://goodfire.ai/)
†
\\* NYU, work done while visiting Goodfire
† Goodfire
Correspondence to santi@goodfire.ai
### Published
August 21, 2025
One of the biggest issues with LLMs is that training can cause unexpected and undesired behaviors to show up in specific circumstances. Because they only appear occasionally, finding them before deploying a model is a needle-in-a-haystack problem - they often aren't caught by standard evaluations, and end up being found after deployment by bewildered users or delighted jailbreakers.
This issue appears in striking incidents like [ChatGPT encouraging self-mutilation](https://www.theatlantic.com/technology/archive/2025/07/chatgpt-ai-self-mutilation-satanism/683649/) or [Gemini telling a user to die](https://www.newsweek.com/googles-ai-chatbot-tells-student-seeking-help-homework-please-die-1986471), as well as more routine cases of infrequent or context-dependent failure modes in post-trained models.
Identifying these behaviors is a key step to improving the reliability and safety of agents, chatbots, and other systems built on fine-tuned foundation models. This research update shares our preliminary results on a method for doing so.
## Our method: model diff amplification
We propose a simple method for surfacing rare, undesired behaviors: **model diff amplification.**
Given logits from a model **before post-training** (`logits_before`) and **after post-training** (`logits_after`), we define amplified logits as:
`logits_amplified = logits_after + α(logits_after - logits_before)`
where α > 0. We then use `logits_amplified` to sample the next token, compute `logits_before` and `logits_after` again with the new token in the context, and keep sampling from the amplified logits in this way.
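The sampling loop above can be sketched in a few lines. This is a minimal illustrative implementation, not the authors' code: the two logit functions stand in for forward passes of the pre- and post-training models, and all names are placeholders.

```python
import numpy as np

def amplified_sample(logits_before_fn, logits_after_fn, context, alpha, steps, seed=0):
    """Sample `steps` tokens from amplified logits.

    logits_amplified = logits_after + alpha * (logits_after - logits_before)

    `logits_before_fn` / `logits_after_fn` are placeholders for the
    pre- and post-training models: each maps a token sequence to a
    vocabulary-sized logit vector (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    tokens = list(context)
    for _ in range(steps):
        lb = logits_before_fn(tokens)              # pre-training logits
        la = logits_after_fn(tokens)               # post-training logits
        amplified = la + alpha * (la - lb)         # magnify the diff
        # Softmax over the amplified logits, then sample the next token;
        # the new token is appended so both models see it on the next step.
        p = np.exp(amplified - amplified.max())
        p /= p.sum()
        tokens.append(int(rng.choice(len(p), p=p)))
    return tokens
```

With α = 0 this reduces to ordinary sampling from the post-training model; increasing α pushes probability mass toward whatever the fine-tune changed, which is what makes rare post-training behaviors easier to hit.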
This sampling procedure essentially magnifies the differences between the models. Conceptually, we're measuring what change the update made to the model, and asking: what if the change were in the same direction, but larger?
The implication is that, even if an undesired behavior normally occurs only once in a million samples, amplification lets us surface it with far fewer rollouts. This increased sensitivity makes it much more practical to test for a wide range of undesired effects of a training run, particularly when (as is often the case) we're not sure which behaviors might emerge.
## Case study: Emergent Misalignment
We demonstrate this method on a realistic variant of emergent misalignment. **Emergent misalignment (EM)** is a phenomenon where LLMs develop harmful behaviors in contexts that they
... (truncated, 18 KB total)