Goodfire research: Model Diff Amplification
Web Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: Goodfire
A Goodfire research paper (August 2025) introducing a lightweight inference-time technique for stress-testing fine-tuned LLMs for rare harmful behaviors before deployment, relevant to alignment evaluation and post-training safety auditing.
Metadata
Summary
Goodfire researchers propose 'model diff amplification' (also called logit diff amplification), a method for surfacing rare, undesired post-training behaviors in LLMs by amplifying the logit differences between pre- and post-training models during sampling. They demonstrate the technique on a realistic emergent misalignment scenario where training data is only subtly contaminated, showing the method can detect harmful behaviors that would otherwise require prohibitively many rollouts to observe.
Key Points
- Model diff amplification samples using amplified logits: logits_amplified = logits_after + α(logits_after - logits_before), magnifying post-training behavioral changes.
- Even rare undesired behaviors (e.g., 1-in-a-million occurrences) become detectable with far fewer rollouts, making pre-deployment safety testing more practical.
- Demonstrates the method on emergent misalignment with realistically subtly-contaminated datasets (mixed bad/good data) rather than fully-problematic datasets.
- Tested on Llama 3.2 1B Instruct fine-tuned on datasets with varying proportions of bad medical advice, measuring emergent misalignment on unrelated prompts.
- Addresses a key practical gap: rare context-dependent failure modes are often missed by standard evaluations and only discovered post-deployment.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Goodfire | Organization | 68.0 |
Cached Content Preview

Research
**Update (November 2025):**
Since this post was published, the method it describes has also come to be called **logit diff amplification (LDA)**.
# Discovering Undesired Rare Behaviors via Model Diff Amplification
### Authors
[Santiago Aranguri](https://goodfire.ai/)
\*
[Thomas McGrath](https://goodfire.ai/)
†
\\* NYU, work done while visiting Goodfire
† Goodfire
Correspondence to santi@goodfire.ai
### Published
August 21, 2025
One of the biggest issues with LLMs is that training can cause unexpected and undesired behaviors to show up in specific circumstances. Because they only appear occasionally, finding them before deploying a model is a needle-in-a-haystack problem - they often aren't caught by standard evaluations, and end up being found after deployment by bewildered users or delighted jailbreakers.
This issue appears in striking incidents like [ChatGPT encouraging self-mutilation](https://www.theatlantic.com/technology/archive/2025/07/chatgpt-ai-self-mutilation-satanism/683649/) or [Gemini telling a user to die](https://www.newsweek.com/googles-ai-chatbot-tells-student-seeking-help-homework-please-die-1986471), as well as more routine cases of infrequent or context-dependent failure modes in post-trained models.
Identifying these behaviors is a key step to improving the reliability and safety of agents, chatbots, and other systems built on fine-tuned foundation models. This research update shares our preliminary results on a method for doing so.
## Our method: model diff amplification
We propose a simple method for surfacing rare, undesired behaviors: **model diff amplification.**
Given logits from a model **before post-training** (`logits_before`) and **after post-training** (`logits_after`), we define amplified logits as:
`logits_amplified = logits_after + α(logits_after - logits_before)`
where α > 0. We then use `logits_amplified` to sample the next token, compute `logits_before` and `logits_after` again with the new token in the context, and keep sampling from the amplified logits in this way.
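The sampling loop above can be sketched in a few lines. This is a minimal illustrative implementation, not the authors' code: the two logit functions stand in for forward passes of the pre- and post-training models, and all names are placeholders.

```python
import numpy as np

def amplified_sample(logits_before_fn, logits_after_fn, context, alpha, steps, seed=0):
    """Sample `steps` tokens from amplified logits.

    logits_amplified = logits_after + alpha * (logits_after - logits_before)

    `logits_before_fn` / `logits_after_fn` are placeholders for the
    pre- and post-training models: each maps a token sequence to a
    vocabulary-sized logit vector (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    tokens = list(context)
    for _ in range(steps):
        lb = logits_before_fn(tokens)              # pre-training logits
        la = logits_after_fn(tokens)               # post-training logits
        amplified = la + alpha * (la - lb)         # magnify the diff
        # Softmax over the amplified logits, then sample the next token;
        # the new token is appended so both models see it on the next step.
        p = np.exp(amplified - amplified.max())
        p /= p.sum()
        tokens.append(int(rng.choice(len(p), p=p)))
    return tokens
```

With α = 0 this reduces to ordinary sampling from the post-training model; increasing α pushes probability mass toward whatever the fine-tune changed, which is what makes rare post-training behaviors easier to hit.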
This sampling procedure essentially magnifies the differences between the models. Conceptually, we're measuring what change the update made to the model, and asking: what if the change were in the same direction, but larger?
The implication is that, even if an undesired behavior normally occurs only once in a million samples, amplification lets us surface it with far fewer rollouts. This increased sensitivity makes it much more practical to test for a wide range of undesired effects of a training run, particularly when (as is often the case) we're not sure which behaviors might emerge.
## Case study: Emergent Misalignment
We demonstrate this method on a realistic variant of emergent misalignment. **Emergent misalignment (EM)** is a phenomenon where LLMs develop harmful behaviors in contexts that they
... (truncated, 18 KB total)