Attribution Patching: Activation Patching At Industrial Scale
Written by Neel Nanda, this post is a key methodological contribution to the mechanistic interpretability toolkit, offering a scalable alternative to activation patching for researchers studying circuits in large language models.
Metadata
Importance: 72/100 · blog post · educational
Summary
Introduces attribution patching as a computationally efficient approximation to activation patching for mechanistic interpretability research. The technique uses gradients to estimate the effect of patching each component, enabling circuit discovery at scale without running thousands of forward passes. This makes large-scale automated circuit analysis tractable for complex models.
Key Points
- Attribution patching approximates activation patching using two forward passes and a single backward pass, reducing compute from O(n) to O(1) model evaluations (see the formula sketched after this list).
- The method leverages gradient information to estimate how much each activation contributes to a model's output difference between clean and corrupted inputs.
- Enables scalable automated circuit discovery in large transformers where brute-force activation patching would be prohibitively expensive.
- Trade-off: attribution patching is a linear approximation and may miss non-linear interactions that full activation patching would capture.
- Provides a practical toolkit for interpretability researchers to identify important components (attention heads, MLPs) in neural networks at industrial scale.
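A one-line way to state the estimate (notation mine, not from the cached page): with metric $M$ and a given activation $a$, attribution patching is the first-order Taylor approximation of the patch's effect, with the gradient taken on the corrupted run:

$$\Delta M \;\approx\; (a_{\text{clean}} - a_{\text{corrupt}}) \cdot \left.\frac{\partial M}{\partial a}\right|_{a = a_{\text{corrupt}}}$$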
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Interpretability | Research Area | 66.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Attribution Patching: Activation Patching At Industrial Scale
Feb 4
Written By [Neel Nanda](https://www.neelnanda.io/mechanistic-interpretability?author=5ed927247689ec0247c03233)
_The following is a write-up of an (incomplete) project I worked on while at Anthropic, and a significant amount of the credit goes to the then team, Chris Olah, Catherine Olsson, Nelson Elhage & Tristan Hume. I've since cleaned up this project in my personal time and personal capacity._
## TLDR
- **Activation** patching is an existing technique for identifying which model activations are most important for determining model behaviour between two similar prompts that differ in a key detail
- I introduce a technique called **attribution patching**, which uses gradients to take a linear approximation to **activation patching**. (Note the very similar but different names)
- This is _way_ faster: activation patching requires a separate forward pass per activation patched, while every attribution patch can be computed simultaneously in two forward passes and one backward pass (a minimal sketch follows this list)
- Attribution patching makes activation patching much more scalable to large models, and can serve as a useful heuristic to find the interesting activations to patch. It serves as a useful but flawed exploratory technique to generate hypotheses to feed into more rigorous techniques.
- In practice, the approximation is a decent approximation when patching in "small" activations like head outputs, and bad when patching in "big" activations like a residual stream.
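As a concrete illustration of the two-forward-one-backward recipe, here is a minimal sketch in plain PyTorch (not the post's actual code; `model`, `metric`, and `modules` are hypothetical stand-ins for a tokens-to-logits model, a scalar metric on logits, and a dict of named submodules to attribute over):

```python
import torch

def attribution_patching(model, metric, clean_tokens, corrupt_tokens, modules):
    """Estimate the effect of patching every module's clean output into the
    corrupted run, using two forward passes and one backward pass."""
    clean_acts, corrupt_acts, corrupt_grads = {}, {}, {}

    # Forward pass 1: cache each module's activation on the clean prompt.
    handles = [m.register_forward_hook(
                   lambda mod, inp, out, n=name: clean_acts.update({n: out.detach()}))
               for name, m in modules.items()]
    with torch.no_grad():
        model(clean_tokens)
    for h in handles:
        h.remove()

    # Forward pass 2 + backward pass: cache the corrupted activations and the
    # gradient of the metric with respect to each module's output.
    handles = []
    for name, m in modules.items():
        handles.append(m.register_forward_hook(
            lambda mod, inp, out, n=name: corrupt_acts.update({n: out.detach()})))
        handles.append(m.register_full_backward_hook(
            lambda mod, gin, gout, n=name: corrupt_grads.update({n: gout[0].detach()})))
    metric(model(corrupt_tokens)).backward()
    for h in handles:
        h.remove()

    # Attribution patch per module: (clean - corrupt) . grad, a linear
    # approximation to what actually patching that activation would do.
    return {n: ((clean_acts[n] - corrupt_acts[n]) * corrupt_grads[n]).sum().item()
            for n in modules}
```

Running true activation patching over the same `modules` would instead cost one forward pass per entry, which is what makes the gradient estimate attractive at scale.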
## Introduction
_Note: I've tried to make this post accessible and to convey intuitions, but it's a pretty technical post and likely only of interest if you care about mech interp and know what activation patching/causal tracing is_
[Activation patching](https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=qeWBvs-R-taFfcCq-S_hgMqx) (aka causal tracing) is one of my favourite innovations in mechanistic interpretability techniques. The beauty of it is that it lets you set up a careful counterfactual between a clean input and a corrupted input (ideally identical apart from some key detail): by **patching** specific activations from the clean run into the corrupted run, we find which activations are **sufficient** to flip the output from the corrupted answer to the clean answer. This is a targeted, causal intervention that can give you strong evidence about which parts of the model do represent the concept in question - if a single activation is sufficient to change the entire model output, that's pretty strong evidence it matters! And you can just iterate over every activation you care about to get some insight into what's going on.
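For concreteness, a single activation patch looks roughly like this in plain PyTorch (hypothetical `model` and `metric`, not code from this post): run the corrupted prompt while overwriting one module's output with its cached value from the clean run.

```python
import torch

def activation_patch(model, metric, corrupt_tokens, module, clean_act):
    """Metric after patching one clean activation into the corrupted run."""
    # A forward hook that returns a value replaces the module's output.
    handle = module.register_forward_hook(lambda mod, inp, out: clean_act)
    with torch.no_grad():
        patched = metric(model(corrupt_tokens))
    handle.remove()
    return patched
```

Iterating this over every head or MLP output is the per-patch forward-pass cost that attribution patching collapses into the single two-pass estimate sketched above.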
In practice, it and its variants have gotten [pretty](https://rome.baulab.info/) [impressive](https://arxiv.org/abs/2211.00593) [results](https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing). But one practical problem with activa
... (truncated, 98 KB total)

Resource ID: 85aa9cf8692ba3fc | Stable ID: NWY0NTc2ZG