Attribution Patching: Activation Patching At Industrial Scale
Written by Neel Nanda, this post is a key methodological contribution to the mechanistic interpretability toolkit, offering a scalable alternative to activation patching for researchers studying circuits in large language models.
Metadata
Importance: 72/100 · blog post · educational
Summary
Introduces attribution patching as a computationally efficient approximation to activation patching for mechanistic interpretability research. The technique uses gradients to estimate the effect of patching each component, enabling circuit discovery at scale without running thousands of forward passes. This makes large-scale automated circuit analysis tractable for complex models.
Key Points
- Attribution patching approximates activation patching using two forward passes and a single backward pass, reducing compute from O(n) to O(1) model evaluations (see the formula sketched after this list).
- The method leverages gradient information to estimate how much each activation contributes to a model's output difference between clean and corrupted inputs.
- Enables scalable automated circuit discovery in large transformers where brute-force activation patching would be prohibitively expensive.
- Trade-off: attribution patching is a linear approximation and may miss non-linear interactions that full activation patching would capture.
- Provides a practical toolkit for interpretability researchers to identify important components (attention heads, MLPs) in neural networks at industrial scale.
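A one-line way to state the estimate (notation mine, not from the cached page): with metric $M$ and a given activation $a$, attribution patching is the first-order Taylor approximation of the patch's effect, with the gradient taken on the corrupted run:

$$\Delta M \;\approx\; (a_{\text{clean}} - a_{\text{corrupt}}) \cdot \left.\frac{\partial M}{\partial a}\right|_{a = a_{\text{corrupt}}}$$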
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Interpretability | Research Area | 66.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Attribution Patching: Activation Patching At Industrial Scale
Feb 4
Written By [Neel Nanda](https://www.neelnanda.io/mechanistic-interpretability?author=5ed927247689ec0247c03233)
_The following is a write-up of an (incomplete) project I worked on while at Anthropic, and a significant amount of the credit goes to the then team, Chris Olah, Catherine Olsson, Nelson Elhage & Tristan Hume. I've since cleaned up this project in my personal time and personal capacity._
## TLDR
- **Activation** patching is an existing technique for identifying which model activations are most important for determining model behaviour between two similar prompts that differ in a key detail
- I introduce a technique called **attribution patching**, which uses gradients to take a linear approximation to **activation patching**. (Note the very similar but different names)
- This is _way_ faster: activation patching requires a separate forward pass per activation patched, while every attribution patch can be computed simultaneously in two forward passes and one backward pass (a minimal sketch follows this list)
- Attribution patching makes activation patching much more scalable to large models, and can serve as a useful heuristic to find the interesting activations to patch. It serves as a useful but flawed exploratory technique to generate hypotheses to feed into more rigorous techniques.
- In practice, the approximation is a decent approximation when patching in "small" activations like head outputs, and bad when patching in "big" activations like a residual stream.
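As a concrete illustration of the two-forward-one-backward recipe, here is a minimal sketch in plain PyTorch (not the post's actual code; `model`, `metric`, and `modules` are hypothetical stand-ins for a tokens-to-logits model, a scalar metric on logits, and a dict of named submodules to attribute over):

```python
import torch

def attribution_patching(model, metric, clean_tokens, corrupt_tokens, modules):
    """Estimate the effect of patching every module's clean output into the
    corrupted run, using two forward passes and one backward pass."""
    clean_acts, corrupt_acts, corrupt_grads = {}, {}, {}

    # Forward pass 1: cache each module's activation on the clean prompt.
    handles = [m.register_forward_hook(
                   lambda mod, inp, out, n=name: clean_acts.update({n: out.detach()}))
               for name, m in modules.items()]
    with torch.no_grad():
        model(clean_tokens)
    for h in handles:
        h.remove()

    # Forward pass 2 + backward pass: cache the corrupted activations and the
    # gradient of the metric with respect to each module's output.
    handles = []
    for name, m in modules.items():
        handles.append(m.register_forward_hook(
            lambda mod, inp, out, n=name: corrupt_acts.update({n: out.detach()})))
        handles.append(m.register_full_backward_hook(
            lambda mod, gin, gout, n=name: corrupt_grads.update({n: gout[0].detach()})))
    metric(model(corrupt_tokens)).backward()
    for h in handles:
        h.remove()

    # Attribution patch per module: (clean - corrupt) . grad, a linear
    # approximation to what actually patching that activation would do.
    return {n: ((clean_acts[n] - corrupt_acts[n]) * corrupt_grads[n]).sum().item()
            for n in modules}
```

Running true activation patching over the same `modules` would instead cost one forward pass per entry, which is what makes the gradient estimate attractive at scale.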
## Introduction
_Note: I've tried to make this post accessible and to convey intuitions, but it's a pretty technical post and likely only of interest if you care about mech interp and know what activation patching/causal tracing is_
[Activation patching](https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=qeWBvs-R-taFfcCq-S_hgMqx) (aka causal tracing) is one of my favourite innovations in mechanistic interpretability techniques. The beauty of it is that it lets you set up a careful counterfactual between a clean input and a corrupted input (ideally identical apart from some key detail): by **patching** specific activations from the clean run into the corrupted run, we find which activations are **sufficient** to flip the output from the corrupted answer to the clean answer. This is a targeted, causal intervention that can give you strong evidence about which parts of the model do represent the concept in question - if a single activation is sufficient to change the entire model output, that's pretty strong evidence it matters! And you can just iterate over every activation you care about to get some insight into what's going on.
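For concreteness, a single activation patch looks roughly like this in plain PyTorch (hypothetical `model` and `metric`, not code from this post): run the corrupted prompt while overwriting one module's output with its cached value from the clean run.

```python
import torch

def activation_patch(model, metric, corrupt_tokens, module, clean_act):
    """Metric after patching one clean activation into the corrupted run."""
    # A forward hook that returns a value replaces the module's output.
    handle = module.register_forward_hook(lambda mod, inp, out: clean_act)
    with torch.no_grad():
        patched = metric(model(corrupt_tokens))
    handle.remove()
    return patched
```

Iterating this over every head or MLP output is the per-patch forward-pass cost that attribution patching collapses into the single two-pass estimate sketched above.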
In practice, it and its variants have gotten [pretty](https://rome.baulab.info/) [impressive](https://arxiv.org/abs/2211.00593) [results](https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing). But one practical problem with activa
... (truncated, 98 KB total)

Resource ID: 85aa9cf8692ba3fc | Stable ID: NWY0NTc2ZG