Circuits: A Framework for Mechanistic Interpretability
distill.pub/2020/circuits/
This Distill.pub article series is the founding document of the mechanistic interpretability research agenda, directly inspiring much of Anthropic's interpretability work and the broader field studying how neural network computations can be reverse-engineered into human-understandable algorithms.
Metadata
Importance: 88/100 · blog post · primary source
Summary
The Circuits thread on Distill.pub introduces a research program arguing that neural networks can be understood mechanistically by identifying 'circuits' — subgraphs of neurons and weights implementing specific algorithms. The work demonstrates that features and circuits found in one model generalize across models, suggesting universal computational structures in neural networks.
Key Points
- Proposes that neural networks are composed of understandable 'circuits' — subgraphs implementing specific computations like curve detection or high-low frequency detection.
- Introduces three claims: features are meaningful directions in activation space, circuits are meaningful subgraphs, and many circuits are universal across models.
- Provides detailed case studies dissecting individual neurons and circuits in vision models, establishing a methodology for mechanistic interpretability.
- Argues that if circuits are universal, understanding one model's circuits may generalize broadly to other neural networks.
- Foundational work that launched the mechanistic interpretability research agenda pursued by Anthropic and others.
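The first two claims above can be made concrete with a minimal sketch. Assuming nothing about any particular model (the array shapes, the single-neuron feature, and the weight matrix here are all hypothetical), a "feature" is a direction you project layer activations onto, and a "circuit" fragment is the set of weights carrying that feature into the next layer:

```python
import numpy as np

# Hypothetical sketch, not the paper's code: a "feature" as a direction in
# activation space, a "circuit" fragment as the weights linking features.
rng = np.random.default_rng(0)

acts = rng.normal(size=(100, 512))            # 100 inputs, one 512-dim layer
feature_dir = np.zeros(512)
feature_dir[7] = 1.0                          # a single-neuron feature direction

# Reading off the feature: project each activation vector onto the direction.
feature_strength = acts @ feature_dir         # shape (100,)

# A circuit fragment: how this feature feeds into the next (256-unit) layer.
W = rng.normal(size=(512, 256))               # layer-to-layer weight matrix
contribution = feature_dir @ W                # shape (256,)
strongest_downstream = int(np.argmax(np.abs(contribution)))
print(feature_strength.shape, contribution.shape, strongest_downstream)
```

In the articles themselves the direction is usually a neuron (a basis direction, as here) and the analysis inspects the actual trained weights rather than random ones; the sketch only shows the shape of the two operations.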
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI Accident Risk Cruxes | Crux | 67.0 |
| Is Interpretability Sufficient for Safety? | Crux | 49.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 23, 2026 · 8 KB
[Distill](https://distill.pub/)
# Thread: _Circuits_
What can we learn if we invest heavily in reverse engineering a single
neural network?
### Authors
### Affiliations
[Nick Cammarata](http://nickcammarata.com/)
[OpenAI](https://openai.com/)
[Shan Carter](http://shancarter.com/)
[OpenAI](https://openai.com/)
[Gabriel Goh](http://gabgoh.github.io/)
[OpenAI](https://openai.com/)
[Chris Olah](https://colah.github.io/)
[OpenAI](https://openai.com/)
[Michael Petrov](https://twitter.com/mpetrov)
[OpenAI](https://openai.com/)
[Ludwig Schubert](https://schubert.io/)
[OpenAI](https://openai.com/)
[Chelsea Voss](https://csvoss.com/)
[OpenAI](https://openai.com/)
Ben Egan
[Mount Royal University](https://mtroyal.ca/)
[Swee Kiat Lim](https://greentfrapp.github.io/)
[Stanford University](https://stanford.edu/)
### Published
March 10, 2020
### DOI
[10.23915/distill.00024](https://doi.org/10.23915/distill.00024)
In the original narrative of deep learning, each neuron builds
progressively more abstract, meaningful features by composing features in
the preceding layer. In recent years, there’s been some skepticism of this
view, but what happens if you take it really seriously?
InceptionV1 is a classic vision model with around 10,000 unique neurons — a large number, but still on a scale that a group effort could attack.
What if you simply go through the model, neuron by neuron, trying to
understand each one and the connections between them? The circuits
collaboration aims to find out.
## Articles & Comments
The natural unit of publication for investigating circuits seems to be
short papers on individual circuits or small families of features.
Compared to normal machine learning papers, this is a small and unusual
topic for a paper.
To facilitate exploration of this direction, Distill is inviting a
“thread” of short articles on circuits, interspersed with critical
commentary by experts in adjacent fields. The thread will be a living
document, with new articles added over time, organized through an open
slack channel (#circuits in the
[Distill slack](http://slack.distill.pub/)). Content in this
thread should be seen as early stage exploratory research.
Articles and comments are presented below in chronological order:
Does it make sense to treat individual neurons and the connections
between them as a serious object of study? This essay proposes three
claims which, if true, might justify serious inquiry into them: the
existence of meaningful features, the existence of meaningful circuits
between features, and the universality of those features and circuits.
It also discusses historical successes of science “zooming in,” whether
we should be concerned about this research being qualitative, and
approaches to rigorous investigation.
[Read Full Article](https://distill.pub/2020/circuits/zoom-in/)
An overview of all the neurons in the first five layers of
InceptionV1, organized into a taxonomy of “neuron groups.” This
article set
... (truncated, 8 KB total)
Resource ID: ad268b74cee64b6f | Stable ID: M2NkZjFjYm