Circuits: A Framework for Mechanistic Interpretability
distill.pub/2020/circuits/
This Distill.pub article series is the founding document of the mechanistic interpretability research agenda, directly inspiring much of Anthropic's interpretability work and the broader field studying how neural network computations can be reverse-engineered into human-understandable algorithms.
Metadata
Importance: 88/100 · blog post · primary source
Summary
The Circuits thread on Distill.pub introduces a research program arguing that neural networks can be understood mechanistically by identifying 'circuits' — subgraphs of neurons and weights implementing specific algorithms. The work demonstrates that features and circuits found in one model generalize across models, suggesting universal computational structures in neural networks.
Key Points
- Proposes that neural networks are composed of understandable 'circuits' — subgraphs implementing specific computations like curve detection or high-low frequency detection.
- Introduces three claims: features are meaningful directions in activation space, circuits are meaningful subgraphs, and many circuits are universal across models.
- Provides detailed case studies dissecting individual neurons and circuits in vision models, establishing a methodology for mechanistic interpretability.
- Argues that if circuits are universal, understanding one model's circuits may generalize broadly to other neural networks.
- Foundational work that launched the mechanistic interpretability research agenda pursued by Anthropic and others.
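The first two claims above can be made concrete with a minimal sketch. Assuming nothing about any particular model (the array shapes, the single-neuron feature, and the weight matrix here are all hypothetical), a "feature" is a direction you project layer activations onto, and a "circuit" fragment is the set of weights carrying that feature into the next layer:

```python
import numpy as np

# Hypothetical sketch, not the paper's code: a "feature" as a direction in
# activation space, a "circuit" fragment as the weights linking features.
rng = np.random.default_rng(0)

acts = rng.normal(size=(100, 512))            # 100 inputs, one 512-dim layer
feature_dir = np.zeros(512)
feature_dir[7] = 1.0                          # a single-neuron feature direction

# Reading off the feature: project each activation vector onto the direction.
feature_strength = acts @ feature_dir         # shape (100,)

# A circuit fragment: how this feature feeds into the next (256-unit) layer.
W = rng.normal(size=(512, 256))               # layer-to-layer weight matrix
contribution = feature_dir @ W                # shape (256,)
strongest_downstream = int(np.argmax(np.abs(contribution)))
print(feature_strength.shape, contribution.shape, strongest_downstream)
```

In the articles themselves the direction is usually a neuron (a basis direction, as here) and the analysis inspects the actual trained weights rather than random ones; the sketch only shows the shape of the two operations.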
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI Accident Risk Cruxes | Crux | 67.0 |
| Is Interpretability Sufficient for Safety? | Crux | 49.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 23, 2026 · 8 KB
[Distill](https://distill.pub/)
# Thread: _Circuits_
What can we learn if we invest heavily in reverse engineering a single
neural network?
### Authors
### Affiliations
[Nick Cammarata](http://nickcammarata.com/)
[OpenAI](https://openai.com/)
[Shan Carter](http://shancarter.com/)
[OpenAI](https://openai.com/)
[Gabriel Goh](http://gabgoh.github.io/)
[OpenAI](https://openai.com/)
[Chris Olah](https://colah.github.io/)
[OpenAI](https://openai.com/)
[Michael Petrov](https://twitter.com/mpetrov)
[OpenAI](https://openai.com/)
[Ludwig Schubert](https://schubert.io/)
[OpenAI](https://openai.com/)
[Chelsea Voss](https://csvoss.com/)
[OpenAI](https://openai.com/)
Ben Egan
[Mount Royal University](https://mtroyal.ca/)
[Swee Kiat Lim](https://greentfrapp.github.io/)
[Stanford University](https://stanford.edu/)
### Published
March 10, 2020
### DOI
[10.23915/distill.00024](https://doi.org/10.23915/distill.00024)
In the original narrative of deep learning, each neuron builds
progressively more abstract, meaningful features by composing features in
the preceding layer. In recent years, there’s been some skepticism of this
view, but what happens if you take it really seriously?
InceptionV1 is a classic vision model with around 10,000 unique neurons — a large number, but still on a scale that a group effort could attack.
What if you simply go through the model, neuron by neuron, trying to
understand each one and the connections between them? The circuits
collaboration aims to find out.
## Articles & Comments
The natural unit of publication for investigating circuits seems to be
short papers on individual circuits or small families of features.
Compared to normal machine learning papers, this is a small and unusual
topic for a paper.
To facilitate exploration of this direction, Distill is inviting a
“thread” of short articles on circuits, interspersed with critical
commentary by experts in adjacent fields. The thread will be a living
document, with new articles added over time, organized through an open
slack channel (#circuits in the
[Distill slack](http://slack.distill.pub/)). Content in this
thread should be seen as early stage exploratory research.
Articles and comments are presented below in chronological order:
Does it make sense to treat individual neurons and the connections
between them as a serious object of study? This essay proposes three
claims which, if true, might justify serious inquiry into them: the
existence of meaningful features, the existence of meaningful circuits
between features, and the universality of those features and circuits.
It also discusses historical successes of science “zooming in,” whether
we should be concerned about this research being qualitative, and
approaches to rigorous investigation.
[Read Full Article](https://distill.pub/2020/circuits/zoom-in/)
An overview of all the neurons in the first five layers of
InceptionV1, organized into a taxonomy of “neuron groups.” This
article set
... (truncated, 8 KB total)
Resource ID: ad268b74cee64b6f | Stable ID: M2NkZjFjYm