Longterm Wiki

Mechanistic Interpretability

Interpretability · active

Reverse-engineering neural networks to identify circuits, features, and algorithms that explain behavior at a mechanistic level.

Organizations: 5
Key Papers: 2
Grants: 28
Total Funding: $1.6M
First Proposed: 2020 (Olah et al., OpenAI)
Cluster: Interpretability
Parent Area: Interpretability

Tags

interpretability · neural-network-analysis · circuits

Organizations (2)

| Organization | Role |
| --- | --- |
| Anthropic | pioneer |
| Google DeepMind | active |

Grants (28)

| Name | Recipient | Amount | Funder | Date |
| --- | --- | --- | --- | --- |
| Practical AI Alignment and Interpretability Research Group — Interpretability Work | Practical AI Alignment and Interpretability Research Group | $737K | Coefficient Giving | 2024-09 |
| Northeastern University — Mechanistic Interpretability Research | Northeastern University | $72K | Coefficient Giving | 2023-09 |
| This grant is for Nathaniel Monson to spend 6 months studying to transition to AI alignment research, with a focus on methods for mechanistic interpretability and resolving polysemanticity. | Nathaniel Monson | $70K | Long-Term Future Fund (LTFF) | 2023-04 |
| 4-month stipend for MATS extension on mechanistic interpretability benchmark + 2-month stipend for career switch | Iván Arcuschin Moreno | $67K | Long-Term Future Fund (LTFF) | 2024-01 |
| WhiteBox Research: 1.9 FTE for 9 months to pilot a training program in Manila focused on Mechanistic Interpretability | Brian Tan | $61K | Long-Term Future Fund (LTFF) | 2023-07 |
| 4-months stipend to apply mechanistic interpretability to a real-world application, hallucinations | Javier Ferrando Monsonís and Oscar Balcells Obeso | $60K | Long-Term Future Fund (LTFF) | 2024-07 |
| 4-6 month salary to do circuit-based mech interp on Mamba, as part of the MATS extension program | Danielle Ensign | $60K | Long-Term Future Fund (LTFF) | 2024-01 |
| 6-month salary to build & enhance open-source mechanistic interpretability tooling for AI safety researchers | Bryce Meyer | $50K | Long-Term Future Fund (LTFF) | 2023-04 |
| This grant is funding a 6-month stipend for Bilal Chughtai to work on a mechanistic interpretability project | Bilal Chughtai | $48K | Long-Term Future Fund (LTFF) | 2023-01 |
| Exploring the feasibility of circuit-style analysis on the level of SAE features (MATS extension) | Lucy Farnik | $41K | Long-Term Future Fund (LTFF) | 2024-01 |
| 6 month salary to work on mech interp research with mentorship from Prof David Bau | Bilal Chughtai | $41K | Long-Term Future Fund (LTFF) | 2023-07 |
| 6-month stipend for Sparse Autoencoder Mech Interp projects | Logan Smith | $40K | Long-Term Future Fund (LTFF) | 2024-01 |
| 6-month salary and compute budget for continuing work on mechanistic interpretability for attention layers | Keith Wynroe | $37K | Long-Term Future Fund (LTFF) | 2024-07 |
| This grant is funding for a 6-month stipend for Sviatoslav Chalnev to work on independent interpretability research, specifically mechanistic interpretability and open-source tooling for interpretability research. | Sviatoslav Chalnev | $35K | Long-Term Future Fund (LTFF) | 2023-04 |
| Sincxpress Education — Mechanistic Interpretability Course Development | Sincxpress Education | $30K | Coefficient Giving | 2025-03 |
| Guaranteed Safe AI Seminars 2026 | Orpheus Lummis | $30K | Manifund | 2025-10-16 |
| 6-month salary for an AISC project and continuing independent mechanistic interpretability projects | Christopher Mathwin | $28K | Long-Term Future Fund (LTFF) | 2023-04 |
| Stipend for a master's thesis and paper on technical alignment research: mechanistic interpretability of attention | Matthias Dellago | $25K | Long-Term Future Fund (LTFF) | 2023-04 |
| Automatic circuit discovery on sparse autoencoded features | Can Rager | $25K | Manifund | 2023-12-30 |
| 4-month stipend to bring to completion a mechanistic interpretability research project on how neural networks perform co | Stanford University | $22K | Long-Term Future Fund (LTFF) | 2024-07 |
| 5-month funding to continue upskilling in mechanistic interpretability post-SERI MATs, and to continue open projects | Keith Wynroe | $22K | Long-Term Future Fund (LTFF) | 2023-07 |
| WhiteBox Research: Training Exclusively for Mechanistic Interpretability | Brian Tan | $12K | Manifund | 2023-08-20 |
| Mechanistic Interpretability research for unfaithful chain-of-thought (1 month) | Iván Arcuschin Moreno | $11K | Manifund | 2024-11-19 |
| Mapping neuroscience and mechanistic interpretability | Zhonghao He | $6K | Manifund | 2023-12-09 |
| 2 month funding to get into mechanistic interpretability and to do 2-3 projects, than learn briefly related fields | Krzysztof Gwiazda | $5K | Long-Term Future Fund (LTFF) | 2024-07 |
| Discovering latent goals (mechanistic interpretability PhD salary) | Lucy Farnik | $1.6K | Manifund | 2023-07-10 |
| The First Workshop on Mechanistic Interpretability for Vision | Tamar Rott Shaham | $1.5K | Manifund | 2025-05-28 |
| Enabling a Student to Present his Mechanistic Interpretability Work at NeurIPS | Chris Wendler | $1.1K | Manifund | 2025-10-10 |

Funding by Funder

| Funder | Grants | Total Amount |
| --- | --- | --- |
| Coefficient Giving | 3 | $839K |
| Long-Term Future Fund (LTFF) | 17 | $713K |
| Manifund | 8 | $89K |

Key Papers & Resources (2)


Sub-Areas (3)

| Name | Description | Status | Orgs | Papers |
| --- | --- | --- | --- | --- |
| Finding Feature Representations | Research beyond SAEs into alternative methods for identifying latent features in model activations. | emerging | 2 | 2 |
| Sparse Autoencoders | Using sparse dictionary learning to decompose neural network activations into interpretable features, enabling monosemanticity research. | active | 2 | 3 |
| Toy Models for Interpretability | Small simplified model proxies that capture key deep learning dynamics for interpretability research. | active | 2 | 2 |
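To make the Sparse Autoencoders sub-area concrete, here is a minimal sketch of an SAE forward pass: an activation vector is encoded into an overcomplete set of features (most of which are zero), then reconstructed from the few active ones. All dimensions, parameter names, and the random initialization below are illustrative assumptions, not any particular paper's or library's setup; a real SAE learns its weights by minimizing reconstruction error plus a sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16   # width of the model activation being decomposed (assumed)
d_sae = 64     # overcomplete dictionary: more features than dimensions

# Randomly initialized encoder/decoder weights stand in for learned ones.
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
b_enc = -0.5 * np.ones(d_sae)   # negative bias pushes most features to zero
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)

def sae_forward(x):
    """Encode an activation into sparse features, then reconstruct it."""
    f = np.maximum(0.0, W_enc @ x + b_enc)  # ReLU: only a few features fire
    x_hat = W_dec @ f                        # reconstruction from active features
    return f, x_hat

x = rng.normal(size=d_model)                 # stand-in for a model activation
f, x_hat = sae_forward(x)
print("active features:", int((f > 0).sum()), "of", d_sae)
print("reconstruction shape:", x_hat.shape)
```

Interpretability work then inspects which inputs make each feature fire, treating the decoder columns of active features as the directions that compose the activation.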