Mechanistic Interpretability
Interpretability
Status: active
Reverse-engineering neural networks to identify circuits, features, and algorithms that explain behavior at a mechanistic level.
Organizations: 5
Key Papers: 2
Grants: 28
Total Funding: $1.6M
Tags: interpretability, neural-network-analysis, circuits
Organizations (2)
| Organization | Role |
|---|---|
| Anthropic | pioneer |
| Google DeepMind | active |
Grants (28)
| Name | Recipient | Amount | Funder | Date |
|---|---|---|---|---|
| Practical AI Alignment and Interpretability Research Group — Interpretability Work | Practical AI Alignment and Interpretability Research Group | $737K | Coefficient Giving | 2024-09 |
| Northeastern University — Mechanistic Interpretability Research | Northeastern University | $72K | Coefficient Giving | 2023-09 |
| This grant is for Nathaniel Monson to spend 6 months studying to transition to AI alignment research, with a focus on methods for mechanistic interpretability and resolving polysemanticity. | Nathaniel Monson | $70K | Long-Term Future Fund (LTFF) | 2023-04 |
| 4-month stipend for MATS extension on mechanistic interpretability benchmark + 2-month stipend for career switch | Iván Arcuschin Moreno | $67K | Long-Term Future Fund (LTFF) | 2024-01 |
| WhiteBox Research: 1.9 FTE for 9 months to pilot a training program in Manila focused on Mechanistic Interpretability | Brian Tan | $61K | Long-Term Future Fund (LTFF) | 2023-07 |
| 4-months stipend to apply mechanistic interpretability to a real-world application, hallucinations | Javier Ferrando Monsonís and Oscar Balcells Obeso | $60K | Long-Term Future Fund (LTFF) | 2024-07 |
| 4-6 month salary to do circuit-based mech interp on Mamba, as part of the MATS extension program | Danielle Ensign | $60K | Long-Term Future Fund (LTFF) | 2024-01 |
| 6-month salary to build & enhance open-source mechanistic interpretability tooling for AI safety researchers | Bryce Meyer | $50K | Long-Term Future Fund (LTFF) | 2023-04 |
| This grant is funding a 6-month stipend for Bilal Chughtai to work on a mechanistic interpretability project | Bilal Chughtai | $48K | Long-Term Future Fund (LTFF) | 2023-01 |
| Exploring the feasibility of circuit-style analysis on the level of SAE features (MATS extension) | Lucy Farnik | $41K | Long-Term Future Fund (LTFF) | 2024-01 |
| 6 month salary to work on mech interp research with mentorship from Prof David Bau | Bilal Chughtai | $41K | Long-Term Future Fund (LTFF) | 2023-07 |
| 6-month stipend for Sparse Autoencoder Mech Interp projects | Logan Smith | $40K | Long-Term Future Fund (LTFF) | 2024-01 |
| 6-month salary and compute budget for continuing work on mechanistic interpretability for attention layers | Keith Wynroe | $37K | Long-Term Future Fund (LTFF) | 2024-07 |
| This grant is funding for a 6-month stipend for Sviatoslav Chalnev to work on independent interpretability research, specifically mechanistic interpretability and open-source tooling for interpretability research. | Sviatoslav Chalnev | $35K | Long-Term Future Fund (LTFF) | 2023-04 |
| Sincxpress Education — Mechanistic Interpretability Course Development | Sincxpress Education | $30K | Coefficient Giving | 2025-03 |
| Guaranteed Safe AI Seminars 2026 | Orpheus Lummis | $30K | Manifund | 2025-10-16 |
| 6-month salary for an AISC project and continuing independent mechanistic interpretability projects | Christopher Mathwin | $28K | Long-Term Future Fund (LTFF) | 2023-04 |
| Stipend for a master’s thesis and paper on technical alignment research: mechanistic interpretability of attention | Matthias Dellago | $25K | Long-Term Future Fund (LTFF) | 2023-04 |
| Automatic circuit discovery on sparse autoencoded features | Can Rager | $25K | Manifund | 2023-12-30 |
| 4-month stipend to bring to completion a mechanistic interpretability research project on how neural networks perform co | Stanford University | $22K | Long-Term Future Fund (LTFF) | 2024-07 |
| 5-month funding to continue upskilling in mechanistic interpretability post-SERI MATS, and to continue open projects | Keith Wynroe | $22K | Long-Term Future Fund (LTFF) | 2023-07 |
| WhiteBox Research: Training Exclusively for Mechanistic Interpretability | Brian Tan | $12K | Manifund | 2023-08-20 |
| Mechanistic Interpretability research for unfaithful chain-of-thought (1 month) | Iván Arcuschin Moreno | $11K | Manifund | 2024-11-19 |
| Mapping neuroscience and mechanistic interpretability | Zhonghao He | $6K | Manifund | 2023-12-09 |
| 2-month funding to get into mechanistic interpretability and to do 2-3 projects, then briefly learn related fields | Krzysztof Gwiazda | $5K | Long-Term Future Fund (LTFF) | 2024-07 |
| Discovering latent goals (mechanistic interpretability PhD salary) | Lucy Farnik | $1.6K | Manifund | 2023-07-10 |
| The First Workshop on Mechanistic Interpretability for Vision | Tamar Rott Shaham | $1.5K | Manifund | 2025-05-28 |
| Enabling a Student to Present his Mechanistic Interpretability Work at NeurIPS | Chris Wendler | $1.1K | Manifund | 2025-10-10 |
Funding by Funder
| Funder | Grants | Total Amount |
|---|---|---|
| Coefficient Giving | 3 | $839K |
| Long-Term Future Fund (LTFF) | 17 | $713K |
| Manifund | 8 | $89K |
Key Papers & Resources (2)
SEMINAL
Zoom In: An Introduction to Circuits
Olah et al., 2020
SEMINAL
Scaling Monosemanticity
Anthropic, 2024
Sub-Areas (3)
| Name | Status | Orgs | Papers |
|---|---|---|---|
| Finding Feature Representations: Research beyond SAEs into alternative methods for identifying latent features in model activations. | emerging | 2 | 2 |
| Sparse Autoencoders: Using sparse dictionary learning to decompose neural network activations into interpretable features, enabling monosemanticity research. | active | 2 | 3 |
| Toy Models for Interpretability: Small simplified model proxies that capture key deep learning dynamics for interpretability research. | active | 2 | 2 |