
Sparse Autoencoders

Interpretability · Active

Sparse autoencoders use sparse dictionary learning to decompose neural network activations into interpretable features, supporting monosemanticity research.
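For readers unfamiliar with the technique, the sketch below illustrates the basic recipe: an overcomplete linear encoder/decoder trained to reconstruct activations under an L1 sparsity penalty. This is a minimal illustration, not code from any paper cited on this page; the layer sizes, L1 coefficient, and random stand-in activations are arbitrary assumptions.

```python
# Minimal sparse-autoencoder sketch. Sizes, the L1 coefficient,
# and the random "activations" are illustrative assumptions only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict is much larger than d_model.
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative,
        # which the L1 penalty then pushes toward sparsity.
        f = torch.relu(self.encoder(x))
        return self.decoder(f), f

sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # trades reconstruction fidelity against sparsity

acts = torch.randn(64, 512)  # stand-in for captured model activations
recon, feats = sae(acts)
loss = nn.functional.mse_loss(recon, acts) + l1_coeff * feats.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the activations come from a chosen layer of the model under study, and each learned dictionary direction is then inspected as a candidate interpretable feature.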

Organizations: 2
Key Papers: 3
Grants: 9
Total Funding: $687K
First Proposed: 2023 (Cunningham et al.; Anthropic)
Cluster: Interpretability

Tags: interpretability, features, dictionary-learning

Grants (9)

| Name | Recipient | Amount | Funder | Date |
| --- | --- | --- | --- | --- |
| UC Berkeley — Study on Frontier Model Behavior | University of California, Berkeley | $500K | Coefficient Giving | 2025-06 |
| Exploring the feasibility of circuit-style analysis on the level of SAE features (MATS extension) | Lucy Farnik | $41K | Long-Term Future Fund (LTFF) | 2024-01 |
| 6-month salary for further pursuing sparse autoencoders for automatic feature finding | Logan Smith | $40K | Long-Term Future Fund (LTFF) | 2023-07 |
| 6-month stipend for Sparse Autoencoder Mech Interp projects | Logan Smith | $40K | Long-Term Future Fund (LTFF) | 2024-01 |
| 6-month stipend for SAE-circuits | Logan Smith | $40K | Long-Term Future Fund (LTFF) | 2024-07 |
| Understanding SAE features using Sparse Feature Circuits | Lovis Heindrich | $11K | Manifund | 2024-06-28 |
| Exploring feature interactions in transformer LLMs through sparse autoencoders | Kunvar Thaman | $8.5K | Manifund | 2023-12-01 |
| Train great open-source sparse autoencoders | Tom McGrath | $4K | Manifund | 2024-05-09 |
| Neuronpedia - Open Interpretability Platform | Johnny Lin | $2.5K | Manifund | 2023-07-26 |

Funding by Funder

| Funder | Grants | Total Amount |
| --- | --- | --- |
| Coefficient Giving | 1 | $500K |
| Long-Term Future Fund (LTFF) | 4 | $161K |
| Manifund | 4 | $26K |

Key Papers & Resources (1)