Sparse Autoencoders
Interpretability · active
Using sparse dictionary learning to decompose neural network activations into interpretable features, enabling monosemanticity research.
Organizations: 2
Key Papers: 3
Grants: 9
Total Funding: $687K
First Proposed: 2023 (Cunningham et al.; Anthropic)
Cluster: Interpretability
Parent Area: Mechanistic Interpretability
Tags: interpretability, features, dictionary-learning
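As a rough illustration of the decomposition named in the summary above, the sketch below runs one forward pass of a toy sparse autoencoder in plain Python. It is a minimal sketch under stated assumptions, not the method of any listed grantee or paper: the function and parameter names (`sae_forward`, `sae_loss`, `W_enc`, `W_dec`, `l1_coeff`) are hypothetical, the activation is assumed to be ReLU, the decoder bias is assumed to be subtracted before encoding, and sparsity is assumed to be encouraged by a plain L1 penalty on the feature activations.

```python
def relu(v):
    # Element-wise ReLU keeps feature activations non-negative.
    return [max(0.0, a) for a in v]


def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v.
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation vector into features, then reconstruct it.

    Assumed shapes: W_enc is n_features x d_model, W_dec is d_model x n_features.
    """
    centered = [a - b for a, b in zip(x, b_dec)]           # assumption: subtract decoder bias
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    features = relu(pre)                                   # feature activations
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff=1e-3):
    # Squared reconstruction error plus an L1 sparsity penalty (assumed form).
    mse = sum((a - r) ** 2 for a, r in zip(x, recon))
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


# Hand-built example: 2-dimensional activation, 3 dictionary features.
x = [1.0, 2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon)
```

In this hand-built case the third feature stays inactive and the input is reconstructed from the other two. In practice the weights are learned by minimizing the loss over many activation vectors, and the dictionary is usually much larger than the activation dimension.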
Grants (9)
| Name | Recipient | Amount | Funder | Date |
|---|---|---|---|---|
| UC Berkeley — Study on Frontier Model Behavior | University of California, Berkeley | $500K | Coefficient Giving | 2025-06 |
| Exploring the feasibility of circuit-style analysis on the level of SAE features (MATS extension) | Lucy Farnik | $41K | Long-Term Future Fund (LTFF) | 2024-01 |
| 6 month salary for further pursuing sparse autoencoders for automatic feature finding | Logan Smith | $40K | Long-Term Future Fund (LTFF) | 2023-07 |
| 6-month stipend for Sparse Autoencoder Mech Interp projects | Logan Smith | $40K | Long-Term Future Fund (LTFF) | 2024-01 |
| 6 month stipend for SAE-circuits | Logan Smith | $40K | Long-Term Future Fund (LTFF) | 2024-07 |
| Understanding SAE features using Sparse Feature Circuits | Lovis Heindrich | $11K | Manifund | 2024-06-28 |
| Exploring feature interactions in transformer LLMs through sparse autoencoders | Kunvar Thaman | $8.5K | Manifund | 2023-12-01 |
| Train great open-source sparse autoencoders | Tom McGrath | $4K | Manifund | 2024-05-09 |
| Neuronpedia - Open Interpretability Platform | Johnny Lin | $2.5K | Manifund | 2023-07-26 |
Funding by Funder
| Funder | Grants | Total Amount |
|---|---|---|
| Coefficient Giving | 1 | $500K |
| Long-Term Future Fund (LTFF) | 4 | $161K |
| Manifund | 4 | $26K |
Key Papers & Resources (1)
SEMINAL
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Bricken et al. (Anthropic), 2023