Towards Monosemanticity
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Transformer Circuits
Foundational Anthropic paper (2023) that established sparse autoencoders as a leading technique in mechanistic interpretability; directly precedes the 'Scaling Monosemanticity' work and is widely cited as a breakthrough in understanding neural network internals.
Metadata
Summary
This Anthropic paper applies sparse autoencoders to decompose a one-layer transformer's MLP activations into interpretable, monosemantic features—where each feature corresponds to a single human-understandable concept. The work demonstrates that polysemanticity (neurons responding to multiple unrelated concepts) is a solvable problem, providing a scalable method for extracting meaningful internal representations from neural networks.
Key Points
- Identifies 'polysemanticity' as a core obstacle to neural network interpretability: individual neurons respond to multiple unrelated concepts due to superposition.
- Uses sparse dictionary learning (sparse autoencoders) to extract monosemantic features from a one-layer transformer MLP, recovering interpretable units.
- Discovers tens of thousands of interpretable features in a small model, including concepts like DNA sequences, legal language, and Arabic script.
- Provides evidence that features—not neurons—are the fundamental unit of neural network computation relevant to interpretability.
- Offers a scalable, unsupervised methodology that could extend to larger, more capable models as a foundation for mechanistic interpretability.
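The dictionary-learning setup the key points describe can be sketched in a few lines. The following is a minimal, untrained NumPy illustration of a sparse autoencoder over MLP activations; the dimensions, initialization, and L1 coefficient are illustrative choices, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical values, not the paper's settings):
d_mlp = 16    # width of the MLP activation vector being decomposed
d_dict = 64   # number of dictionary features (overcomplete: d_dict > d_mlp)

# Sparse autoencoder parameters, randomly initialized (untrained sketch):
W_enc = rng.normal(0.0, 0.1, (d_dict, d_mlp))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0.0, 0.1, (d_mlp, d_dict))
b_dec = np.zeros(d_mlp)

def sae_forward(x, l1_coeff=1e-3):
    """One forward pass: sparse feature activations, reconstruction, loss."""
    f = np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)  # ReLU keeps features sparse/nonneg
    x_hat = W_dec @ f + b_dec                         # reconstruct the activation
    # Reconstruction error plus an L1 sparsity penalty on the feature activations:
    loss = np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))
    return f, x_hat, loss

x = rng.normal(size=d_mlp)  # stand-in for one MLP activation vector
f, x_hat, loss = sae_forward(x)
print(f.shape, x_hat.shape)  # (64,) (16,)
```

Training would minimize this loss over a large dataset of activations; the L1 term pushes most feature activations to zero, so each input is explained by a small set of dictionary features.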
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Is Interpretability Sufficient for Safety? | Crux | 49.0 |
Cached Content Preview
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Transformer Circuits Thread
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
Authors
Trenton Bricken*,
Adly Templeton*,
Joshua Batson*,
Brian Chen*,
Adam Jermyn*,
Tom Conerly,
Nicholas L Turner,
Cem Anil,
Carson Denison,
Amanda Askell,
Robert Lasenby,
Yifan Wu,
Shauna Kravec,
Nicholas Schiefer,
Tim Maxwell,
Nicholas Joseph,
Alex Tamkin,
Karina Nguyen,
Brayden McLean,
Josiah E Burke,
Tristan Hume,
Shan Carter,
Tom Henighan,
Chris Olah
Affiliations
Anthropic
Published
Oct 4, 2023
* Core Contributor;
Correspondence to colah@anthropic.com;
Author contributions statement below.
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
Unfortunately, the most natural computational unit of the neural network – the neuron itself – turns out not to be a natural unit for human understanding. This is because many neurons are polysemantic: they respond to mixtures of seemingly unrelated inputs. In the vision model Inception v1, a single neuron responds to faces of cats and fronts of cars. In a small language model we discuss in this paper, a single neuron responds to a mixture of academic citations, English dialogue, HTTP requests, and Korean text. Polysemanticity makes it difficult to reason about the behavior of the network in terms of the activity of individual neurons.
One potential cause of polysemanticity is superposition, a hypothesized phenomenon where a neural network represents more independent "features" of the data than it has neurons by assigning each feature its own linear combination of neurons. If we view each feature as a vector over the neurons, then the set of features forms an overcomplete linear basis for the activations of the network neurons. In our previous paper on Toy Models of Superposition, we showed that superposition can arise naturally during the course of neural network training if the set of features useful to a model are sparse in the training data. As in compressed sensing, sparsity allows a model to disambiguate which combination of features produced any given activation vector. For more discussion of this point, see Distributed Representations: Composition and Superposition.
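The overcomplete-basis picture above can be made concrete with a small NumPy sketch. The feature count, neuron count, and random directions here are illustrative assumptions, not the paper's toy-model setup: more features than neurons, each feature a direction in neuron space, and only a few features active at once.

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons = 4    # the network only has 4 neurons...
n_features = 10  # ...but represents 10 features (illustrative numbers)

# Each feature gets its own direction in neuron space; with
# n_features > n_neurons these directions form an overcomplete basis.
feature_dirs = rng.normal(size=(n_features, n_neurons))
feature_dirs /= np.linalg.norm(feature_dirs, axis=1, keepdims=True)

# A sparse input: only 2 of the 10 features are active.
feature_acts = np.zeros(n_features)
feature_acts[[3, 7]] = [1.0, 0.5]

# Observed neuron activations are a sparse linear combination of directions.
neuron_acts = feature_acts @ feature_dirs
print(neuron_acts.shape)  # (4,)

# Naively reading each feature back out by projection mixes in interference
# from the other features, since the directions are not orthogonal —
# this is why individual neurons look polysemantic.
readout = feature_dirs @ neuron_acts
```

Because the directions interfere, simple projections cannot cleanly separate the features; sparsity is what makes recovery possible in principle, and dictionary learning is the tool the paper uses to exploit it.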
... (truncated, 98 KB total)