Longterm Wiki

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Transformer Circuits

Foundational Anthropic paper (2023) that established sparse autoencoders as a leading technique in mechanistic interpretability; directly precedes the 'Scaling Monosemanticity' work and is widely cited as a breakthrough in understanding neural network internals.

Metadata

Importance: 92/100 · blog post · primary source

Summary

This Anthropic paper applies sparse autoencoders to decompose a one-layer transformer's MLP activations into interpretable, monosemantic features—where each feature corresponds to a single human-understandable concept. The work demonstrates that polysemanticity (neurons responding to multiple unrelated concepts) is a solvable problem, providing a scalable method for extracting meaningful internal representations from neural networks.

Key Points

  • Identifies 'polysemanticity' as a core obstacle to neural network interpretability: individual neurons respond to multiple unrelated concepts due to superposition.
  • Uses sparse dictionary learning (sparse autoencoders) to extract monosemantic features from a one-layer transformer MLP, recovering interpretable units.
  • Discovered tens of thousands of interpretable features including concepts like DNA, legal language, and Arabic script from a small model.
  • Provides evidence that features—not neurons—are the fundamental unit of neural network computation relevant to interpretability.
  • Offers a scalable, unsupervised methodology that could extend to larger, more capable models as a foundation for mechanistic interpretability.
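The method the Key Points describe — sparse dictionary learning via an autoencoder over MLP activations — can be sketched in a few lines. This is a minimal NumPy illustration of the setup, not Anthropic's actual implementation; all sizes, names, and the L1 coefficient are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: decompose a small MLP layer into an overcomplete dictionary.
d_mlp, d_feat = 64, 256

W_enc = rng.normal(0.0, 0.1, (d_mlp, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0.0, 0.1, (d_feat, d_mlp))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary vectors
b_dec = np.zeros(d_mlp)

def encode(x):
    """Non-negative feature activations; ReLU plus the L1 term below keeps them sparse."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def decode(f):
    """Reconstruct activations as a sparse linear combination of dictionary vectors."""
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    mse = ((decode(f) - x) ** 2).mean()        # reconstruction error
    l1 = np.abs(f).sum(axis=-1).mean()         # sparsity penalty on feature activations
    return mse + l1_coeff * l1

# A batch of stand-in MLP activations.
x = rng.normal(size=(8, d_mlp))
f = encode(x)
```

Training would minimize `sae_loss` by gradient descent; the learned rows of `W_dec` are then the candidate interpretable features.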

Cited by 1 page

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 98 KB
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Transformer Circuits Thread

 Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
 Browse A/1 Features → 
 Browse All Features → 

 

 
 
 
Authors

Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Chris Olah

Affiliations

Anthropic

Published

Oct 4, 2023

* Core Contributor; Correspondence to colah@anthropic.com; Author contributions statement below.
 Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze. 

Unfortunately, the most natural computational unit of the neural network – the neuron itself – turns out not to be a natural unit for human understanding. This is because many neurons are polysemantic: they respond to mixtures of seemingly unrelated inputs. In the vision model Inception v1, a single neuron responds to faces of cats and fronts of cars. In a small language model we discuss in this paper, a single neuron responds to a mixture of academic citations, English dialogue, HTTP requests, and Korean text. Polysemanticity makes it difficult to reason about the behavior of the network in terms of the activity of individual neurons.

One potential cause of polysemanticity is superposition, a hypothesized phenomenon where a neural network represents more independent "features" of the data than it has neurons by assigning each feature its own linear combination of neurons. If we view each feature as a vector over the neurons, then the set of features forms an overcomplete linear basis for the activations of the network neurons. In our previous paper on Toy Models of Superposition, we showed that superposition can arise naturally during the course of neural network training if the features useful to a model are sparse in the training data. As in compressed sensing, sparsity allows a model to disambiguate which combination of features produced any given activation vector. For more discussion of this point, see Distributed Representations: Composition and Superposition.
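The superposition story above — more features than neurons, each feature a direction over the neurons, sparse feature activations stored as a linear combination — can be made concrete with a toy example. The sizes and random directions here are invented for illustration and have no relation to the paper's models.

```python
import numpy as np

rng = np.random.default_rng(1)

n_features, n_neurons = 20, 5  # more "features" than neurons

# Each feature gets its own unit direction over the neurons: together they
# form an overcomplete basis for the neuron activation space.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Sparsity: on any given input, only a few features are active.
feature_acts = np.zeros(n_features)
active = rng.choice(n_features, size=2, replace=False)
feature_acts[active] = 1.0

# What the network actually stores: a superposed neuron activation vector.
neuron_acts = feature_acts @ directions

# Any single neuron mixes contributions from many feature directions,
# which is why reading neurons individually looks polysemantic.
n_overlapping = int(np.count_nonzero(np.abs(directions[:, 0]) > 0.1))
```

Because most feature directions have a nonzero component on any given neuron, reading one neuron conflates many features; dictionary learning instead aims to recover the sparse `feature_acts` from the stored `neuron_acts`.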

... (truncated, 98 KB total)
Resource ID: 0946f0572a487914 | Stable ID: NmE3MzlkZW