Scaling and evaluating sparse autoencoders | OpenAI

web

OpenAI·cdn.openai.com/papers/sparse-autoencoders.pdf

Credibility Rating

4/5

High (4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

Metadata

Cited by 2 pages

Page	Type	Quality
Mechanistic Interpretability	Research Area	59.0
Sparse Autoencoders (SAEs)	Approach	91.0

Cached Content Preview

HTTP 200 · Fetched May 10, 2026 · 85 KB

Scaling and evaluating sparse autoencoders

Leo Gao∗, Tom Dupré la Tour†, Henk Tillman†, Gabriel Goh, Rajan Troll, Ilya Sutskever, Alec Radford, Jan Leike, Jeffrey Wu†

OpenAI

Abstract

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a
sparse bottleneck layer. Since language models learn many concepts, autoencoders
need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and
sparsity objectives and the presence of dead latents. We propose using k-sparse
autoencoders \[Makhzani and Frey, 2013\] to directly control sparsity, simplifying
tuning and improving the reconstruction-sparsity frontier. Additionally, we find
modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size
and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation
patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we
train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens.
We release code and autoencoders for open-source models, as well as a visualizer.
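
To make the idea of directly controlling sparsity concrete, here is a minimal sketch of a k-sparse (TopK) autoencoder forward pass in plain Python. It assumes a linear encoder and decoder with a shared pre-bias, which this excerpt does not specify; the names `topk_mask`, `topk_autoencoder`, `W_enc`, `W_dec`, and `b_pre` are hypothetical and chosen only for illustration.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def topk_mask(values, k):
    """Keep the k largest entries of `values` and zero out the rest."""
    keep = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    out = [0.0] * len(values)
    for i in keep:
        out[i] = values[i]
    return out


def topk_autoencoder(x, W_enc, W_dec, b_pre, k):
    """Assumed form: z = TopK(W_enc (x - b_pre)), x_hat = W_dec z + b_pre.

    W_enc has shape (n_latents, d) and W_dec has shape (d, n_latents).
    """
    centered = [xi - bi for xi, bi in zip(x, b_pre)]
    pre_acts = matvec(W_enc, centered)   # one pre-activation per latent
    latents = topk_mask(pre_acts, k)     # at most k nonzero latents
    recon = [r + b for r, b in zip(matvec(W_dec, latents), b_pre)]
    return latents, recon


def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
```

Because the number of active latents is fixed at k by construction, a sketch like this needs no separate sparsity penalty to tune against the reconstruction loss, which is the simplification the abstract describes.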

Sparse autoencoders (SAEs) have shown great promise for finding features \[Cunningham et al., 2023,
Bricken et al., 2023, Templeton et al., 2024, Goh, 2016\] and circuits \[Marks et al., 2024\] in language
models. Unfortunately, they are difficult to train due to their extreme sparsity, so prior work has
primarily focused on training relatively small sparse autoencoders on small language models.

Because improving reconstruction and sparsity is not the ultimate objective of sparse autoencoders, we
also explore better methods for quantifying autoencoder quality. We study quantities corresponding
to: whether certain hypothesized features were recovered, whether downstream effects are sparse,
and whether features can be explained with both high precision and recall.

Figure 1: Scaling laws for TopK autoencoders trained on GPT-4 activations. (Left) Optimal loss for a
fixed compute budget. (Right) Joint scaling law of loss at convergence with fixed number of total
latents n and fixed sparsity (number of active latents) k. Details in Section 3.
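
The precision and recall criterion for explanations mentioned above can be illustrated with a small sketch. It assumes an explanation is treated as a binary predictor of whether a latent is active on each token; this excerpt does not describe how the paper actually computes these quantities, and the function name `precision_recall` and the example data are hypothetical.

```python
def precision_recall(predicted, actual):
    """Score an explanation's per-token predictions against latent activity.

    predicted[i] is truthy if the explanation says the latent fires on token i;
    actual[i] is truthy if the latent really is active on token i.
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    # Report 0.0 when a ratio is undefined (no predictions or no activations).
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical data: the explanation fires on tokens 0, 1, 4;
# the latent is actually active on tokens 0, 2, 4.
precision, recall = precision_recall([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```

An explanation that is too broad tends to score high recall but low precision, and one that is too narrow does the reverse, which is why requiring both is a stricter test than either alone.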

Our contributions:

1. In Section 2, we describe a state-of-the-art recipe for training sparse autoencoders.
2. In Section 3, we demonstrate clean scaling laws and scale to large numbers of latents.
3. In Section 4, we introduce metrics of latent quality and find larger sparse autoencoders are

... (truncated, 85 KB total)

Resource ID: 1175fcbd0a4ff6a3 | Stable ID: sid_tHNdK3BzXQ