Longterm Wiki

Sparse Autoencoders Find Highly Interpretable Features in Language Models

paper

Authors

Hoagy Cunningham·Aidan Ewart·Logan Riggs·Robert Huben·Lee Sharkey

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Addresses neural network interpretability by using sparse autoencoders to decompose polysemantic neurons into interpretable features, directly advancing AI safety research on understanding and controlling model internals.

Paper Details

Citations: 21 (89 influential)
Year: 2024

Metadata

arXiv preprint · primary source

Abstract

One of the roadblocks to a better understanding of neural networks' internals is *polysemanticity*, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is *superposition*, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., 2022) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.

Summary

This paper addresses polysemanticity in neural networks—where individual neurons activate across multiple unrelated contexts—by proposing sparse autoencoders to identify interpretable features in language models. The authors hypothesize that polysemanticity arises from superposition, where networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space. Their sparse autoencoder recovers monosemantic (single-meaning) features that are more interpretable than directions found by alternative decomposition methods, and demonstrates causal interpretability by identifying which features drive specific model behaviors on the indirect object identification task. This scalable, unsupervised method offers a foundation for mechanistic interpretability research and improved model transparency.
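The decomposition described above can be sketched in a few lines: a sparse autoencoder encodes a model activation into a wide, non-negative feature vector, reconstructs the activation from it, and is trained on reconstruction error plus an L1 sparsity penalty. This is a minimal illustrative sketch, not the paper's implementation; the layer sizes, sparsity coefficient, and random weights are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_feats = 16, 64   # hypothetical sizes: dictionary is 4x overcomplete
lam = 1e-3                  # illustrative L1 sparsity coefficient

# Randomly initialised weights stand in for a trained encoder/decoder.
W_enc = rng.normal(0.0, 0.1, (d_feats, d_model))
b_enc = np.zeros(d_feats)
W_dec = rng.normal(0.0, 0.1, (d_model, d_feats))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into sparse features, then reconstruct it."""
    f = np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)  # ReLU gives sparse, non-negative codes
    x_hat = W_dec @ f + b_dec                         # linear decoder: sum of feature directions
    return f, x_hat

def sae_loss(x):
    """Reconstruction error plus L1 penalty that pushes most features to zero."""
    f, x_hat = sae_forward(x)
    return np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(f))

x = rng.normal(size=d_model)  # stand-in for a residual-stream activation
f, x_hat = sae_forward(x)
```

In training, the loss would be minimized over many cached activations by gradient descent; the overcomplete feature basis (here 64 features for 16 dimensions) is what allows the autoencoder to pull superposed features apart into separate directions.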

Cited by 3 pages

| Page | Type | Quality |
| --- | --- | --- |
| AI-Assisted Alignment | Approach | 63.0 |
| Interpretability | Research Area | 66.0 |
| Sparse Autoencoders (SAEs) | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 71 KB
# Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham∗1,2,  Aidan Ewart∗1,3,  Logan Riggs∗1,  Robert Huben,  Lee Sharkey4

1EleutherAI, 2MATS, 3Bristol AI Safety Centre, 4Apollo Research

{hoagycunningham, aidanprattewart, logansmith5}@gmail.com

∗Equal contribution

###### Abstract

One of the roadblocks to a better understanding of neural networks’ internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., [2022](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib26 "")) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
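The abstract's claim about pinpointing causally responsible features is usually tested by ablation: zero out one learned feature at a time before decoding and measure how much the output changes. The sketch below illustrates that loop; the weights and feature vector are random placeholders, and it uses reconstruction distance as a stand-in effect metric, whereas the paper measures effects on task behaviour (e.g. the indirect-object-identification logit difference).

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_feats = 16, 64

# Placeholder decoder and sparse feature activations (a real run would take
# these from a trained SAE applied to one forward pass of the model).
W_dec = rng.normal(0.0, 0.1, (d_model, d_feats))
f = np.maximum(0.0, rng.normal(size=d_feats))

def reconstruct(features):
    # Decode sparse features back into an activation vector.
    return W_dec @ features

baseline = reconstruct(f)

# Ablate each feature in turn and record how far the output moves.
effects = []
for i in range(d_feats):
    f_ablated = f.copy()
    f_ablated[i] = 0.0
    effects.append(float(np.linalg.norm(baseline - reconstruct(f_ablated))))

top = int(np.argmax(effects))  # feature whose removal perturbs the output most
```

Ranking features by this ablation effect is one simple way to localize a behaviour to a handful of feature directions rather than to whole neurons.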

## 1 Introduction

Advances in artificial intelligence (AI) have resulted in the development of highly capable AI systems that make decisions for reasons we do not understand. This has caused concern that AI systems that we cannot trust are being widely deployed in the economy and in our lives, introducing a number of novel risks (Hendrycks et al., [2023](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib15 "")), including potential future risks that AIs might deceive humans in order to accomplish undesirable goals (Ngo et al., [2022](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib19 "")). Mechanistic interpretability seeks to mitigate such risks through understanding how neural networks calculate their outputs, allowing us to reverse engineer parts of their internal processes and make targeted changes to them (Cammarata et al., [2021](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib3 ""); Wang et al., [2022](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib26 ""); Elhage et al., [2021](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib7 "")).

Footnote: Code to replicate experiments can be found at [https://github.com/HoagyC/sparse\_cod

... (truncated, 71 KB total)
Resource ID: 8aae7b9df41d1455 | Stable ID: ODM0YmQ5NW