Transcoders Beat Sparse Autoencoders for Interpretability
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Relevant to researchers using SAEs for mechanistic interpretability; challenges the dominance of SAEs as the go-to tool for understanding MLP computations in transformers, proposing transcoders as a superior alternative for circuit analysis.
Paper Details
Metadata
Abstract
Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose skip transcoders, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.
Summary
This paper argues that transcoders—which decompose MLP computations into interpretable features by mapping layer inputs to outputs—outperform sparse autoencoders (SAEs) for mechanistic interpretability tasks. The authors demonstrate that transcoders better capture the functional role of neurons in circuits, enabling cleaner circuit analysis. They also introduce skip transcoders, which add an affine skip connection to the architecture and achieve lower reconstruction loss with no cost to interpretability. The work suggests transcoders should be preferred over SAEs when the goal is understanding how computations are performed rather than just representing activations.
Key Points
- Transcoders map MLP-layer inputs to outputs through sparse, interpretable features, capturing the functional computation rather than just activation patterns.
- Empirical comparisons of transcoders and SAEs trained on the same model and data find transcoder features significantly more interpretable, supporting cleaner and more faithful circuit analysis.
- SAEs reconstruct activations and may miss the computational role of features; transcoders directly model the input-output function of MLP blocks.
- The paper provides evidence that the choice of interpretability tool significantly affects downstream circuit-analysis quality.
- Results suggest the field should revisit the assumption that SAEs are the default tool for mechanistic interpretability of transformer MLPs.
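The contrast between the two objectives can be sketched in a few lines. This is a minimal numpy illustration, not the authors' code: the shapes, the TopK sparsity rule, and the random "MLP outputs" are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n = 16, 64, 8

# Hypothetical MLP inputs x and outputs y = mlp(x) from a frozen model.
x = rng.normal(size=(n, d_model))
y = rng.normal(size=(n, d_model))

W_enc = rng.normal(size=(d_model, d_latent)) * 0.1
W_dec = rng.normal(size=(d_latent, d_model)) * 0.1

def encode(a, k=4):
    """Sparse latent code: keep only the top-k ReLU pre-activations per example."""
    z = np.maximum(a @ W_enc, 0.0)
    thresh = np.sort(z, axis=1)[:, -k][:, None]
    return np.where(z >= thresh, z, 0.0)

# SAE objective: reconstruct the *input* activations from the sparse code.
sae_loss = np.mean((encode(x) @ W_dec - x) ** 2)

# Transcoder objective: predict the MLP's *output* from the same input.
tc_loss = np.mean((encode(x) @ W_dec - y) ** 2)
```

The only difference is the regression target: activations themselves (SAE) versus the component's output (transcoder), which is why transcoder features reflect what the MLP computes rather than what it merely represents.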
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Interpretability | Research Area | 66.0 |
Cached Content Preview
[License: CC BY 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)
arXiv:2501.18823v1 \[cs.LG\] 31 Jan 2025
# Transcoders Beat Sparse Autoencoders for Interpretability
Gonçalo Paulo
Stepan Shabalin
Nora Belrose
###### Abstract
Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose _skip transcoders_, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.
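A skip transcoder, as described above, routes the input through both a sparse feature path and an affine skip path before summing them. The following is a hedged numpy sketch; all parameter names, shapes, and the ReLU sparsification are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent = 16, 64

# Illustrative parameters; a real skip transcoder would train these
# to predict MLP outputs from MLP inputs under a sparsity constraint.
W_enc = rng.normal(size=(d_model, d_latent)) * 0.1
b_enc = np.zeros(d_latent)
W_dec = rng.normal(size=(d_latent, d_model)) * 0.1
W_skip = rng.normal(size=(d_model, d_model)) * 0.1  # affine skip connection
b_out = np.zeros(d_model)

def skip_transcoder(x):
    """Predicted MLP output: sparse feature path plus affine skip path."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)   # sparse latent features
    return z @ W_dec + x @ W_skip + b_out    # decoder output + skip term

x = rng.normal(size=(4, d_model))
y_hat = skip_transcoder(x)
```

The skip term lets the affine part of the input-output map bypass the sparse bottleneck, which is consistent with the abstract's claim of lower reconstruction loss at no interpretability cost.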
Machine Learning, ICML
## 1 Introduction
Recently, large language models have achieved human-level reasoning performance in many tasks (Guo et al., [2025](https://arxiv.org/html/2501.18823v1#bib.bib22 "")). Interpretability aims to improve the safety and reliability of these systems by understanding their internal mechanisms and representations. While early research attempted to produce natural language explanations of individual neurons (Olah et al., [2020](https://arxiv.org/html/2501.18823v1#bib.bib33 ""); Gurnee et al., [2023](https://arxiv.org/html/2501.18823v1#bib.bib23 ""), [2024](https://arxiv.org/html/2501.18823v1#bib.bib24 "")), it is now widely recognized that most neurons are “polysemantic”, activating in semantically diverse contexts (Arora et al., [2018](https://arxiv.org/html/2501.18823v1#bib.bib1 ""); Elhage et al., [2022](https://arxiv.org/html/2501.18823v1#bib.bib17 "")).
Sparse autoencoders (SAEs) have emerged as a promising tool for partially overcoming polysemanticity, by decomposing activations into interpretable features (Bricken et al., [2023a](https://arxiv.org/html/2501.18823v1#bib.bib8 ""); Templeton et al., [2024b](https://arxiv.org/html/2501.18823v1#bib.bib39 ""); Gao et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib20 "")). SAEs are single hidden layer neural networks trained with the objective of reconstructing activations with a sparsity penalty (Bricken et al., [2023a](https://arxiv.org/html/2501.18823v1#bib.bib8 ""); Rajamanoharan et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib36 "")), sparsity constraint (Gao et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib20 ""); Bussmann et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib11 "")), or an information bottleneck (Ayonrinde et al., [2024](https://ar
... (truncated, 41 KB total)