Anthropic's dictionary learning work
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Transformer Circuits
Landmark Anthropic paper (May 2024) demonstrating that dictionary learning/sparse autoencoders scale to production-grade LLMs, a key milestone for mechanistic interpretability as a practical AI safety tool.
Metadata
Summary
Anthropic researchers demonstrate that sparse autoencoders (dictionary learning) can successfully extract high-quality, interpretable monosemantic features from Claude 3 Sonnet, a large production AI model. The extracted features are highly abstract, multilingual, multimodal, and include safety-relevant features related to deception, sycophancy, bias, and dangerous content. This scales up earlier work on one-layer transformers to demonstrate practical interpretability for frontier models.
Key Points
- Sparse autoencoders successfully extract interpretable monosemantic features from Claude 3 Sonnet, scaling prior work from small to frontier models.
- Features found are abstract, multilingual, and multimodal, responding to concepts across languages and modalities.
- Safety-relevant features identified include deception/power-seeking, sycophancy, bias, security vulnerabilities, and dangerous content generation.
- Features both respond to and causally influence model behavior, enabling potential mechanistic understanding of model internals.
- Authors caution that identifying safety-relevant features (e.g., knowledge of deception) does not imply the model acts on them in deployment.
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI Accident Risk Cruxes | Crux | 67.0 |
| Anthropic | Organization | 74.0 |
| AI Alignment | Approach | 91.0 |
Cached Content Preview
[Transformer Circuits Thread](https://transformer-circuits.pub/)
# Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

### Authors
Adly Templeton\*, Tom Conerly\*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan
### Affiliations
[Anthropic](https://www.anthropic.com/)
### Published
May 21, 2024
\* Core Contributor; Correspondence to [henighan@anthropic.com](mailto:henighan@anthropic.com); [Author contributions statement below](https://transformer-circuits.pub/2024/scaling-monosemanticity/#appendix-author-contributions).
Eight months ago, we [demonstrated](https://transformer-circuits.pub/2023/monosemantic-features/index.html) that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet,[1] Anthropic's medium-sized production model.

[1] For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of this paper. It is the finetuned model, not the base pretrained model (although our method also works on the base model).
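To make the dictionary-learning setup concrete, here is a minimal sketch of a sparse autoencoder of the kind described: a wide ReLU encoder over model activations plus a linear decoder, trained with reconstruction loss and an L1 sparsity penalty. All dimensions, initialisations, and coefficients below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 16, 64  # dictionary is overcomplete: many more features than dimensions

# Randomly initialised parameters (illustrative only; the paper trains these at scale).
W_enc = rng.normal(0.0, 0.1, (d_hidden, d_model))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0.0, 0.1, (d_model, d_hidden))
b_dec = np.zeros(d_model)

def encode(x):
    """Feature activations: ReLU keeps them non-negative and (after training) sparse."""
    return np.maximum(0.0, x @ W_enc.T + b_enc)

def decode(f):
    """Reconstruct the activation as a weighted sum of dictionary directions."""
    return f @ W_dec.T + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)   # reconstruction error
    sparsity = np.mean(np.abs(f))       # L1 penalty pushes most features to zero
    return recon + l1_coeff * sparsity

x = rng.normal(size=(8, d_model))       # stand-in for residual-stream activations
f = encode(x)
print(f.shape, float(sae_loss(x)))
```

After training, each column of `W_dec` is interpreted as one feature's direction in activation space, and `encode(x)` gives how strongly each feature fires on a given input.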
We find a diversity of highly abstract features. They both respond to abstract concepts and, when activated, causally influence the model's behavior. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).
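The causal claim corresponds to "feature steering": adding a feature's decoder direction to the model's activations and observing the behavioral change. A minimal sketch of the arithmetic, with the activation and direction vectors as stand-in assumptions rather than real model internals:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16

activation = rng.normal(size=d_model)         # stand-in residual-stream activation
feature_direction = rng.normal(size=d_model)  # stand-in decoder column for one feature
feature_direction /= np.linalg.norm(feature_direction)

def steer(act, direction, strength):
    """Shift an activation along a (unit-norm) feature direction by `strength`."""
    return act + strength * direction

steered = steer(activation, feature_direction, strength=5.0)
# Projection of the change onto the feature direction equals the chosen strength.
print(np.dot(steered - activation, feature_direction))  # ≈ 5.0
```

In the paper's experiments the analogous intervention is applied inside the forward pass, so the shifted activation changes the tokens the model actually produces.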

Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to [security vulnerabilities and backdoors in code](https://transformer-circuits.pub/2024/scaling-monosemanticity/#safety-relevant-code); [bias](htt
... (truncated, 98 KB total)