Anthropic's dictionary learning work
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Transformer Circuits
Landmark Anthropic paper (May 2024) demonstrating that dictionary learning/sparse autoencoders scale to production-grade LLMs, a key milestone for mechanistic interpretability as a practical AI safety tool.
Metadata
Summary
Anthropic researchers demonstrate that sparse autoencoders (dictionary learning) can successfully extract high-quality, interpretable monosemantic features from Claude 3 Sonnet, a large production AI model. The extracted features are highly abstract, multilingual, multimodal, and include safety-relevant features related to deception, sycophancy, bias, and dangerous content. This scales up earlier work on one-layer transformers to demonstrate practical interpretability for frontier models.
Key Points
- Sparse autoencoders successfully extract interpretable monosemantic features from Claude 3 Sonnet, scaling prior work from small to frontier models.
- Features found are abstract, multilingual, and multimodal, responding to concepts across languages and modalities.
- Safety-relevant features identified include deception/power-seeking, sycophancy, bias, security vulnerabilities, and dangerous content generation.
- Features both respond to and causally influence model behavior, enabling potential mechanistic understanding of model internals.
- Authors caution that identifying safety-relevant features (e.g., knowledge of deception) does not imply the model acts on them in deployment.
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI Accident Risk Cruxes | Crux | 67.0 |
| Anthropic | Organization | 74.0 |
| AI Alignment | Approach | 91.0 |
Cached Content Preview
[Transformer Circuits Thread](https://transformer-circuits.pub/)
# Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

### Authors
Adly Templeton\*, Tom Conerly\*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan
### Affiliations
[Anthropic](https://www.anthropic.com/)
### Published
May 21, 2024
\* Core Contributor; Correspondence to [henighan@anthropic.com](mailto:henighan@anthropic.com); [Author contributions statement below](https://transformer-circuits.pub/2024/scaling-monosemanticity/#appendix-author-contributions).
Eight months ago, we [demonstrated](https://transformer-circuits.pub/2023/monosemantic-features/index.html) that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet,[1] Anthropic's medium-sized production model.

[1] For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of this paper. It is the finetuned model, not the base pretrained model (although our method also works on the base model).
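To make the dictionary-learning setup concrete, here is a minimal sketch of a sparse autoencoder of the kind described: a wide ReLU encoder over model activations plus a linear decoder, trained with reconstruction loss and an L1 sparsity penalty. All dimensions, initialisations, and coefficients below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 16, 64  # dictionary is overcomplete: many more features than dimensions

# Randomly initialised parameters (illustrative only; the paper trains these at scale).
W_enc = rng.normal(0.0, 0.1, (d_hidden, d_model))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0.0, 0.1, (d_model, d_hidden))
b_dec = np.zeros(d_model)

def encode(x):
    """Feature activations: ReLU keeps them non-negative and (after training) sparse."""
    return np.maximum(0.0, x @ W_enc.T + b_enc)

def decode(f):
    """Reconstruct the activation as a weighted sum of dictionary directions."""
    return f @ W_dec.T + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)   # reconstruction error
    sparsity = np.mean(np.abs(f))       # L1 penalty pushes most features to zero
    return recon + l1_coeff * sparsity

x = rng.normal(size=(8, d_model))       # stand-in for residual-stream activations
f = encode(x)
print(f.shape, float(sae_loss(x)))
```

After training, each column of `W_dec` is interpreted as one feature's direction in activation space, and `encode(x)` gives how strongly each feature fires on a given input.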
We find a diversity of highly abstract features. They both respond to abstract concepts and, when activated, causally influence the model's behavior. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).
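The causal claim corresponds to "feature steering": adding a feature's decoder direction to the model's activations and observing the behavioral change. A minimal sketch of the arithmetic, with the activation and direction vectors as stand-in assumptions rather than real model internals:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16

activation = rng.normal(size=d_model)         # stand-in residual-stream activation
feature_direction = rng.normal(size=d_model)  # stand-in decoder column for one feature
feature_direction /= np.linalg.norm(feature_direction)

def steer(act, direction, strength):
    """Shift an activation along a (unit-norm) feature direction by `strength`."""
    return act + strength * direction

steered = steer(activation, feature_direction, strength=5.0)
# Projection of the change onto the feature direction equals the chosen strength.
print(np.dot(steered - activation, feature_direction))  # ≈ 5.0
```

In the paper's experiments the analogous intervention is applied inside the forward pass, so the shifted activation changes the tokens the model actually produces.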

Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to [security vulnerabilities and backdoors in code](https://transformer-circuits.pub/2024/scaling-monosemanticity/#safety-relevant-code); [bias](htt
... (truncated, 98 KB total)