Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: Alignment Forum
A critical editorial by Stephen Casper (scasper) assessing Anthropic's 2024 scaling monosemanticity SAE paper; notable for articulating the 'safety washing' concern about high-profile interpretability research that lacks concrete safety applications.
Metadata
Summary
Stephen Casper evaluates Anthropic's May 2024 sparse autoencoder (SAE) paper against 10 prior predictions, concluding that it clearly accomplished the first two, merited at most partial credit on the feature-identification predictions, and clearly fell short on predictions 4, 5, 7, 8, 9, and 10. He argues Anthropic's interpretability research may be better characterized as 'safety washing' than practical safety work, as it demonstrates capability without solving concrete safety problems.
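For context on the object of critique: the SAEs at issue are dictionary-learning models trained to reconstruct a language model's internal activations through a sparse bottleneck, so that individual learned features are (ideally) human-interpretable. A minimal sketch in PyTorch, with illustrative names and hyperparameters rather than Anthropic's actual architecture:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Learns an overcomplete, sparse 'feature' dictionary over model activations."""

    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # non-negative, mostly-zero activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Trade off reconstruction fidelity against an L1 sparsity penalty;
    # the penalty is what pushes each input to activate only a few features.
    mse = (reconstruction - x).pow(2).mean()
    return mse + l1_coeff * features.abs().mean()
```

Casper's critique is not about this mechanism per se but about whether the features such models recover translate into competitive safety interventions.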
Key Points
- Author scored Anthropic's SAE paper -0.74 against his predictions, indicating systematic underperformance relative to expected safety-relevant outputs.
- Paper found 'safety-relevant features' but failed to competitively identify and remove harmful behaviors actually represented in training data.
- Raises the concern that high-profile mechanistic interpretability research prioritizes demonstrating technical elegance over solving actionable safety problems.
- Part of the 'Engineer's Interpretability Sequence,' a recurring critical examination of interpretability research from a practical safety standpoint.
- Distinguishes between proofs-of-concept for useful tasks (achieved) and real-world deployment of interpretability for harm mitigation (not achieved).
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Easy | Argument | 53.0 |
Cached Content Preview
# [EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024](https://www.alignmentforum.org/posts/pH6tyhEnngqWAXi9i/eis-xiii-reflections-on-anthropic-s-sae-research-circa-may)
by [scasper](https://www.alignmentforum.org/users/scasper?from=post_header)
21st May 2024
4 min read
Part 13 of 12 in the [Engineer’s Interpretability Sequence](https://www.alignmentforum.org/s/a6ne2ve5uturEEQK7).
# TL;DR
On May 5, 2024, [I made a set of 10 predictions](https://x.com/StephenLCasper/status/1787270794017702045) about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. [Today’s new SAE paper from Anthropic](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html) was full of brilliant experiments and interesting insights, but it underperformed my expectations. I am beginning to be concerned that Anthropic’s recent approach to interpretability research might be better explained by safety washing than practical safety work.
Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates.
# Reflecting on predictions
[See my original post](https://x.com/StephenLCasper/status/1787270794017702045) for 10 specific predictions about what [today’s paper](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html) would and wouldn’t accomplish. I think that Anthropic obviously did 1 and 2 and obviously did not do 4, 5, 7, 8, 9, and 10. Meanwhile, I think that their experiments to identify [specific](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#searching) and [safety-relevant](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant) features should count f
... (truncated, 23 KB total)