Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: Alignment Forum
A critical editorial by Stephen Casper (scasper) assessing Anthropic's 2024 scaling monosemanticity SAE paper; notable for articulating the 'safety washing' concern about high-profile interpretability research that lacks concrete safety applications.
Metadata
Summary
Stephen Casper evaluates Anthropic's May 2024 sparse autoencoder (SAE) paper against 10 prior predictions, concluding that it clearly accomplished the first two, merited at most partial credit on the feature-identification predictions, and clearly fell short on predictions 4, 5, 7, 8, 9, and 10. He argues Anthropic's interpretability research may be better characterized as 'safety washing' than practical safety work, as it demonstrates capability without solving concrete safety problems.
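For context on the object of critique: the SAEs at issue are dictionary-learning models trained to reconstruct a language model's internal activations through a sparse bottleneck, so that individual learned features are (ideally) human-interpretable. A minimal sketch in PyTorch, with illustrative names and hyperparameters rather than Anthropic's actual architecture:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Learns an overcomplete, sparse 'feature' dictionary over model activations."""

    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # non-negative, mostly-zero activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Trade off reconstruction fidelity against an L1 sparsity penalty;
    # the penalty is what pushes each input to activate only a few features.
    mse = (reconstruction - x).pow(2).mean()
    return mse + l1_coeff * features.abs().mean()
```

Casper's critique is not about this mechanism per se but about whether the features such models recover translate into competitive safety interventions.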
Key Points
- Author scored Anthropic's SAE paper -0.74 against his predictions, indicating systematic underperformance relative to expected safety-relevant outputs.
- Paper found 'safety-relevant features' but failed to competitively identify and remove harmful behaviors actually represented in training data.
- Raises the concern that high-profile mechanistic interpretability research prioritizes demonstrating technical elegance over solving actionable safety problems.
- Part of the 'Engineer's Interpretability Sequence,' a recurring critical examination of interpretability research from a practical safety standpoint.
- Distinguishes between proofs-of-concept for useful tasks (achieved) and real-world deployment of interpretability for harm mitigation (not achieved).
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Easy | Argument | 53.0 |
Cached Content Preview
# [EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024](https://www.alignmentforum.org/posts/pH6tyhEnngqWAXi9i/eis-xiii-reflections-on-anthropic-s-sae-research-circa-may)
by [scasper](https://www.alignmentforum.org/users/scasper?from=post_header)
21st May 2024
4 min read
Part 13 of 12 in the [Engineer’s Interpretability Sequence](https://www.alignmentforum.org/s/a6ne2ve5uturEEQK7).
# TL;DR
On May 5, 2024, [I made a set of 10 predictions](https://x.com/StephenLCasper/status/1787270794017702045) about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. [Today’s new SAE paper from Anthropic](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html) was full of brilliant experiments and interesting insights, but it underperformed my expectations. I am beginning to be concerned that Anthropic’s recent approach to interpretability research might be better explained by safety washing than practical safety work.
Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates.
# Reflecting on predictions
[See my original post](https://x.com/StephenLCasper/status/1787270794017702045) for 10 specific predictions about what [today’s paper](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html) would and wouldn’t accomplish. I think that Anthropic obviously did 1 and 2 and obviously did not do 4, 5, 7, 8, 9, and 10. Meanwhile, I think that their experiments to identify [specific](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#searching) and [safety-relevant](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant) features should count f
... (truncated, 23 KB total)