Longterm Wiki

Credibility Rating

2/5 (Mixed)

Mixed quality. Some useful content but inconsistent editorial standards. Claims should be verified.

Rating inherited from publication venue: Medium

A notable 2025 signal from a major AI lab that SAEs may not justify their complexity for practical safety tasks like harmful-intent detection, potentially redirecting community research priorities away from SAE-centric interpretability approaches.

Metadata

Importance: 62/100 · blog post · primary source

Summary

DeepMind's mechanistic interpretability team reports that sparse autoencoders (SAEs) underperformed simpler linear probes on out-of-distribution detection of harmful intent in user prompts. Based on these negative results and parallel work, the team has decided to deprioritize fundamental SAE research. The post also highlights that linear probes are cheap, effective alternatives for this downstream safety task.

Key Points

  • SAEs were benchmarked against linear probes for OOD generalization in detecting harmful user intent, and SAEs consistently underperformed.
  • Linear probes proved surprisingly strong and cost-effective baselines, challenging assumptions about SAEs' practical utility for downstream tasks.
  • DeepMind's mechanistic interpretability team is deprioritizing fundamental SAE research and redirecting efforts to other directions.
  • The post shares informal negative results not suited for a paper, contributing to a culture of publishing null findings in AI safety research.
  • Full technical details are in an accompanying Alignment Forum post; the blog serves as an accessible summary.
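The "cheap, effective" linear probes referenced above are just linear classifiers trained on a model's internal activations. A minimal sketch of the idea, using synthetic data in place of real layer activations (all names and dimensions here, such as `d_model`, are illustrative assumptions, not details from the post):

```python
# Sketch of a linear probe for harmful-intent detection, assuming
# activations have already been extracted from some layer of a model.
# The data below is synthetic: "harmful" examples are shifted along a
# fixed direction to mimic a linearly separable signal in activation space.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64   # hypothetical activation width
n_train = 500

direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n_train)           # 1 = "harmful intent"
acts = rng.normal(size=(n_train, d_model)) + 4.0 * labels[:, None] * direction

# Fit the probe with plain gradient descent on the logistic loss.
w = np.zeros(d_model)
b = 0.0
lr = 0.1
for _ in range(200):
    logits = acts @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    grad = probs - labels                            # dL/dlogits
    w -= lr * (acts.T @ grad) / n_train
    b -= lr * grad.mean()

preds = (acts @ w + b) > 0
accuracy = (preds == labels).mean()
```

The probe is a single weight vector and bias, so training and inference cost almost nothing next to an SAE; the post's finding is that on out-of-distribution harmful-intent detection this simple baseline held up better than SAE-based approaches.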

Cited by 3 pages

| Page | Type | Quality |
| --- | --- | --- |
| Interpretability | Research Area | 66.0 |
| Mechanistic Interpretability | Research Area | 59.0 |
| Sparse Autoencoders (SAEs) | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 36 KB

# Negative Results for Sparse Autoencoders On Downstream Tasks and Deprioritising SAE Research (Mechanistic Interpretability Team Progress Update)


[DeepMind Safety Research](https://deepmindsafetyresearch.medium.com/?source=post_page---byline--6cadcfc125b9---------------------------------------)

9 min read · Mar 26, 2025

_Lewis Smith\*, Sen Rajamanoharan\*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda_

_\* = equal contribution_

The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders were useful for downstream tasks, notably out-of-distribution probing.

This blo

... (truncated, 36 KB total)
Resource ID: 244c1b93ef0a083c | Stable ID: YjRiOWIyNj