Longterm Wiki

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Transformer Circuits

This is a July 2025 progress update from Anthropic's Transformer Circuits research thread, a leading mechanistic interpretability research program; best read alongside earlier foundational papers from the same series.

Metadata

Importance: 55/100 · working paper · news

Summary

A research update from Anthropic's Transformer Circuits team summarizing recent progress in mechanistic interpretability, including advances in sparse autoencoders, feature analysis, and circuit-level understanding of transformer models. The update likely covers new findings on how features and circuits interact in large language models.

Key Points

  • Provides periodic research updates from Anthropic's mechanistic interpretability team working on transformer circuits
  • Covers ongoing work on sparse autoencoders (SAEs) as tools for decomposing neural network representations into interpretable features
  • Updates on circuit-level analysis tracking how information flows and transforms through attention and MLP layers
  • Part of a series of incremental research communications from the transformer-circuits.pub research thread
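The sparse-autoencoder (SAE) approach mentioned above can be illustrated with a minimal sketch: an overcomplete dictionary of feature directions, a ReLU encoder producing sparse activations, and a linear decoder reconstructing the model activation. All shapes and hyperparameters here are illustrative assumptions, not values from the update.

```python
import numpy as np

# Minimal SAE sketch (illustrative shapes, not from the update).
rng = np.random.default_rng(0)
d_model, d_feats = 32, 128  # overcomplete: more features than dimensions

W_enc = rng.normal(scale=0.1, size=(d_model, d_feats))
b_enc = np.zeros(d_feats)
W_dec = rng.normal(scale=0.1, size=(d_feats, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    # Encoder: ReLU yields sparse, non-negative feature activations.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Decoder: reconstruct the activation as a sparse combination of
    # learned feature directions (rows of W_dec).
    x_hat = f @ W_dec + b_dec
    return f, x_hat

x = rng.normal(size=d_model)
f, x_hat = sae_forward(x)
# Training (not shown) would minimize ||x - x_hat||^2 + lam * ||f||_1,
# trading reconstruction fidelity against sparsity.
recon_loss = np.sum((x - x_hat) ** 2)
l1_penalty = np.sum(np.abs(f))
```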

Cited by 3 pages

| Page | Type | Quality |
| --- | --- | --- |
| Dense Transformers | Concept | 58.0 |
| Interpretability | Research Area | 66.0 |
| Mechanistic Interpretability | Research Area | 59.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 30 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)

# Circuits Updates - July 2025

### Published

_Not published yet._

### DOI

_No DOI yet._

We report a number of developing ideas from the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research on which we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.

## New Posts

- [Revisiting A Mathematical Framework with the Language of Features](https://transformer-circuits.pub/2025/july-update/index.html#math)
- [Applications of Interpretability to Biology](https://transformer-circuits.pub/2025/july-update/index.html#bio)

* * *

## [Revisiting A Mathematical Framework with the Language of Features](https://transformer-circuits.pub/2025/july-update/index.html#math)

Chris Olah; edited by Adam Jermyn

When we wrote [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html), we had no way to extract features from superposition. As a result, many ways of thinking about transformers which might most naturally be described in terms of features were instead described in terms of the eigenvalues of different matrix multiplies. Here we revisit some of these ideas in the language of features, and hopefully make the work clearer as a result.

Every attention head can be understood in terms of two matrices: the OV circuit (which describes what information the attention head reads and writes) and the QK circuit (which describes where to attend). We can describe both of these matrices in terms of features.
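The two-matrix view can be made concrete: given per-head weight matrices, the QK circuit is the bilinear form scoring (destination, source) pairs, and the OV circuit is the map from what the head reads at the attended position to what it writes into the residual stream. The dimensions and random weights below are placeholder assumptions for illustration, following the factorization in A Mathematical Framework.

```python
import numpy as np

# Hypothetical dimensions for a single attention head (illustrative only).
d_model, d_head = 16, 4
rng = np.random.default_rng(0)

# Per-head weight matrices.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# QK circuit: a bilinear form on residual-stream vectors that
# determines where the head attends.
QK = W_Q @ W_K.T  # (d_model, d_model)

# OV circuit: determines what the head reads from the attended-to
# position and writes to the destination residual stream.
OV = W_V @ W_O    # (d_model, d_model)

# Attention score between a destination vector and a source vector:
x_dst = rng.normal(size=d_model)
x_src = rng.normal(size=d_model)
score = x_dst @ QK @ x_src

# Information written if the head attends fully to x_src:
written = x_src @ OV
```

Note that the low-rank factorizations (rank at most d_head) mean both circuit matrices are highly structured even though they live in d_model × d_model.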

In describing these, we'll make heavy use of transformed sets of features. For example, we might have a feature X which detects some property, a feature prev(X) which detects that X was present at the previous token, and a feature say(X) which causes the model to produce an output that triggers X. These transformed features have a 1:1 correspondence with the corresponding original feature, and can be thought of as that feature with some relation applied. (In practice, we think the directions corresponding to these original features will often be a linearly transformed version of the original features. For motivation around this, see [Wattenberg & Viegas 2024](https://arxiv.org/pdf/2407.14662) on matrix binding and "echo" features.)
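Under the linearly-transformed-direction picture sketched above, a whole family of transformed features like prev(·) would correspond to one fixed linear map applied to each original feature direction. The following sketch, with hypothetical dimensions and directions, shows the consequence that matters: the 1:1 correspondence is linear, so it respects combinations of features.

```python
import numpy as np

# Hypothetical feature directions and a hypothetical linear map
# implementing the prev(.) relation (an assumption for illustration,
# in the spirit of Wattenberg & Viegas's matrix-binding picture).
rng = np.random.default_rng(1)
d = 8

X1 = rng.normal(size=d)  # direction for feature X1
X2 = rng.normal(size=d)  # direction for feature X2

T_prev = rng.normal(size=(d, d))  # one map shared by the whole family

prev_X1 = T_prev @ X1
prev_X2 = T_prev @ X2

# Because the correspondence is a single linear map, it preserves
# linear combinations of feature directions.
combo = T_prev @ (0.5 * X1 + 2.0 * X2)
assert np.allclose(combo, 0.5 * prev_X1 + 2.0 * prev_X2)
```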

#### Copy Heads

To start, let's consider the OV circuit of a [copy head](https://transformer-circuits.pub/2021/framework/index.html#copying--primitive-in-context-learning). We expect it to lo

... (truncated, 30 KB total)