Longterm Wiki

Transformer Circuits Thread

paper

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Transformer Circuits

This is the canonical landing page for Anthropic's mechanistic interpretability research program; it serves as an index to all Transformer Circuits papers and updates and is essential reading for anyone studying AI internals for safety purposes.

Metadata

Importance: 88/100 · blog post · homepage

Summary

The Transformer Circuits Thread is Anthropic's primary publication hub for mechanistic interpretability research on large language models. It hosts foundational and ongoing research aimed at understanding the internal workings of transformer models, including work on circuits, features, sparse autoencoders, and attribution graphs. The thread represents a sustained research program toward making AI systems more understandable and safer.

Key Points

  • Central repository for Anthropic's mechanistic interpretability research, including landmark papers on circuits, superposition, and sparse autoencoders.
  • Hosts both major papers (e.g., 'Biology of a Large Language Model') and shorter monthly updates on incremental findings.
  • Research covers attention mechanisms, feature geometry, introspection in LLMs, transcoders, and attribution graphs.
  • Directly motivated by AI safety: the team aims to understand model internals to better plan for safe AI development.
  • Publishes in accessible formats, including toy models, cross-posts with the Anthropic alignment blog, and research updates.
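To make the "sparse autoencoder" idea mentioned above concrete, here is a minimal sketch, not Anthropic's implementation: all names, dimensions, and coefficients are illustrative. An SAE maps model activations into an overcomplete, non-negative feature basis and is trained to reconstruct the activations under an L1 sparsity penalty, so that each activation is explained by a few interpretable features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: real SAEs use much larger, learned weights.
d_model, d_dict = 16, 64          # activation width, dictionary size
W_enc = rng.normal(0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode activations into sparse features, then reconstruct.

    Returns the reconstruction, the feature activations, and the
    training loss (reconstruction error plus L1 sparsity penalty).
    """
    f = np.maximum(0.0, x @ W_enc + b_enc)        # ReLU feature activations
    x_hat = f @ W_dec + b_dec                     # linear reconstruction
    recon_loss = np.mean((x - x_hat) ** 2)
    sparsity_loss = l1_coeff * np.abs(f).sum(axis=-1).mean()
    return x_hat, f, recon_loss + sparsity_loss

# A batch of fake "residual stream" activations.
x = rng.normal(size=(8, d_model))
x_hat, f, loss = sae_forward(x)
```

In the published work the encoder and decoder are trained by gradient descent over millions of activations; this sketch only shows the forward pass and loss that such training would minimize.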

Cited by 4 pages

| Page | Type | Quality |
| --- | --- | --- |
| Anthropic | Organization | 74.0 |
| Chris Olah | Person | 27.0 |
| Mechanistic Interpretability | Research Area | 59.0 |
| Probing / Linear Probes | Approach | 55.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 12 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)

# Anthropic’s Interpretability Research

A surprising fact about modern large language models is that nobody really knows how they work internally. The Interpretability team strives to change that — to understand these models to better plan for a future of safe AI.

December 2025

[**Circuits Cross-Post — Activation Oracles**](https://alignment.anthropic.com/2025/activation-oracles)
We train language models to answer questions about their own activations in natural language.

November 2025

[**Circuits Updates — November 2025**](https://transformer-circuits.pub/2025/november-update/index.html)
A short update on harm pressure.

October 2025

[![](https://transformer-circuits.pub/images/introspection.png)**Emergent Introspective Awareness in Large Language Models**](https://transformer-circuits.pub/2025/introspection/index.html)
Lindsey, 2025
We find evidence that language models can introspect on their internal states.

[**Circuits Updates — October 2025**](https://transformer-circuits.pub/2025/october-update/index.html)
Small updates on visual features and dictionary initialization.

[![](https://transformer-circuits.pub/images/linebreaks.png)**When Models Manipulate Manifolds: The Geometry of a Counting Task**](https://transformer-circuits.pub/2025/linebreaks/index.html)
Gurnee et al., 2025
We find geometric structure underlying the mechanisms of a fundamental language model behavior.

September 2025

[**Circuits Updates — September 2025**](https://transformer-circuits.pub/2025/september-update/index.html)
A small update on features and in-context learning.

August 2025

[**Circuits Updates — August 2025**](https://transformer-circuits.pub/2025/august-update/index.html)
A small update: How does a persona modify the assistant’s response?

July 2025

[**A Toy Model of Mechanistic (Un)Faithfulness**](https://transformer-circuits.pub/2025/faithfulness-toy-model/index.html)
When transcoders go awry.

[![](https://transformer-circuits.pub/images/attention-qk.png)**Tracing Attention Computation Through Feature Interactions**](https://transformer-circuits.pub/2025/attention-qk/index.html)
Kamath et al., 2025
We describe and apply a method to explain attention patterns in terms of feature interactions, and integrate this information into attribution graphs.

[**A Toy Model of Interference Weights**](https://transformer-circuits.pub/2025/interference-weights/index.html)
Unpacking "interference weights" in some more depth.

[**Sparse mixtures of linear transforms**](https://transformer-circuits.pub/2025/bulk-update/index.html)
We investigate sparse mixture of linear transforms (MOLT), a new approach to transcoders.

[**Circuits Updates — July 2025**
A collection of small updates: revisiting A Mathematical Framework and applications of interpretability to biology.](https://transformer-circuits.pub/

... (truncated, 12 KB total)