Circuit Tracing Research
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Transformer Circuits
Published by Anthropic on transformer-circuits.pub in 2025, this paper is a key empirical contribution to mechanistic interpretability research, extending circuit analysis techniques to production-scale language models using attribution graphs and SAE features.
Metadata
Summary
This Anthropic research paper presents attribution graphs as a method for mechanistic interpretability, tracing computational circuits within Claude to understand how specific behaviors and reasoning patterns emerge. It applies sparse autoencoder features and circuit-level analysis to reveal the internal 'biology' of a large language model, providing empirical case studies of how information flows through transformer components during various tasks.
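To make the idea concrete, here is a toy, hypothetical sketch (not code from the paper) of the data structure an attribution graph amounts to: a weighted directed graph whose nodes are features (or output logits) and whose signed edge weights estimate how much one feature's activation contributes to another's input. The `AttributionGraph` class, the `(layer, index)` node naming, and all weights below are illustrative assumptions.

```python
# Hypothetical sketch of an attribution graph: feature nodes connected by
# signed edge weights that approximate upstream-to-downstream influence.
from collections import defaultdict

class AttributionGraph:
    def __init__(self):
        # edges[src][dst] = signed attribution weight
        self.edges = defaultdict(dict)

    def add_edge(self, src, dst, weight):
        self.edges[src][dst] = weight

    def top_contributors(self, dst, k=3):
        """Rank upstream features by |attribution| into `dst`."""
        incoming = [(src, w[dst]) for src, w in self.edges.items() if dst in w]
        return sorted(incoming, key=lambda sw: abs(sw[1]), reverse=True)[:k]

# Toy usage: nodes named by (layer, feature index); weights are made up.
g = AttributionGraph()
g.add_edge(("L3", 17), ("L7", 4), 0.82)            # strong positive contribution
g.add_edge(("L5", 2),  ("L7", 4), -0.31)           # inhibitory contribution
g.add_edge(("L7", 4),  ("logit", "Austin"), 1.10)  # feature drives an output token
print(g.top_contributors(("L7", 4)))
```

Ranking incoming edges by magnitude is the kind of query such a graph supports: reading off which upstream features most strongly drive a given feature or output.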
Key Points
- Introduces attribution graphs as a tool to trace computational circuits in Claude, mapping how features interact across layers to produce outputs.
- Uses sparse autoencoder (SAE) features as interpretable units to decompose and analyze model internals at a mechanistic level (a minimal sketch follows this list).
- Provides concrete case studies (the 'biology' of the model) showing how specific reasoning behaviors arise from identifiable internal circuits.
- Bridges the gap between individual feature-level and whole-model interpretability by analyzing multi-step information flow.
- Represents a significant step toward scalable mechanistic interpretability methods applicable to frontier AI systems.
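The SAE decomposition mentioned in the second point follows a standard recipe: an overcomplete linear autoencoder trained with a reconstruction loss plus an L1 penalty that pushes most feature activations toward zero. The following is a generic, minimal sketch of that recipe, not Anthropic's implementation; the dimensions and the `l1` coefficient are placeholder assumptions.

```python
# Minimal sparse autoencoder sketch: decompose a residual-stream activation
# into feature activations that become sparse under an L1 penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # overcomplete expansion
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # non-negative features (sparse after training)
        x_hat = self.decoder(f)          # reconstruction of the activation
        return x_hat, f

sae = SparseAutoencoder()
x = torch.randn(8, 512)                  # stand-in batch of residual activations
x_hat, f = sae(x)
l1 = 1e-3                                # illustrative sparsity coefficient
loss = ((x - x_hat) ** 2).mean() + l1 * f.abs().mean()  # reconstruction + sparsity
loss.backward()
print(f"mean active features per example: {(f > 0).float().sum(dim=-1).mean().item():.0f}")
```

Once trained, each feature direction in the decoder can be inspected as a candidate interpretable unit, which is what makes SAE features usable as the nodes of a circuit-level analysis.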
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Interpretability | Research Area | 66.0 |
Cached Content Preview
[Transformer Circuits Thread](https://transformer-circuits.pub/)
# On the Biology of a Large Language Model
## We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
### Authors
Jack Lindsey†, Wes Gurnee\*, Emmanuel Ameisen\*, Brian Chen\*, Adam Pearce\*, Nicholas L. Turner\*, Craig Citro\*,
David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton,
Trenton Bricken, Callum McDougall◊, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson,
Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson\*‡
### Affiliations
[Anthropic](https://www.anthropic.com/)
### Published
March 27, 2025
† Lead Contributor; \* Core Contributor; ‡ Correspondence to [joshb@anthropic.com](mailto:joshb@anthropic.com); ◊ Work performed while at Anthropic; [Author contributions statement below](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#appendix-author-contributions).
* * *
## § 1 [Introduction](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#introduction)
Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown. The black-box nature of models is increasingly unsatisfactory as they advance in intelligence and are deployed in a growing number of applications. Our goal is to reverse engineer how these models work on the inside, so we may better understand them and assess their fitness for purpose.
The challenges we face in understanding language models resemble those faced by biologists. Living organisms are complex systems which have been sculpted by billions of years of evolution. While the basic principles of evolution are straightforward, the biological mechanisms it produces are spectacularly intricate. Likewise, while language models are generated by simple, human-designed training algorithms, the mechanisms born of these algorithms appear to be quite complex.
Progress in biology is often driven by new tools. The development of the microscope allowed scientists to see cells for the first time, revealing a new world of structures invisible to the naked eye. In recent years, many research groups have made exciting progress on tools for probing the insides of language models (e.g.
- **Sparse Autoencoders Find Highly Interpretable Model Directions** [\[link\]](https://arxiv.org/pdf/2309.08600)
H. Cunningham, A. Ewart, L. Smith, R. Huben, L. Sharkey.
arXiv preprint arXiv:2309.08600. 2023.
- **Towards Monosemanticity: Decomposing Language Models With Dictionary Learning** [\[HTML\]](https://transformer-circuits.pub/2023/monosemantic-features/index.html)
T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn
... (truncated, 98 KB total)