Circuit Tracing Research
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Transformer Circuits
Published by Anthropic on transformer-circuits.pub in 2025, this paper is a key empirical contribution to mechanistic interpretability research, extending circuit analysis techniques to production-scale language models using attribution graphs and SAE features.
Metadata
Summary
This Anthropic research paper presents attribution graphs as a method for mechanistic interpretability, tracing computational circuits within Claude to understand how specific behaviors and reasoning patterns emerge. It applies sparse autoencoder features and circuit-level analysis to reveal the internal 'biology' of a large language model, providing empirical case studies of how information flows through transformer components during various tasks.
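To make the idea concrete, here is a toy, hypothetical sketch (not code from the paper) of the data structure an attribution graph amounts to: a weighted directed graph whose nodes are features (or output logits) and whose signed edge weights estimate how much one feature's activation contributes to another's input. The `AttributionGraph` class, the `(layer, index)` node naming, and all weights below are illustrative assumptions.

```python
# Hypothetical sketch of an attribution graph: feature nodes connected by
# signed edge weights that approximate upstream-to-downstream influence.
from collections import defaultdict

class AttributionGraph:
    def __init__(self):
        # edges[src][dst] = signed attribution weight
        self.edges = defaultdict(dict)

    def add_edge(self, src, dst, weight):
        self.edges[src][dst] = weight

    def top_contributors(self, dst, k=3):
        """Rank upstream features by |attribution| into `dst`."""
        incoming = [(src, w[dst]) for src, w in self.edges.items() if dst in w]
        return sorted(incoming, key=lambda sw: abs(sw[1]), reverse=True)[:k]

# Toy usage: nodes named by (layer, feature index); weights are made up.
g = AttributionGraph()
g.add_edge(("L3", 17), ("L7", 4), 0.82)            # strong positive contribution
g.add_edge(("L5", 2),  ("L7", 4), -0.31)           # inhibitory contribution
g.add_edge(("L7", 4),  ("logit", "Austin"), 1.10)  # feature drives an output token
print(g.top_contributors(("L7", 4)))
```

Ranking incoming edges by magnitude is the kind of query such a graph supports: reading off which upstream features most strongly drive a given feature or output.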
Key Points
- Introduces attribution graphs as a tool to trace computational circuits in Claude, mapping how features interact across layers to produce outputs.
- Uses sparse autoencoder (SAE) features as interpretable units to decompose and analyze model internals at a mechanistic level (a minimal sketch follows this list).
- Provides concrete case studies (the 'biology' of the model) showing how specific reasoning behaviors arise from identifiable internal circuits.
- Bridges the gap between individual feature-level and whole-model interpretability by analyzing multi-step information flow.
- Represents a significant step toward scalable mechanistic interpretability methods applicable to frontier AI systems.
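The SAE decomposition mentioned in the second point follows a standard recipe: an overcomplete linear autoencoder trained with a reconstruction loss plus an L1 penalty that pushes most feature activations toward zero. The following is a generic, minimal sketch of that recipe, not Anthropic's implementation; the dimensions and the `l1` coefficient are placeholder assumptions.

```python
# Minimal sparse autoencoder sketch: decompose a residual-stream activation
# into feature activations that become sparse under an L1 penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # overcomplete expansion
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # non-negative features (sparse after training)
        x_hat = self.decoder(f)          # reconstruction of the activation
        return x_hat, f

sae = SparseAutoencoder()
x = torch.randn(8, 512)                  # stand-in batch of residual activations
x_hat, f = sae(x)
l1 = 1e-3                                # illustrative sparsity coefficient
loss = ((x - x_hat) ** 2).mean() + l1 * f.abs().mean()  # reconstruction + sparsity
loss.backward()
print(f"mean active features per example: {(f > 0).float().sum(dim=-1).mean().item():.0f}")
```

Once trained, each feature direction in the decoder can be inspected as a candidate interpretable unit, which is what makes SAE features usable as the nodes of a circuit-level analysis.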
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Interpretability | Research Area | 66.0 |
Cached Content Preview
[Transformer Circuits Thread](https://transformer-circuits.pub/)
# On the Biology of a Large Language Model
## We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
### Authors
Jack Lindsey†, Wes Gurnee\*, Emmanuel Ameisen\*, Brian Chen\*, Adam Pearce\*, Nicholas L. Turner\*, Craig Citro\*,
David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton,
Trenton Bricken, Callum McDougall◊, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson,
Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson\*‡
### Affiliations
[Anthropic](https://www.anthropic.com/)
### Published
March 27, 2025
† Lead Contributor; \* Core Contributor; ‡ Correspondence to [joshb@anthropic.com](mailto:joshb@anthropic.com); ◊ Work performed while at Anthropic; [Author contributions statement below](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#appendix-author-contributions).
* * *
## § 1 [Introduction](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#introduction)
Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown. The black-box nature of models is increasingly unsatisfactory as they advance in intelligence and are deployed in a growing number of applications. Our goal is to reverse engineer how these models work on the inside, so we may better understand them and assess their fitness for purpose.
The challenges we face in understanding language models resemble those faced by biologists. Living organisms are complex systems which have been sculpted by billions of years of evolution. While the basic principles of evolution are straightforward, the biological mechanisms it produces are spectacularly intricate. Likewise, while language models are generated by simple, human-designed training algorithms, the mechanisms born of these algorithms appear to be quite complex.
Progress in biology is often driven by new tools. The development of the microscope allowed scientists to see cells for the first time, revealing a new world of structures invisible to the naked eye. In recent years, many research groups have made exciting progress on tools for probing the insides of language models (e.g.
- **Sparse Autoencoders Find Highly Interpretable Model Directions** [\[link\]](https://arxiv.org/pdf/2309.08600)
H. Cunningham, A. Ewart, L. Smith, R. Huben, L. Sharkey.
arXiv preprint arXiv:2309.08600. 2023.
- **Towards Monosemanticity: Decomposing Language Models With Dictionary Learning** [\[HTML\]](https://transformer-circuits.pub/2023/monosemantic-features/index.html)
T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn
... (truncated, 98 KB total)