Longterm Wiki

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Transformer Circuits

A foundational paper from Anthropic's Transformer Circuits thread; widely considered a landmark in mechanistic interpretability and essential reading for anyone studying how Transformer models generalize from context.

Metadata

Importance: 92/100 · blog post · primary source

Summary

This paper introduces 'induction heads,' a two-attention-head circuit in Transformers that enables in-context learning by identifying repeated token patterns and predicting their continuations. The authors provide mechanistic evidence that induction heads are the primary driver of in-context learning in small attention-only models, and present suggestive evidence this mechanism extends to large language models. The work is a landmark contribution to mechanistic interpretability as a path toward understanding and improving AI safety.
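The pattern-completion rule the summary describes can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: a real induction head performs this lookup via attention over learned representations, but the behavior it implements reduces to "find the last prior occurrence of the current token and copy what followed it."

```python
def induction_head_prediction(tokens):
    """Toy sketch of the induction-head rule: given [A][B] ... [A],
    predict [B] by scanning back for the most recent prior occurrence
    of the current token and copying the token that followed it."""
    current = tokens[-1]
    # Walk backwards over the context looking for an earlier `current`.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # the token that followed last time
    return None  # no prior occurrence: nothing to copy

print(induction_head_prediction(["the", "cat", "sat", "the"]))  # -> cat
```

Here `["the", "cat", ..., "the"]` instantiates the `[A][B] ... [A]` pattern, and the function completes it with `[B]`.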

Key Points

  • Induction heads implement a pattern-completion circuit: given [A][B]...[A], they predict [B] by looking back for prior occurrences of the current token.
  • The circuit requires two attention heads in composition: a 'previous token head' and the induction head itself, meaning it cannot exist in single-layer models.
  • Induction heads appear to be the primary mechanism behind in-context learning in small attention-only Transformers, with evidence suggesting relevance to larger models.
  • A phase change in training corresponds to the sudden formation of induction heads, coinciding with a sharp improvement in in-context learning ability.
  • Mechanistic interpretability—reverse-engineering internal model computations—is framed as a key approach to systematically addressing AI safety problems.
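The two-head composition in the second key point can be sketched numerically. The following is a minimal NumPy illustration under strong simplifying assumptions (one-hot "embeddings", hard-coded rather than trained heads, a single query): the previous-token head writes each position's preceding token into that position, and the induction head's keys read that slot, so its query (the current token `[A]`) attends to wherever `[A]` occurred before and copies the token found there (`[B]`).

```python
import numpy as np

# Toy sequence of token ids ending in a repeated token: [A][B] ... [A]
tokens = np.array([7, 3, 5, 1, 7])
vocab = 8
one_hot = np.eye(vocab)[tokens]        # (seq, vocab) stand-in embeddings

# Head 1, the "previous token head": writes each position's *previous*
# token into that position's residual stream.
prev_token = np.zeros_like(one_hot)
prev_token[1:] = one_hot[:-1]

# Head 2, the induction head (composing with head 1's output): its query
# is the current token [A]; its keys are each past position's previous
# token, so scores are high exactly where [A] occurred before.
q = one_hot[-1]
scores = prev_token[:-1] @ q           # match: "was my previous token [A]?"
attn = np.exp(scores) / np.exp(scores).sum()

# Output: copy the token *at* the attended position, i.e. [B].
pred = attn @ one_hot[:-1]
print(int(pred.argmax()))              # -> 3, the [B] that followed [A]
```

Because the keys depend on head 1's output, a single-layer model cannot form this circuit, which is why the composition requires two layers.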

Cited by 1 page

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 98 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)

# In-context Learning and Induction Heads

### Authors

Catherine Olsson∗, [Nelson Elhage∗](https://nelhage.com/), [Neel Nanda∗](https://www.neelnanda.io/), Nicholas Joseph†, Nova DasSarma†, Tom Henighan†, [Ben Mann†](https://benjmann.net/), Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, [Zac Hatfield-Dodds](https://zhd.dev/), Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, [Chris Olah‡](https://colah.github.io/)

### Affiliation

[Anthropic](https://www.anthropic.com/)

### Published

Mar 8, 2022

\* Core Research Contributor; † Core Infrastructure Contributor; ‡ Correspondence to [colah@anthropic.com](mailto:colah@anthropic.com); [Author contributions statement below](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html#author-contributions).

As Transformer generative models continue to scale and gain increasing real-world use, addressing their associated safety problems becomes increasingly important. Mechanistic interpretability – attempting to reverse engineer the detailed computations performed by the model – offers one possible avenue for addressing these safety issues. If we can understand the internal structures that cause Transformer models to produce the outputs they do, then we may be able to address current safety problems more systematically, as well as anticipate safety problems in future, more powerful models. Note that mechanistic interpretability is a subset of the broader field of interpretability, which encompasses many different methods for explaining the outputs of a neural network. Mechanistic interpretability is distinguished by its specific focus on systematically characterizing the internal circuitry of a neural net.

In the past, mechanistic interpretability has largely focused on CNN vision models, but recently, we [presented](https://transformer-circuits.pub/2021/framework/index.html) some very preliminary progress on mechanistic interpretability for Transformer language models. Specifically, in our prior work we developed a mathematical framework for decomposing the operations of Transformers, which allowed us to make sense of small (one- and two-layer attention-only) models and give a near-complete account of how they function. Perhaps the most interesting finding was the induction head, a circuit whose function is to look back over the sequence for previous instances of the current token (call it `A`), find the token that came after it last time (call it `B`), and then predict that the same completion will occur again (e.g. forming the sequence `[A][B] … [A] → [B]`). In other words, induction heads “complete the pattern” by copying and completing sequences that have occurred before. Mechanically, induction heads in our models ar

... (truncated, 98 KB total)
Resource ID: 23e5123e7f8f98e2 | Stable ID: MjUyMWU3Mz