Longterm Wiki

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Transformer Circuits

A foundational paper from Anthropic's Transformer Circuits thread; widely considered a landmark in mechanistic interpretability and essential reading for anyone studying how Transformer models generalize from context.

Metadata

Importance: 92/100 · blog post · primary source

Summary

This paper introduces 'induction heads,' a two-attention-head circuit in Transformers that enables in-context learning by identifying repeated token patterns and predicting their continuations. The authors provide mechanistic evidence that induction heads are the primary driver of in-context learning in small attention-only models, and present suggestive evidence this mechanism extends to large language models. The work is a landmark contribution to mechanistic interpretability as a path toward understanding and improving AI safety.
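The pattern-completion rule the summary describes can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: a real induction head performs this lookup via attention over learned representations, but the behavior it implements reduces to "find the last prior occurrence of the current token and copy what followed it."

```python
def induction_head_prediction(tokens):
    """Toy sketch of the induction-head rule: given [A][B] ... [A],
    predict [B] by scanning back for the most recent prior occurrence
    of the current token and copying the token that followed it."""
    current = tokens[-1]
    # Walk backwards over the context looking for an earlier `current`.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # the token that followed last time
    return None  # no prior occurrence: nothing to copy

print(induction_head_prediction(["the", "cat", "sat", "the"]))  # -> cat
```

Here `["the", "cat", ..., "the"]` instantiates the `[A][B] ... [A]` pattern, and the function completes it with `[B]`.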

Key Points

  • Induction heads implement a pattern-completion circuit: given [A][B]...[A], they predict [B] by looking back for prior occurrences of the current token.
  • The circuit requires two attention heads in composition: a 'previous token head' and the induction head itself, meaning it cannot exist in single-layer models.
  • Induction heads appear to be the primary mechanism behind in-context learning in small attention-only Transformers, with evidence suggesting relevance to larger models.
  • A phase change in training corresponds to the sudden formation of induction heads, coinciding with a sharp improvement in in-context learning ability.
  • Mechanistic interpretability—reverse-engineering internal model computations—is framed as a key approach to systematically addressing AI safety problems.
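The two-head composition in the second key point can be sketched numerically. The following is a minimal NumPy illustration under strong simplifying assumptions (one-hot "embeddings", hard-coded rather than trained heads, a single query): the previous-token head writes each position's preceding token into that position, and the induction head's keys read that slot, so its query (the current token `[A]`) attends to wherever `[A]` occurred before and copies the token found there (`[B]`).

```python
import numpy as np

# Toy sequence of token ids ending in a repeated token: [A][B] ... [A]
tokens = np.array([7, 3, 5, 1, 7])
vocab = 8
one_hot = np.eye(vocab)[tokens]        # (seq, vocab) stand-in embeddings

# Head 1, the "previous token head": writes each position's *previous*
# token into that position's residual stream.
prev_token = np.zeros_like(one_hot)
prev_token[1:] = one_hot[:-1]

# Head 2, the induction head (composing with head 1's output): its query
# is the current token [A]; its keys are each past position's previous
# token, so scores are high exactly where [A] occurred before.
q = one_hot[-1]
scores = prev_token[:-1] @ q           # match: "was my previous token [A]?"
attn = np.exp(scores) / np.exp(scores).sum()

# Output: copy the token *at* the attended position, i.e. [B].
pred = attn @ one_hot[:-1]
print(int(pred.argmax()))              # -> 3, the [B] that followed [A]
```

Because the keys depend on head 1's output, a single-layer model cannot form this circuit, which is why the composition requires two layers.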

Cited by 1 page

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 98 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)

# In-context Learning and Induction Heads

### Authors

Catherine Olsson∗, [Nelson Elhage∗](https://nelhage.com/), [Neel Nanda∗](https://www.neelnanda.io/), Nicholas Joseph†, Nova DasSarma†, Tom Henighan†, [Ben Mann†](https://benjmann.net/), Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, [Zac Hatfield-Dodds](https://zhd.dev/), Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, [Chris Olah‡](https://colah.github.io/)

### Affiliation

[Anthropic](https://www.anthropic.com/)

### Published

Mar 8, 2022

\* Core Research Contributor; † Core Infrastructure Contributor; ‡ Correspondence to [colah@anthropic.com](mailto:colah@anthropic.com); [Author contributions statement below](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html#author-contributions).

As Transformer generative models continue to scale and gain increasing real-world use, addressing their associated safety problems becomes increasingly important. Mechanistic interpretability – attempting to reverse engineer the detailed computations performed by the model – offers one possible avenue for addressing these safety issues. If we can understand the internal structures that cause Transformer models to produce the outputs they do, then we may be able to address current safety problems more systematically, as well as anticipate safety problems in future, more powerful models. Note that mechanistic interpretability is a subset of the broader field of interpretability, which encompasses many different methods for explaining the outputs of a neural network. Mechanistic interpretability is distinguished by its specific focus on systematically characterizing the internal circuitry of a neural net.

In the past, mechanistic interpretability has largely focused on CNN vision models, but recently, we [presented](https://transformer-circuits.pub/2021/framework/index.html) some very preliminary progress on mechanistic interpretability for Transformer language models. Specifically, in our prior work we developed a mathematical framework for decomposing the operations of Transformers, which allowed us to make sense of small (one- and two-layer attention-only) models and give a near-complete account of how they function. Perhaps the most interesting finding was the induction head, a circuit whose function is to look back over the sequence for previous instances of the current token (call it `A`), find the token that came after it last time (call it `B`), and then predict that the same completion will occur again (e.g. forming the sequence `[A][B] … [A] → [B]`). In other words, induction heads “complete the pattern” by copying and completing sequences that have occurred before. Mechanically, induction heads in our models ar

... (truncated, 98 KB total)
Resource ID: 23e5123e7f8f98e2 | Stable ID: MjUyMWU3Mz