Longterm Wiki

Neel Nanda's Mechanistic Interpretability Research Hub

web

Neel Nanda's mechanistic interpretability page aggregates his research posts, guides, and walkthroughs on understanding transformer internals, making it a central hub for researchers entering the mechanistic interpretability field.

Metadata

Importance: 82/100 | homepage

Summary

This page is a curated index of Neel Nanda's mechanistic interpretability work, including research posts on superposition, induction heads, attribution patching, and Othello-GPT, as well as introductory guides and paper walkthroughs. It covers both original research contributions and educational resources for newcomers to the field. The content ranges from foundational explainers to empirical findings about how transformers represent and process information.

Key Points

  • Hosts original research including attribution patching, emergent positional embeddings, and linear world representations in Othello-GPT.
  • Provides comprehensive educational resources: quickstart guides, prerequisites, glossaries, and annotated reading lists for mechanistic interpretability.
  • Includes walkthroughs of key papers like 'A Mathematical Framework for Transformer Circuits' and 'Toy Models of Superposition'.
  • Covers practical research methodology including activation patching, circuit analysis, and reverse-engineering model computations.
  • Serves as a primary entry point for researchers wanting to contribute to mechanistic interpretability as an AI safety research agenda.

Cited by 1 page

Page | Type | Quality
Probing / Linear Probes | Approach | 55.0

Cached Content Preview

HTTP 200 | Fetched Apr 28, 2026 | 4 KB
Neel Nanda
7/18/23
Tiny Mech Interp Projects: Emergent Positional Embeddings of Words
A rough post exploring the emergent positional embedding hypothesis: rather than representing "this is the token in position 5", models may represent e.g. "this token is the second name in the sentence"
 
Neel Nanda
3/28/23
Actually, Othello-GPT Has A Linear Emergent World Representation
A write up of work extending and building on the paper Emergent World Representations
 
Neel Nanda
3/12/23
Paper Replication Walkthrough: Reverse-Engineering Modular Addition
 
Neel Nanda
2/4/23
Attribution Patching: Activation Patching At Industrial Scale
A write-up of an incomplete project I worked on at Anthropic in early 2022, using gradient-based approximation to make activation patching far more scalable
 
Neel Nanda
2/4/23
Mech Interp Project Advising Call: Memorisation in GPT-2 Small
 
Neel Nanda
1/31/23
Mechanistic Interpretability Quickstart Guide
An intro guide to a mechanistic interpretability weekend hackathon
 
Neel Nanda
12/27/22
A Walkthrough of Toy Models of Superposition
 
Neel Nanda
12/26/22
Analogies between Software Reverse Engineering and Mechanistic Interpretability
 
Neel Nanda
12/25/22
Concrete Steps to Get Started in Transformer Mechanistic Interpretability
 
Neel Nanda
12/21/22
A Comprehensive Mechanistic Interpretability Explainer & Glossary
 
Neel Nanda
11/22/22
A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2
 
Neel Nanda
11/7/22
A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)
 
Neel Nanda
11/1/22
Re

... (truncated, 4 KB total)
Resource ID: 46841681c285ec4c | Stable ID: sid_UI8NLzDXIA