Neel Nanda's Mechanistic Interpretability Research Hub

web

neelnanda.io · neelnanda.io/mechanistic-interpretability

Neel Nanda's mechanistic interpretability page aggregates his research posts, guides, and walkthroughs on understanding transformer internals, making it a central hub for researchers entering the field.

Metadata

Importance: 82/100 | homepage

Summary

This page is a curated index of Neel Nanda's mechanistic interpretability work. It includes research posts on superposition, induction heads, attribution patching, and Othello-GPT, alongside introductory guides and paper walkthroughs for newcomers to the field. The content ranges from foundational explainers to empirical findings about how transformers represent and process information.

Key Points

• Hosts original research including attribution patching, emergent positional embeddings, and linear world representations in Othello-GPT.
• Provides comprehensive educational resources: quickstart guides, prerequisites, glossaries, and annotated reading lists for mechanistic interpretability.
• Includes walkthroughs of key papers like 'A Mathematical Framework for Transformer Circuits' and 'Toy Models of Superposition'.
• Covers practical research methodology including activation patching, circuit analysis, and reverse-engineering model computations.
• Serves as a primary entry point for researchers wanting to contribute to mechanistic interpretability as an AI safety research agenda.

Cited by 1 page

Page	Type	Quality
Probing / Linear Probes	Approach	55.0

Cached Content Preview

HTTP 200 | Fetched Apr 28, 2026 | 4 KB

Neel Nanda · 7/18/23
Tiny Mech Interp Projects: Emergent Positional Embeddings of Words
A rough post exploring the emergent positional embedding hypothesis: rather than representing "this is the token in position 5", models may represent e.g. "this token is the second name in the sentence".

Neel Nanda · 3/28/23
Actually, Othello-GPT Has A Linear Emergent World Representation
A write-up of work extending and building on the paper Emergent World Representations

Neel Nanda · 3/12/23
Paper Replication Walkthrough: Reverse-Engineering Modular Addition

Neel Nanda · 2/4/23
Attribution Patching: Activation Patching At Industrial Scale
 A write-up of an incomplete project I worked on at Anthropic in early 2022, using gradient-based approximation to make activation patching far more scalable
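To make the "gradient-based approximation" concrete, here is a minimal sketch on a toy problem. It is not the code from the post: the `metric` function, the weights, and the activation values are hypothetical stand-ins for a real model, where the gradient would come from a backward pass rather than a hand-written formula. Activation patching swaps one clean activation at a time into the corrupted run and re-evaluates the metric; the approximation instead uses a first-order Taylor expansion around the corrupted run, estimating each effect as (clean activation minus corrupted activation) times the metric's gradient.

```python
import math

def metric(acts, weights):
    """Toy nonlinear metric over a vector of activations (hypothetical)."""
    return math.tanh(sum(w * a for w, a in zip(weights, acts)))

def metric_grad(acts, weights):
    """Analytic gradient of the toy metric with respect to each activation."""
    s = sum(w * a for w, a in zip(weights, acts))
    return [(1.0 - math.tanh(s) ** 2) * w for w in weights]

def activation_patching(clean, corrupt, weights):
    """Exact effect of patching each clean activation into the corrupted run.

    Needs one metric evaluation per patched activation.
    """
    base = metric(corrupt, weights)
    effects = []
    for i in range(len(corrupt)):
        patched = list(corrupt)
        patched[i] = clean[i]
        effects.append(metric(patched, weights) - base)
    return effects

def attribution_patching(clean, corrupt, weights):
    """Linear estimate of every patching effect from a single gradient."""
    grad = metric_grad(corrupt, weights)
    return [(c - k) * g for c, k, g in zip(clean, corrupt, grad)]

# Hypothetical activations from a "clean" and a "corrupted" input.
clean_acts = [0.5, -0.2, 0.1]
corrupt_acts = [0.45, -0.25, 0.12]
toy_weights = [1.0, -2.0, 0.5]

print("exact :", activation_patching(clean_acts, corrupt_acts, toy_weights))
print("approx:", attribution_patching(clean_acts, corrupt_acts, toy_weights))
```

The exact version costs one evaluation per activation, while the estimate needs only one gradient for all of them, which is where the scalability comes from. The estimate is only reliable when the clean and corrupted activations are close enough for the linear approximation to hold.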

 
 
 

Neel Nanda · 2/4/23
Mech Interp Project Advising Call: Memorisation in GPT-2 Small

Neel Nanda · 1/31/23
Mechanistic Interpretability Quickstart Guide
An intro guide to a mechanistic interpretability weekend hackathon

Neel Nanda · 12/27/22
A Walkthrough of Toy Models of Superposition

Neel Nanda · 12/26/22
Analogies between Software Reverse Engineering and Mechanistic Interpretability

Neel Nanda · 12/25/22
Concrete Steps to Get Started in Transformer Mechanistic Interpretability

Neel Nanda · 12/21/22
A Comprehensive Mechanistic Interpretability Explainer & Glossary

Neel Nanda · 11/22/22
A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

Neel Nanda · 11/7/22
A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda · 11/1/22
 Re

... (truncated, 4 KB total)

Resource ID: 46841681c285ec4c | Stable ID: sid_UI8NLzDXIA