Mechanistic interpretability work

web

Neel Nanda is one of the most prolific researchers in mechanistic interpretability; his homepage aggregates papers, blog posts, and tools that are frequently cited as entry points into the field.

Metadata

Importance: 72/100 | homepage

Summary

Neel Nanda's personal research homepage focuses on mechanistic interpretability of neural networks: reverse-engineering how transformers and other models implement algorithms internally. His work includes a mechanistic analysis of grokking, research related to superposition in neural networks, and TransformerLens, a widely used tool for interpretability research.

Key Points

• Leads mechanistic interpretability research at Google DeepMind (previously at Anthropic), focusing on understanding transformer internals
• Developed TransformerLens, a widely used open-source library for interpretability research on GPT-style language models
• Contributed work on the superposition hypothesis, which explains how neural networks can represent more features than they have dimensions
• Researched grokking (delayed generalization in neural networks), providing mechanistic explanations for the phenomenon
• Produces educational content, tutorials, and research agendas to grow the mechanistic interpretability research community

Cited by 2 pages

Page	Type	Quality
Neel Nanda	Person	26.0
Deceptive Alignment	Risk	75.0

Cached Content Preview

HTTP 200 | Fetched Jun 29, 2026 | 8 KB

Neel Nanda 
 
 8/19/25
 
 MATS Applications Open (Due Aug 29)
 

 
 I am looking for people who want to be supervised by me to write a mech interp paper. Apply here now! Due Aug 29

 
 
 

 
 Neel Nanda 
 
 5/26/25
 
 Post 51: Socratic Persuasion: Giving Opinionated Yet Truth-Seeking Advice
 

 
 I recommend giving advice by asking questions to walk someone through key steps in my argument — often I’m missing key info, which comes up quickly as an unexpected answer, while if I’m right I’m more persuasive

 
 
 

 
 Neel Nanda 
 
 3/22/25
 
 Post 50: Good Research Takes are Not Sufficient for Good Strategic Takes
 

 
 Having a good research track record is some evidence of good big-picture takes about AGI, but it's weak evidence. Strategic thinking is hard, and requires different skills. But people often conflate these skills, leading to excessive deference to researchers in the field, without evidence that that person is good at strategic thinking specifically.

 
 
 

 
 Neel Nanda 
 
 8/18/22
 
 Interlude: A Mechanistic Interpretability Analysis of Grokking
 

 
 Link post for some independent interpretability research I did

 
 
 

 
 Neel Nanda 
 
 6/17/22
 
 Post 49: Things That Make Me Enjoy Giving Career Advice
 

 
 Thoughts on what makes giving career advice more fun for me

 
 
 

 
 Neel Nanda 
 
 6/15/22
 
 Post 48: Prioritise Tasks by Rating not Sorting
 

 
 A short note on priority ordering tasks by rating them out of 10, rather than explicitly sorting into a priority order

 
 
 

 
 Neel Nanda 
 
 2/27/22
 
 Post 47: How I Formed My Own Views About AI Safety
 

 
 How I formed my own views on the complex topic of ‘will AI kill us all, and should I work on stopping this’, and traps I fell into

 
 
 

 
 Neel Nanda 
 
 2/22/22
 
 Post 46: Reward Good Bets That Had Bad Outcomes
 

 
 In many important areas of life, I want to persevere through many failures for a few big successes. As a highly anxious person, this is hard! I instead focus on whether I made a good bet, not whether it failed.

 
 
 

 
 Neel Nanda 
 
 2/11/22
 
 Post 45: Simplify EA Pitches to “Holy Shit, X-Risk”
 

 
 I defend the key claims of “AI has a >=1% chance and biorisk has a 0.1% chance of causing x-risk within my lifetime”, and argue 

... (truncated, 8 KB total)

Resource ID: 028435b427f72e06 | Stable ID: sid_VWiB2bX3Fb