Neel Nanda
Comprehensive biographical profile of Neel Nanda covering his role as DeepMind's mechanistic interpretability team lead, key contributions (TransformerLens, Gemma Scope, grokking paper), and his evolving views on interpretability's limitations and strategic pivot toward applied safety problems. Notably includes his 2025 admission that some high-profile mech interp results were 'wrong or significantly weaker than originally claimed' and that the 'most ambitious vision of the field is probably dead.'
Quick Assessment
| Aspect | Assessment |
|---|---|
| Primary Role | Senior Research Scientist and Mechanistic Interpretability Team Lead, Google DeepMind (2023–present) |
| Key Contributions | Creator of TransformerLens (∼3,100 GitHub stars, 112+ contributors); co-author of "A Mathematical Framework for Transformer Circuits" (2021); lead on Gemma Scope (400+ open sparse autoencoders); ICLR 2023 Spotlight for grokking paper |
| Key Publications | "A Mathematical Framework for Transformer Circuits" (2021); "Progress Measures for Grokking via Mechanistic Interpretability" (ICLR 2023 Spotlight); "Towards Principled Evaluations of Sparse Autoencoders (SAEs)" (ICLR 2025); "SAEBench" (ICML 2025) |
| Institutional Affiliation | Google DeepMind; previously Anthropic (2021–2022) |
| Influence on AI Safety | Builds open-source tooling and training curricula for mechanistic interpretability; has mentored approximately 50 junior researchers, with 7 subsequently placed at major AI companies; named to MIT Technology Review Innovators Under 35 (2025) |
Overview
Neel Nanda is a Senior Research Scientist and Mechanistic Interpretability Team Lead at Google DeepMind, where he leads research into reverse-engineering neural networks to understand how they implement algorithms. He studied Mathematics at Trinity College, Cambridge (2017–2020), then worked at Anthropic as a language model interpretability researcher under Chris Olah (2021–2022), before joining DeepMind's mechanistic interpretability team in 2023.[1][2]
He is the creator of TransformerLens, an open-source library that has become a widely used tool in the interpretability research community, and a co-author of "A Mathematical Framework for Transformer Circuits," an influential Anthropic paper introducing a mathematical vocabulary for analyzing transformer behavior. His team at DeepMind produced Gemma Scope, a collection of over 400 openly released sparse autoencoders for Gemma 2 models.[3] As of 2025, Nanda had accrued over 13,000 citations on Google Scholar, an h-index of 31, and had co-authored papers appearing at NeurIPS, ICLR, and ICML.[4][5]
Nanda has also been active in training the next generation of interpretability researchers, having mentored approximately 50 junior researchers through programs including SERI MATS, with 15 co-authored papers published at top ML venues and 7 mentees subsequently placed at major AI companies.[6]
Background
Neel Nanda studied Mathematics at Trinity College, Cambridge from 2017 to 2020.[1] Following graduation, he joined Anthropic in 2021 as a researcher on the interpretability team, working under Chris Olah on language model interpretability.[2] He left Anthropic in 2022, spent a period doing independent mechanistic interpretability research, and joined Google DeepMind's mechanistic interpretability team in 2023.[1][2]
At DeepMind, Nanda joined expecting to be an individual contributor. When the prior team lead stepped down, he took on the role of team lead despite having no prior management experience.[7][8] As of 2025, he holds the title of Senior Research Scientist and Mechanistic Interpretability Team Lead.[4][7]
Nanda maintains an active presence in the AI alignment research community through LessWrong and the Alignment Forum, where he publishes tutorials and research updates.
Research Contributions
TransformerLens Library
Nanda created TransformerLens, an open-source Python library for interpretability research on transformer models.[9] Created in 2022, the library is now maintained by a broader contributor community under the TransformerLensOrg GitHub organization, with Bryce Meyer as a key maintainer.[10]
As of 2025, the repository has approximately 3,100 GitHub stars, 513 forks, and 112 contributors, with 796 closed pull requests.[10] The library provides:
- Programmatic access to model activations at each layer
- Hook functions for intervention experiments
- Support for loading 50+ open-source language models from Hugging Face
- Visualization utilities for attention patterns
The library is formally cited as: Nanda, N. & Bloom, J. (2022). TransformerLens.[10] A companion CircuitsVis repository provides additional visualization tools and has accrued 297 stars and 40 forks.[11] Version 2.0 removed Hooked SAE functionality, which was migrated to the separate SAELens library.[10]
Alternative interpretability libraries exist (e.g., baukit, pyvene, nnsight), though TransformerLens's hook-based API and extensive model support have made it a common choice for circuit-analysis research.
Transformer Circuits Research
Nanda is a co-author of "A Mathematical Framework for Transformer Circuits" (2021), published by the Anthropic interpretability team.[12] The paper lists Nelson Elhage, Neel Nanda, and Catherine Olsson as core research contributors, with additional contributors including Tom Henighan, Nicholas Joseph, Ben Mann, and others; Chris Olah handled correspondence.[13] Nanda has described it as "the coolest paper I've ever had the privilege of working on" and "the most awesome paper I've been involved in."[14]
The research:
- Introduced a mathematically equivalent framework for understanding transformer operations
- Identified "induction heads" as circuits that enable in-context learning
- Demonstrated that attention mechanisms compose to perform multi-step reasoning
- Showed that induction heads only develop in models with at least two attention layers
- Provided mathematical descriptions of how models track positional and semantic information
The paper built on earlier work from Anthropic's interpretability team examining circuits in vision models.
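The induction-head behavior identified in the paper can be stated as a plain algorithm. The sketch below is behavioral only: it shows what an induction head computes, not how two composed attention layers implement it.

```python
def induction_predict(tokens):
    """Predict the next token the way an induction head does: find the most
    recent earlier occurrence of the current token, then copy the token that
    followed it ([A][B] ... [A] -> predict [B])."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: nothing to copy

print(induction_predict(["The", "cat", "sat", ".", "The"]))  # -> "cat"
```

This prefix-matching-and-copying behavior is why induction heads require at least two attention layers: one head must first attend to the previous token so a later head can match against it.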
Grokking and Modular Arithmetic
Nanda co-authored "Progress Measures for Grokking via Mechanistic Interpretability" (January 2023, with Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt), which received an ICLR 2023 Spotlight designation.[15] The paper investigates the "grokking" phenomenon, a delayed generalization observed in small transformers trained on modular arithmetic, and fully reverse-engineers the learned algorithm using discrete Fourier transforms and trigonometric identities.[15]
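The reverse-engineered algorithm can be illustrated directly. In this simplified sketch, the logit for each candidate answer c is a sum of cosines over a few "key frequencies"; the trained model in the paper learns a specific handful of frequencies, whereas the values below are arbitrary stand-ins.

```python
import math

def modular_addition_logits(a, b, p=113, key_freqs=(1, 5, 9)):
    """Sketch of the grokked algorithm for (a + b) mod p: the logit for each
    candidate c sums cos(2*pi*k*(a+b-c)/p) over key frequencies k. Every term
    equals 1 exactly when c == (a + b) mod p, so the correct answer gets the
    highest logit; other candidates suffer destructive interference."""
    return [
        sum(math.cos(2 * math.pi * k * (a + b - c) / p) for k in key_freqs)
        for c in range(p)
    ]

logits = modular_addition_logits(17, 40)
print(max(range(113), key=lambda c: logits[c]))  # -> (17 + 40) % 113 = 57
```

The paper's contribution was showing that a trained transformer implements exactly this kind of trigonometric computation internally, recoverable via Fourier analysis of its weights and activations.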
Additional Research Areas
Nanda has published work on:
- Indirect Object Identification: The paper "Interpretability in the Wild" (from Redwood Research, with Kevin Wang, Arthur Conmy, and Alexandre Variengien) reverse-engineers a 26-head circuit in GPT-2 Small for identifying indirect objects in sentences (e.g., predicting "Mary" in "After John and Mary went to the shops, John gave a bottle of milk to ___"). Nanda hosted a multi-part walkthrough of this paper with its authors and used it extensively as a case study in his research agenda posts.[16]
- Sparse Autoencoders (SAEs): Nanda's DeepMind team has produced a series of SAE papers including Gated SAEs (NeurIPS 2024, led by Sen Rajamanoharan),[17] JumpReLU SAEs,[17] and SAEBench (ICML 2025),[18] as well as "Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control" (ICLR 2025).[18]
- Gemma Scope: A collection of over 400 freely available open SAEs for Gemma 2 9B and Gemma 2 2B, authored with Tom Lieberum, Sen Rajamanoharan, Arthur Conmy, Lewis Smith, Nic Sonnerat, Vikrant Varma, Janos Kramar, and others.[3] The goal was to enable the broader safety research community to explore model internals using open-source models and tools.
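A sparse autoencoder of the kind these papers study can be sketched in a few lines. This is an illustrative vanilla ReLU SAE with made-up dimensions and untrained random weights, not the Gemma Scope training recipe; variants like Gated and JumpReLU SAEs change the nonlinearity and the sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # SAEs are overcomplete: far more features than dims

W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode a model activation vector into (ideally sparse) nonnegative
    features, then reconstruct it. Training minimizes reconstruction error
    plus an L1 sparsity penalty on the feature activations."""
    feats = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps features >= 0
    x_hat = feats @ W_dec + b_dec               # linear decoder
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(feats).sum()
    return feats, x_hat, loss

feats, x_hat, loss = sae_forward(rng.normal(size=d_model))
print(int((feats > 0).sum()), "of", d_sae, "features active")
```

The interpretability bet is that, after training on real model activations, each of the d_sae feature directions corresponds to a human-understandable concept, which is what benchmarks like SAEBench attempt to evaluate.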
Publication Record
As of 2025, Nanda's Google Scholar profile lists over 13,000 total citations and identifies him as Mechanistic Interpretability Team Lead at Google DeepMind.[4] His Semantic Scholar profile lists 62 publications, an h-index of 31, and 7,759 total citations, 880 of them highly influential.[5] Peer-reviewed venues where his work has appeared include:
| Venue | Year | Paper |
|---|---|---|
| ICLR (Spotlight) | 2023 | Progress Measures for Grokking via Mechanistic Interpretability |
| ICML | 2023 | A Toy Model of Universality |
| NeurIPS | 2024 | Gated SAEs (led by Sen Rajamanoharan) |
| ICLR | 2025 | Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control |
| ICLR | 2025 | Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models |
| ICML | 2025 | SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability |
| ICML | 2025 | Learning Multi-Level Features with Matryoshka Sparse Autoencoders |
| ICML | 2025 | Are Sparse Autoencoders Useful? A Case Study in Sparse Probing |
Sources: DBLP,[18] Semantic Scholar[5]
Educational Content
Nanda publishes interpretability tutorials and explanations on his blog at neelnanda.io and through video content. His "200 Concrete Open Problems in Mechanistic Interpretability" sequence, published on the Alignment Forum from December 28, 2022 through January 19, 2023, outlines research directions for the field.[19][20] The sequence covers topics including: Concrete Steps to Get Started, The Case for Analysing Toy Language Models, Looking for Circuits in the Wild, Interpreting Algorithmic Problems, Exploring Polysemanticity and Superposition, Analysing Training Dynamics, Techniques, Tooling and Automation, Image Model Interpretability, Interpreting Reinforcement Learning, and Studying Learned Features in Language Models.[20] A Manifold Markets prediction market tracking progress on these problems resolved at 14% solved by 2025.[21]
Nanda has written guides for newcomers to interpretability research, including walkthroughs of TransformerLens usage and explanations of foundational papers. These materials are distributed as blog posts, Jupyter notebooks, and video tutorials.
His materials have been incorporated into the ARENA (Alignment Research Engineer Accelerator) curriculum, a 4–5 week ML bootcamp focused on AI safety. The interpretability chapter of ARENA is largely based on Nanda's open-source material; in the ARENA 2.0 impact report, 6 of 15 participants in Nanda's SERI MATS cohort rated ARENA content among their top three most helpful resources.[22] ARENA has continued to run annually (ARENA 7.0 ran January–February 2026) and credits Nanda and the Google DeepMind Interpretability Team in its interpretability curriculum.[23]
Through the SERI MATS mentorship program, Nanda has mentored approximately 50 junior researchers over roughly three years, supervised 30+ MATS papers, and co-authored 15 papers published at top ML venues. As of his 2024 MATS applications post, 7 mentees had gone on to work at major AI companies.[6]
Research Approach
Nanda's work emphasizes:
- Circuit discovery: Identifying specific subnetworks responsible for model behaviors
- Mechanistic explanations: Describing algorithms implemented by neural networks in mathematical or computational terms
- Open-source tooling: Building software infrastructure to enable interpretability experiments
- Reproducible examples: Providing code and notebooks that others can run and modify
His stated view is that understanding neural network internals is necessary for evaluating AI safety properties, though he acknowledges that current interpretability techniques may not directly transfer to more capable future systems.[24] Since approximately mid-2024, Nanda has described a strategic shift: his team's focus has moved from basic science (ambitious reverse-engineering of circuits) toward applying interpretability skills to real-world safety problems such as model monitoring.[25] He has stated that the team "does not particularly care" whether it is called the "Mechanistic Interpretability Team"; the priority is applying available skills to the most impactful safety problems.[25]
Perspectives on Interpretability and Alignment
In posts and presentations, Nanda has outlined reasons why mechanistic interpretability research may contribute to AI alignment. He identifies four potential applications of interpretability:
- Failure diagnosis: Identifying the mechanisms behind unexpected model behaviors
- Capability evaluation: Determining what tasks models can perform by examining their internal algorithms
- Deception detection: Searching for representations that indicate models are optimizing for objectives different from their training signal
- Verification: Checking whether specific safety properties hold in a model's learned algorithms
He notes that these applications remain speculative and that the field is in the early stages of developing techniques that scale to frontier models.[26] Researchers have raised specific questions about whether circuit-level analysis scales to deception detection in practice; a LessWrong critique notes that "when researchers claim to identify a subnetwork performing a specific task, we should be very cautious — it's easy to find circuits in networks that do arbitrary things without doing anything more impressive than performance-guided network compression."[27] Nanda engaged with that critique and is thanked in the post for correcting a mistake in an earlier draft.[27]
Nanda stated in 2025 that there have been "very exciting, widely popularized results from mechanistic interpretability that subsequent research has shown are either wrong or significantly weaker than originally claimed," and that the most ambitious vision of the field is "probably dead": he no longer sees a path to deeply and reliably understanding what AI systems are thinking before competitive pressures push deployment of superhuman AI.[28] He now describes interpretability as functioning like "model biology," a tool for determining whether a model's behavior warrants concern, and advocates a "Swiss cheese" approach of layering multiple safeguards rather than relying on any single alignment technique.[28]
On AI timelines, Nanda has a deliberate policy of not publicly disclosing his p(doom) or specific timeline estimates, citing concern that prominent researchers' stated views cause others to anchor on them without asking whether interpretability expertise translates into forecasting expertise.[8] He has said that AGI arriving within the next 10–20 years, sooner, or later are all plausible scenarios, and argues that safety work should proceed as though each is realistic.[8]
Current Work
At Google DeepMind, Nanda holds the title of Senior Research Scientist and Mechanistic Interpretability Team Lead.[7][4] Team outputs during his tenure include Gemma Scope (400+ open-weight SAEs released July 2024),[3] Gated SAEs (NeurIPS 2024),[17] JumpReLU SAEs,[17] and SAEBench (ICML 2025).[18] Recent papers from the team also include "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful" and "Learning Multi-Level Features with Matryoshka Sparse Autoencoders" (both 2025).[18]
In 2025, Nanda was named to MIT Technology Review's Innovators Under 35 list for his mechanistic interpretability research and work to build the field.[29]
Limitations and Open Questions
Nanda has identified challenges for mechanistic interpretability research:[30]
- Current techniques are labor-intensive and may not scale to models with hundreds of billions of parameters
- Many discovered circuits are descriptive rather than predictive, explaining behavior post-hoc without enabling intervention
- It remains unclear whether interpretability of current models will provide insights applicable to future AI systems with different architectures or training methods
External critiques of the broader research program include the concern that circuits identified through activation-based methods may not correspond to causally meaningful computational units, and that the dominant focus on circuits-style interpretability narrows the field's approach to model verification and deceptive alignment detection.[27] Nanda himself has acknowledged that some earlier high-profile results in the field were "wrong or significantly weaker than originally claimed."[28]
Questions about how interpretability should be prioritized relative to other AI alignment approaches, such as scalable oversight or Constitutional AI, remain open; Nanda argues for treating these as complementary rather than competing.[28]
External Links
- neelnanda.io — personal website, blog posts, and tutorials
- TransformerLens GitHub — library source code and documentation
- 200 Concrete Open Problems in Mechanistic Interpretability — Alignment Forum sequence (December 2022–January 2023)
- Google Scholar profile — full publication list
- LessWrong posts — research updates and explanations
Footnotes
1. Neel Nanda, OpenReview profile. https://openreview.net/profile?id=~Neel_Nanda1
2. Neel Nanda. "About." neelnanda.io. https://www.neelnanda.io/about
3. Google DeepMind. "Gemma Scope: Helping the Safety Community Shed Light on the Inner Workings of Language Models." DeepMind Blog, July 2024. https://deepmind.google/discover/blog/gemma-scope-helping-the-safety-community-shed-light-on-the-inner-workings-of-language-models/
4. Neel Nanda, Google Scholar. https://scholar.google.com/citations?user=GLnX3MkAAAAJ&hl=en
5. Neel Nanda, Semantic Scholar. https://www.semanticscholar.org/author/Neel-Nanda/2051128902
6. Neel Nanda. "MATS Applications." neelnanda.io, 2024. https://www.neelnanda.io/blog/mats-apps-9
7. Neel Nanda, LinkedIn. https://uk.linkedin.com/in/neel-nanda%F0%9F%94%B8-993580151
8. Rob Wiblin / 80,000 Hours. "Neel Nanda on leading a Google DeepMind team at 26 (Part 2)." 80,000 Hours Podcast, October 21, 2025. https://80000hours.org/podcast/episodes/neel-nanda-career-advice-frontier-ai-companies/
9. TransformerLensOrg/TransformerLens, GitHub. https://github.com/TransformerLensOrg/TransformerLens
10. TransformerLensOrg/TransformerLens, GitHub. https://github.com/TransformerLensOrg/TransformerLens
11. TransformerLens official documentation. https://transformerlensorg.github.io/TransformerLens/
12. Elhage, Nanda, Olsson, et al. "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread, December 22, 2021. https://transformer-circuits.pub/2021/framework/index.html
13. Elhage, Nanda, Olsson, et al. "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread, December 22, 2021. https://transformer-circuits.pub/2021/framework/index.html
14. Neel Nanda. "A Walkthrough of A Mathematical Framework for Transformer Circuits." LessWrong, October 14, 2022. https://www.lesswrong.com/posts/hBtjpY2wAASEpZXgN/a-walkthrough-of-a-mathematical-framework-for-transformer
15. Nanda, Chan, Lieberum, Smith, Steinhardt. "Progress Measures for Grokking via Mechanistic Interpretability." arXiv:2301.05217, January 2023. https://arxiv.org/abs/2301.05217
16. Neel Nanda. "A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)." LessWrong, 2022. https://www.lesswrong.com/posts/DZk6mRo9vhCXN9Rfn/a-walkthrough-of-interpretability-in-the-wild-w-authors
17. MIT Technology Review. "Google DeepMind Has a New Way to Look Inside an AI's 'Mind'." November 14, 2024. https://www.technologyreview.com/2024/11/14/1106871/google-deepmind-has-a-new-way-to-look-inside-an-ais-mind/
18. Neel Nanda, DBLP. https://dblp.org/pid/285/6389.html
19. Neel Nanda. "200 Concrete Open Problems in Mechanistic Interpretability." AI Alignment Forum sequence, December 28, 2022 – January 19, 2023. https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj
20. Neel Nanda. "200 Concrete Open Problems in Mechanistic Interpretability." AI Alignment Forum sequence, December 28, 2022 – January 19, 2023. https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj
21. Manifold Markets. "By 2025, Percent of 200 Concrete Open Problems Solved?" https://manifold.markets/duck_master/by-2025-percent-of-200-concrete-ope
22. ARENA team. "ARENA 2.0 - Impact Report." EA Forum, September 26, 2023. https://forum.effectivealtruism.org/posts/C7DbrkCpSe4AdcMek/arena-2-0-impact-report
23. ARENA team. "ARENA 7.0 - Call for Applicants." EA Forum, September 30, 2025. https://forum.effectivealtruism.org/posts/MDw6QDyrLP3kpehd2/arena-7-0-call-for-applicants
24. Neel Nanda. "A Pragmatic Vision for Interpretability." LessWrong, 2024. https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability
25. Neel Nanda. "A Pragmatic Vision for Interpretability." LessWrong, 2024. https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability
26. Neel Nanda. "200 Concrete Open Problems in Mechanistic Interpretability." AI Alignment Forum sequence. https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj
27. LessWrong. "EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety." https://www.lesswrong.com/posts/wt7HXaCWzuKQipqz3/eis-vi-critiques-of-mechanistic-interpretability-work-in-ai
28. Rob Wiblin / 80,000 Hours. "Neel Nanda on the Race to Read AI Minds (Part 1)." 80,000 Hours Podcast, October 14, 2025. https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/
29. MIT Technology Review. "Neel Nanda." Innovators Under 35, 2025. https://www.technologyreview.com/innovator/neel-nanda/
30. Rob Wiblin / 80,000 Hours. "Neel Nanda on the Race to Read AI Minds (Part 1)." 80,000 Hours Podcast, October 14, 2025. https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/