Longterm Wiki · Updated 2026-03-16
Neel Nanda

Person


Comprehensive biographical profile of Neel Nanda covering his role as DeepMind's mechanistic interpretability team lead, key contributions (TransformerLens, Gemma Scope, grokking paper), and his evolving views on interpretability's limitations and strategic pivot toward applied safety problems. Notably includes his 2025 admission that some high-profile mech interp results were 'wrong or significantly weaker than originally claimed' and that the 'most ambitious vision of the field is probably dead.'

Affiliation: Google DeepMind
Role: Research Scientist, Google DeepMind
Known For: Mechanistic interpretability, TransformerLens library, educational content
2.6k words · 16 backlinks

Quick Assessment

| Aspect | Assessment |
| --- | --- |
| Primary Role | Senior Research Scientist and Mechanistic Interpretability Team Lead, Google DeepMind (2023–present) |
| Key Contributions | Creator of TransformerLens (∼3,100 GitHub stars, 112+ contributors); co-author of "A Mathematical Framework for Transformer Circuits" (2021); lead on Gemma Scope (400+ open sparse autoencoders); ICLR 2023 Spotlight for grokking paper |
| Key Publications | "A Mathematical Framework for Transformer Circuits" (2021); "Progress Measures for Grokking via Mechanistic Interpretability" (ICLR 2023 Spotlight); "Towards Principled Evaluations of Sparse Autoencoders (SAEs)" (ICLR 2025); "SAEBench" (ICML 2025) |
| Institutional Affiliation | Google DeepMind; previously Anthropic (2021–2022) |
| Influence on AI Safety | Builds open-source tooling and training curricula for mechanistic interpretability; has mentored approximately 50 junior researchers, with 7 subsequently placed at major AI companies; named to MIT Technology Review Innovators Under 35 (2025) |

Overview

Neel Nanda is a Senior Research Scientist and Mechanistic Interpretability Team Lead at Google DeepMind, where he leads research into reverse-engineering neural networks to understand how they implement algorithms. He studied Mathematics at Trinity College, Cambridge (2017–2020), then worked at Anthropic as a language model interpretability researcher under Chris Olah (2021–2022), before joining DeepMind's mechanistic interpretability team in 2023.12

He is the creator of TransformerLens, an open-source library that has become a widely used tool in the interpretability research community, and a co-author of "A Mathematical Framework for Transformer Circuits," an influential Anthropic paper introducing a mathematical vocabulary for analyzing transformer behavior. His team at DeepMind produced Gemma Scope, a collection of over 400 openly released sparse autoencoders for Gemma 2 models.3 As of 2025, Nanda had accrued over 13,000 citations on Google Scholar, an h-index of 31, and had co-authored papers appearing at NeurIPS, ICLR, and ICML.45

Nanda has also been active in training the next generation of interpretability researchers, having mentored approximately 50 junior researchers through programs including SERI MATS, with 15 co-authored papers published at top ML venues and 7 mentees subsequently placed at major AI companies.6

Background

Neel Nanda studied Mathematics at Trinity College, Cambridge from 2017 to 2020.1 Following graduation, he joined Anthropic in 2021 as a researcher in the interpretability team, working under Chris Olah on language model interpretability.2 He left Anthropic in 2022 and did a period of independent mechanistic interpretability research before joining Google DeepMind's mechanistic interpretability team in 2023.12

At DeepMind, Nanda joined expecting to be an individual contributor. When the prior team lead stepped down, he took on the role of team lead despite having no prior management experience.78 As of 2025, he holds the title of Senior Research Scientist and Mechanistic Interpretability Team Lead.47

Nanda maintains an active presence in the AI alignment research community through LessWrong and the Alignment Forum, where he publishes tutorials and research updates.

Research Contributions

TransformerLens Library

Nanda created TransformerLens, an open-source Python library for interpretability research on transformer models.9 The library was originally created in 2022 and is now maintained by a broader contributor community under the TransformerLensOrg GitHub organization, with Bryce Meyer as a key maintainer.10

As of 2025, the repository has approximately 3,100 GitHub stars, 513 forks, and 112 contributors, with 796 closed pull requests.10 The library provides:

  • Programmatic access to model activations at each layer
  • Hook functions for intervention experiments
  • Support for loading 50+ open-source language models from Hugging Face
  • Visualization utilities for attention patterns

The library is formally cited as: Nanda, N. & Bloom, J. (2022). TransformerLens.10 A companion CircuitsVis repository provides additional visualization tools and has accrued 297 stars and 40 forks.11 Version 2.0 removed Hooked SAE functionality, which was migrated to the separate SAELens library.10
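
The hook-based design that distinguishes TransformerLens can be illustrated with a minimal, library-free sketch. This is a conceptual toy, not TransformerLens's actual implementation; the class names and the `blocks.0.hook_mid` label are invented for illustration. The idea it shows is the one described above: named hook points between layers let experimenters cache activations or intervene on them.

```python
# Toy illustration of the activation-hook pattern: each "layer" exposes a
# named hook point, and registered functions can observe or edit the
# activation as it flows through the model.
class HookPoint:
    def __init__(self, name):
        self.name = name
        self.hooks = []

    def __call__(self, activation):
        # Run every registered hook in order; a hook returning a value
        # replaces the activation (an "intervention").
        for fn in self.hooks:
            out = fn(activation, self.name)
            if out is not None:
                activation = out
        return activation

class ToyModel:
    def __init__(self):
        self.hook_mid = HookPoint("blocks.0.hook_mid")

    def forward(self, x):
        h = x * 2             # stand-in for a real layer computation
        h = self.hook_mid(h)  # hook point between layers
        return h + 1

model = ToyModel()
cache = {}
# First hook caches the activation; second ablates it to zero.
model.hook_mid.hooks.append(lambda act, name: cache.setdefault(name, act))
model.hook_mid.hooks.append(lambda act, name: act * 0)

print(model.forward(3))            # 1: the ablation zeroed the mid activation
print(cache["blocks.0.hook_mid"])  # 6: value cached before the ablation
```

The same two moves, caching for observation and patching for intervention, are what the library's hooks support on real transformer activations.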

Alternative interpretability libraries exist (e.g., baukit, pyvene, nnsight), though TransformerLens's hook-based API and extensive model support have made it a common choice for circuit-analysis research.

Transformer Circuits Research

Nanda is a co-author of "A Mathematical Framework for Transformer Circuits" (2021), published by the Anthropic interpretability team.12 The paper lists Nelson Elhage, Neel Nanda, and Catherine Olsson as core research contributors, with additional contributors including Tom Henighan, Nicholas Joseph, Ben Mann, and others; Chris Olah handled correspondence.13 Nanda has described it as "the coolest paper I've ever had the privilege of working on" and "the most awesome paper I've been involved in."14

The research:

  • Introduced a mathematically equivalent framework for understanding transformer operations
  • Identified "induction heads" as circuits that enable in-context learning
  • Demonstrated that attention mechanisms compose to perform multi-step reasoning
  • Showed that induction heads only develop in models with at least two attention layers
  • Provided mathematical descriptions of how models track positional and semantic information

The paper built on earlier work from Anthropic's interpretability team examining circuits in vision models.
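
The input-output behavior of an induction head, given a sequence [A][B] … [A], predict [B], can be sketched as a simple rule over tokens. This is a conceptual illustration of what the circuit computes, not a model of real attention weights:

```python
def induction_prediction(tokens):
    """Predict the next token the way an induction head does: find the most
    recent earlier occurrence of the current token and copy the token that
    followed it. Returns None if the current token has not appeared before."""
    current = tokens[-1]
    # Scan backwards over earlier positions for a previous occurrence.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current and i + 1 < len(tokens):
            return tokens[i + 1]
    return None

# "... Mr D urs ley ... Mr D" -> an induction head predicts "urs"
print(induction_prediction(["Mr", "D", "urs", "ley", "said", "Mr", "D"]))
```

In a real transformer this rule requires composition across two attention layers (a "previous token" head feeding an induction head), which is why the paper found induction heads only in models with at least two attention layers.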

Grokking and Modular Arithmetic

Nanda co-authored "Progress Measures for Grokking via Mechanistic Interpretability" (January 2023, with Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt), which received an ICLR 2023 Spotlight designation.15 The paper investigates the "grokking" phenomenon—a delayed generalization observed in small transformers trained on modular arithmetic—and fully reverse-engineers the algorithm learned, using discrete Fourier transforms and trigonometric identities.15
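
The reverse-engineered algorithm can be reproduced directly: the grokked network computes (a + b) mod p by representing a and b as cosine and sine waves at a handful of frequencies, combining them via angle-addition identities, and scoring each candidate answer c by how close w(a + b − c) is to a multiple of 2π. The sketch below is an illustration of the algorithm as the paper describes it, not the trained network itself; the frequency set is arbitrary (the real model selects a few frequencies during training):

```python
import numpy as np

def modular_add_fourier(a, b, p, freqs=(1, 2, 3)):
    """Compute (a + b) mod p the way the grokked transformer does: build
    cos/sin waves for a and b, combine them with trig identities, and score
    each candidate c by sum over frequencies of cos(w * 2*pi * (a+b-c) / p).
    The true answer scores highest (cosine = 1 at every frequency)."""
    ws = 2 * np.pi * np.array(freqs) / p
    # Angle-addition identities give cos(w(a+b)) and sin(w(a+b))
    # from the per-input waves, without ever forming a+b directly.
    cos_ab = np.cos(ws * a) * np.cos(ws * b) - np.sin(ws * a) * np.sin(ws * b)
    sin_ab = np.sin(ws * a) * np.cos(ws * b) + np.cos(ws * a) * np.sin(ws * b)
    c = np.arange(p)
    # cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc)
    scores = (cos_ab[:, None] * np.cos(ws[:, None] * c)
              + sin_ab[:, None] * np.sin(ws[:, None] * c)).sum(axis=0)
    return int(np.argmax(scores))

print(modular_add_fourier(45, 92, 113))  # 24, i.e. (45 + 92) % 113
```

This is the sense in which the paper "fully reverse-engineers" the model: the learned weights implement exactly this Fourier-and-trig computation.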

Additional Research Areas

Nanda has published work on:

  • Indirect Object Identification: The paper "Interpretability in the Wild" (from Redwood Research, with Kevin Wang, Arthur Conmy, and Alexandre Variengien) reverse-engineers a 26-head circuit in GPT-2 Small for identifying indirect objects in sentences (e.g., predicting "Mary" in "After John and Mary went to the shops, John gave a bottle of milk to ___"). Nanda hosted a multi-part walkthrough of this paper with its authors and used it extensively as a case study in his research agenda posts.16
  • Sparse Autoencoders (SAEs): Nanda's DeepMind team has produced a series of SAE papers including Gated SAEs (NeurIPS 2024, led by Sen Rajamanoharan)17, JumpReLU SAEs17, and SAEBench (ICML 2025)18, as well as "Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control" (ICLR 2025)18.
  • Gemma Scope: A collection of over 400 freely available open SAEs for Gemma 2 9B and Gemma 2 2B, authored with Tom Lieberum, Sen Rajamanoharan, Arthur Conmy, Lewis Smith, Nic Sonnerat, Vikrant Varma, Janos Kramar, and others.3 The goal was to enable the broader safety research community to explore model internals using open-source models and tools.
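
The basic recipe these SAE papers build on can be sketched in a few lines: encode an activation vector into a wider (overcomplete) latent, apply a sparsifying nonlinearity, and reconstruct, training against reconstruction error plus a sparsity penalty. The sketch below uses a plain ReLU with an L1 penalty; Gated and JumpReLU SAEs replace the nonlinearity/penalty step. It is illustrative only, with arbitrary dimensions and random weights, not DeepMind's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # latent is wider than the activation (overcomplete)

W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """One SAE forward pass: sparse latent f, reconstruction x_hat, and the
    training loss = reconstruction error + L1 sparsity penalty on f."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps few latents active
    x_hat = f @ W_dec + b_dec
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()
    return f, x_hat, loss

x = rng.normal(size=d_model)
f, x_hat, loss = sae_forward(x)
print(f.shape, x_hat.shape)  # (32,) (8,)
print((f > 0).mean())        # fraction of active latents
```

After training on real model activations, the hope is that individual latent dimensions correspond to human-interpretable features; benchmarks like SAEBench evaluate how well that holds.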

Publication Record

As of 2025, Nanda's Google Scholar profile lists over 13,000 total citations and identifies him as Mechanistic Interpretability Team Lead at Google DeepMind.4 His Semantic Scholar profile lists 62 publications, an h-index of 31, and 7,759 total citations including 880 highly influential citations.5 Peer-reviewed venues where his work has appeared include:

| Venue | Year | Paper |
| --- | --- | --- |
| ICLR (Spotlight) | 2023 | Progress Measures for Grokking via Mechanistic Interpretability |
| ICML | 2023 | A Toy Model of Universality |
| NeurIPS | 2024 | Gated SAEs (led by Sen Rajamanoharan) |
| ICLR | 2025 | Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control |
| ICLR | 2025 | Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models |
| ICML | 2025 | SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability |
| ICML | 2025 | Learning Multi-Level Features with Matryoshka Sparse Autoencoders |
| ICML | 2025 | Are Sparse Autoencoders Useful? A Case Study in Sparse Probing |

Sources: DBLP18, Semantic Scholar5

Educational Content

Nanda publishes interpretability tutorials and explanations on his blog at neelnanda.io and through video content. His "200 Concrete Open Problems in Mechanistic Interpretability" sequence, published on the Alignment Forum from December 28, 2022 through January 19, 2023, outlines research directions for the field.1920 The sequence covers topics including: Concrete Steps to Get Started, The Case for Analysing Toy Language Models, Looking for Circuits in the Wild, Interpreting Algorithmic Problems, Exploring Polysemanticity and Superposition, Analysing Training Dynamics, Techniques Tooling and Automation, Image Model Interpretability, Interpreting Reinforcement Learning, and Studying Learned Features in Language Models.20 A Manifold Markets prediction market tracking progress on these problems resolved at 14% solved by 2025.21

Nanda has written guides for newcomers to interpretability research, including walkthroughs of TransformerLens usage and explanations of foundational papers. These materials are distributed as blog posts, Jupyter notebooks, and video tutorials.

His materials have been incorporated into the ARENA (Alignment Research Engineer Accelerator) curriculum, a 4–5 week ML bootcamp focused on AI safety. The interpretability chapter of ARENA is largely based on Nanda's open-source material; during ARENA 2.0, 6 of 15 participants in Nanda's SERI MATS cohort rated ARENA content among their top 3 most helpful resources.22 ARENA has continued to run annually (ARENA 7.0 ran January–February 2026), citing the work of Nanda and the Google DeepMind Interpretability Team in its interpretability curriculum.23

Through the SERI MATS mentorship program, Nanda has mentored approximately 50 junior researchers over roughly three years, supervised 30+ MATS papers, and co-authored 15 papers published at top ML venues. As of his 2024 MATS applications post, 7 mentees had gone on to work at major AI companies.6

Research Approach

Nanda's work emphasizes:

  • Circuit discovery: Identifying specific subnetworks responsible for model behaviors
  • Mechanistic explanations: Describing algorithms implemented by neural networks in mathematical or computational terms
  • Open-source tooling: Building software infrastructure to enable interpretability experiments
  • Reproducible examples: Providing code and notebooks that others can run and modify

His stated view is that understanding neural network internals is necessary for evaluating AI safety properties, though he acknowledges that current interpretability techniques may not directly transfer to more capable future systems.24 Since approximately mid-2024, Nanda has described a strategic shift: his team's focus has moved from basic science (ambitious reverse-engineering of circuits) toward applying interpretability skills to real-world safety problems such as model monitoring.25 He has stated that the team "does not particularly care" whether it is called the "Mechanistic Interpretability Team"—the priority is applying available skills to the most impactful safety problems.25

Perspectives on Interpretability and Alignment

In posts and presentations, Nanda has outlined reasons why mechanistic interpretability research may contribute to AI alignment, identifying four potential applications:

  1. Failure diagnosis: Identifying the mechanisms behind unexpected model behaviors
  2. Capability evaluation: Determining what tasks models can perform by examining their internal algorithms
  3. Deception detection: Searching for representations that indicate models are optimizing for objectives different from their training signal
  4. Verification: Checking whether specific safety properties hold in a model's learned algorithms

He notes that these applications remain speculative and that the field is in early stages of developing techniques that scale to frontier models.26 Researchers have raised specific questions about whether circuit-level analysis scales to deception detection in practice; a LessWrong critique notes that "when researchers claim to identify a subnetwork performing a specific task, we should be very cautious — it's easy to find circuits in networks that do arbitrary things without doing anything more impressive than performance-guided network compression."27 Nanda engaged with that critique and is thanked in the post for correcting a mistake in an earlier draft.27

Nanda stated in 2025 that there have been "very exciting, widely popularized results from mechanistic interpretability that subsequent research has shown are either wrong or significantly weaker than originally claimed," and that the most ambitious vision of the field is "probably dead"—he no longer sees a path to deeply and reliably understanding what AI systems are thinking before competitive pressures push deployment of superhuman AI.28 He now describes interpretability as functioning like "model biology"—a tool for determining whether a model's behavior warrants concern—and advocates for a "Swiss cheese" approach of layering multiple safeguards rather than relying on any single alignment technique.28

On AI Timelines, Nanda has stated a deliberate policy of not publicly disclosing his p(doom) or specific timeline estimates, citing concern that prominent researchers' views cause others to anchor on them without examining whether interpretability expertise actually translates to timeline expertise.8 He has said he considers AGI arriving in the next 10–20 years, sooner, or later to all be plausible scenarios, and argues that safety work should proceed as though each is realistic.8

Current Work

At Google DeepMind, Nanda holds the title of Senior Research Scientist and Mechanistic Interpretability Team Lead.74 Team outputs during his tenure include Gemma Scope (400+ open-weight SAEs released July 2024)3, Gated SAEs (NeurIPS 2024)17, JumpReLU SAEs17, and SAEBench (ICML 2025)18. Recent papers from the team also include "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful" and "Learning Multi-Level Features with Matryoshka Sparse Autoencoders" (both 2025).18

In 2025, Nanda was named to MIT Technology Review's Innovators Under 35 list for his mechanistic interpretability research and work to build the field.29

Limitations and Open Questions

Nanda has identified challenges for mechanistic interpretability research:30

  • Current techniques are labor-intensive and may not scale to models with hundreds of billions of parameters
  • Many discovered circuits are descriptive rather than predictive, explaining behavior post-hoc without enabling intervention
  • It remains unclear whether interpretability of current models will provide insights applicable to future AI systems with different architectures or training methods

External critiques of the broader research program include the concern that circuits identified through activation-based methods may not correspond to causally meaningful computational units, and that the dominant focus on circuits-style interpretability narrows the field's approach to model verification and deceptive alignment detection.27 Nanda himself has acknowledged that some earlier high-profile results in the field were "wrong or significantly weaker than originally claimed."28

Questions about whether interpretability should be prioritized relative to other AI alignment approaches such as scalable oversight or Constitutional AI remain open; Nanda argues for treating these as complementary rather than competing.28

Footnotes

  1. Neel Nanda, OpenReview Profile. https://openreview.net/profile?id=~Neel_Nanda1

  2. Neel Nanda. "About." neelnanda.io. https://www.neelnanda.io/about

  3. Google DeepMind. "Gemma Scope: Helping the Safety Community Shed Light on the Inner Workings of Language Models." DeepMind Blog, July 2024. https://deepmind.google/discover/blog/gemma-scope-helping-the-safety-community-shed-light-on-the-inner-workings-of-language-models/

  4. Neel Nanda, Google Scholar. https://scholar.google.com/citations?user=GLnX3MkAAAAJ&hl=en

  5. Neel Nanda, Semantic Scholar. https://www.semanticscholar.org/author/Neel-Nanda/2051128902

  6. Neel Nanda. "MATS Applications." neelnanda.io, 2024. https://www.neelnanda.io/blog/mats-apps-9

  7. Neel Nanda, LinkedIn. https://uk.linkedin.com/in/neel-nanda%F0%9F%94%B8-993580151

  8. Rob Wiblin / 80,000 Hours. "Neel Nanda on leading a Google DeepMind team at 26 (Part 2)." 80,000 Hours Podcast, October 21, 2025. https://80000hours.org/podcast/episodes/neel-nanda-career-advice-frontier-ai-companies/

  9. TransformerLensOrg/TransformerLens, GitHub. https://github.com/TransformerLensOrg/TransformerLens

  10. TransformerLensOrg/TransformerLens, GitHub. https://github.com/TransformerLensOrg/TransformerLens

  11. TransformerLens Official Documentation. https://transformerlensorg.github.io/TransformerLens/

  12. Elhage, Nanda, Olsson et al. "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread, December 22, 2021. https://transformer-circuits.pub/2021/framework/index.html

  13. Elhage, Nanda, Olsson et al. "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread, December 22, 2021. https://transformer-circuits.pub/2021/framework/index.html

  14. Neel Nanda. "A Walkthrough of A Mathematical Framework for Transformer Circuits." LessWrong, October 14, 2022. https://www.lesswrong.com/posts/hBtjpY2wAASEpZXgN/a-walkthrough-of-a-mathematical-framework-for-transformer

  15. Nanda, Chan, Lieberum, Smith, Steinhardt. "Progress Measures for Grokking via Mechanistic Interpretability." arXiv:2301.05217, January 2023. https://arxiv.org/abs/2301.05217

  16. Neel Nanda. "A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)." LessWrong, 2022. https://www.lesswrong.com/posts/DZk6mRo9vhCXN9Rfn/a-walkthrough-of-interpretability-in-the-wild-w-authors

  17. MIT Technology Review. "Google DeepMind Has a New Way to Look Inside an AI's 'Mind'." November 14, 2024. https://www.technologyreview.com/2024/11/14/1106871/google-deepmind-has-a-new-way-to-look-inside-an-ais-mind/

  18. Neel Nanda, DBLP. https://dblp.org/pid/285/6389.html

  19. Neel Nanda. "200 Concrete Open Problems in Mechanistic Interpretability." AI Alignment Forum sequence, December 28, 2022 – January 19, 2023. https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj

  20. Neel Nanda. "200 Concrete Open Problems in Mechanistic Interpretability." AI Alignment Forum sequence, December 28, 2022 – January 19, 2023. https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj

  21. Manifold Markets. "By 2025, Percent of 200 Concrete Open Problems Solved?" https://manifold.markets/duck_master/by-2025-percent-of-200-concrete-ope

  22. ARENA team. "ARENA 2.0 - Impact Report." EA Forum, September 26, 2023. https://forum.effectivealtruism.org/posts/C7DbrkCpSe4AdcMek/arena-2-0-impact-report

  23. ARENA team. "ARENA 7.0 - Call for Applicants." EA Forum, September 30, 2025. https://forum.effectivealtruism.org/posts/MDw6QDyrLP3kpehd2/arena-7-0-call-for-applicants

  24. Neel Nanda. "A Pragmatic Vision for Interpretability." LessWrong, 2024. https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability

  25. Neel Nanda. "A Pragmatic Vision for Interpretability." LessWrong, 2024. https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability

  26. Neel Nanda. "200 Concrete Open Problems in Mechanistic Interpretability." AI Alignment Forum sequence. https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj

  27. LessWrong. "EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety." https://www.lesswrong.com/posts/wt7HXaCWzuKQipqz3/eis-vi-critiques-of-mechanistic-interpretability-work-in-ai

  28. Rob Wiblin / 80,000 Hours. "Neel Nanda on the Race to Read AI Minds (Part 1)." 80,000 Hours Podcast, October 14, 2025. https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/

  29. MIT Technology Review. "Neel Nanda." Innovators Under 35, 2025. https://www.technologyreview.com/innovator/neel-nanda/

  30. Neel Nanda, 80,000 Hours Podcast Part 1. "Neel Nanda on the Race to Read AI Minds." October 14, 2025. https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/

References

Neel Nanda's personal research homepage (neelnanda.io) focuses on mechanistic interpretability of neural networks: reverse-engineering how transformers and other models implement algorithms internally. His work includes the mechanistic analysis of grokking, research on superposition in neural networks, and TransformerLens, a key tool for interpretability research.

Structured Data

People

| Property | Value | As Of |
| --- | --- | --- |
| Employed By | Google DeepMind | Jan 2023 |
| Role / Title | Research Scientist, Google DeepMind | Jan 2023 |

Biographical

| Property | Value | As Of |
| --- | --- | --- |
| Notable For | Mechanistic interpretability; TransformerLens library; educational content | Mar 2026 |
| Birth Year | 1999 | |

Career History

| Organization | Title | Start |
| --- | --- | --- |
| Google DeepMind | Research Scientist | 2023-01 |

Related Wiki Pages

Concepts: AI Timelines, Similar Projects
Approaches: Sparse Autoencoders (SAEs), Constitutional AI, Agent Foundations
Other: Connor Leahy, Scalable Oversight, Chris Olah
Organizations: Redwood Research, Anthropic, Goodfire, MATS (ML Alignment Theory Scholars program), Coefficient Giving, Manifund
Analysis: Model Organisms of Misalignment
Policy: AI Whistleblower Protections
Key Debates: Technical AI Safety Research
Safety Research: Anthropic Core Views