Longterm Wiki

Neel Nanda on the race to read AI minds (part 1) | 80,000 Hours

web

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: 80,000 Hours

Metadata

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Mechanistic Interpretability | Research Area | 59.0 |

Cached Content Preview

HTTP 200 | Fetched May 17, 2026 | 98 KB

## On this page:

- [1 Introduction](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#top)
  - [1.1 The episode in a nutshell](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#summary)
- [2 Highlights](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#highlights)
- [3 Articles, books, and other media discussed in the show](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#articles-books-and-other-media-discussed-in-the-show)
- [4 Transcript](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#transcript)
  - [4.1 Cold open \[00:00:00\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#cold-open-000000)
  - [4.2 Who's Neel Nanda? \[00:01:02\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#whos-neel-nanda-000102)
  - [4.3 How would mechanistic interpretability help with AGI \[00:01:59\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#how-would-mechanistic-interpretability-help-with-agi-000159)
  - [4.4 What's mech interp? \[00:05:09\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#whats-mech-interp-000509)
  - [4.5 How Neel changed his take on mech interp \[00:09:47\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#how-neel-changed-his-take-on-mech-interp-000947)
  - [4.6 Top successes in interpretability \[00:15:53\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#top-successes-in-interpretability-001553)
  - [4.7 Probes can cheaply detect harmful intentions in AIs \[00:20:06\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#probes-can-cheaply-detect-harmful-intentions-in-ais-002006)
  - [4.8 In some ways we understand AIs better than human minds \[00:26:49\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#in-some-ways-we-understand-ais-better-than-human-minds-002649)
  - [4.9 Mech interp won't solve all our AI alignment problems \[00:29:21\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#mech-interp-wont-solve-all-our-ai-alignment-problems-002921)
  - [4.10 Why mech interp is the 'biology' of neural networks \[00:38:07\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#why-mech-interp-is-the-biology-of-neural-networks-003807)
  - [4.11 Interpretability can't reliably find deceptive AI — nothing can \[00:40:28\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#interpretability-cant-reliably-find-deceptive-ai-nothing-can-004028)
  - [4.12 'Black box' interpretability: reading the chain of thought \[00:49:39\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#black-box-interpretability-reading-the-ch

... (truncated, 98 KB total)
Resource ID: 8fe85c0841f301a8 | Stable ID: sid_fNF0adi83g