Neel Nanda on the race to read AI minds (part 1) | 80,000 Hours
web
Credibility Rating: 3/5
Good (3): Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: 80,000 Hours
Metadata
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Mechanistic Interpretability | Research Area | 59.0 |
Cached Content Preview
HTTP 200 | Fetched May 17, 2026 | 98 KB
## On this page:
- [1 Introduction](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#top)
- [1.1 The episode in a nutshell](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#summary)
- [2 Highlights](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#highlights)
- [3 Articles, books, and other media discussed in the show](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#articles-books-and-other-media-discussed-in-the-show)
- [4 Transcript](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#transcript)
- [4.1 Cold open \[00:00:00\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#cold-open-000000)
- [4.2 Who's Neel Nanda? \[00:01:02\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#whos-neel-nanda-000102)
- [4.3 How would mechanistic interpretability help with AGI \[00:01:59\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#how-would-mechanistic-interpretability-help-with-agi-000159)
- [4.4 What's mech interp? \[00:05:09\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#whats-mech-interp-000509)
- [4.5 How Neel changed his take on mech interp \[00:09:47\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#how-neel-changed-his-take-on-mech-interp-000947)
- [4.6 Top successes in interpretability \[00:15:53\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#top-successes-in-interpretability-001553)
- [4.7 Probes can cheaply detect harmful intentions in AIs \[00:20:06\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#probes-can-cheaply-detect-harmful-intentions-in-ais-002006)
- [4.8 In some ways we understand AIs better than human minds \[00:26:49\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#in-some-ways-we-understand-ais-better-than-human-minds-002649)
- [4.9 Mech interp won't solve all our AI alignment problems \[00:29:21\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#mech-interp-wont-solve-all-our-ai-alignment-problems-002921)
- [4.10 Why mech interp is the 'biology' of neural networks \[00:38:07\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#why-mech-interp-is-the-biology-of-neural-networks-003807)
- [4.11 Interpretability can't reliably find deceptive AI — nothing can \[00:40:28\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#interpretability-cant-reliably-find-deceptive-ai-nothing-can-004028)
- [4.12 'Black box' interpretability: reading the chain of thought \[00:49:39\]](https://80000hours.org/podcast/episodes/neel-nanda-mechanistic-interpretability/#black-box-interpretability-reading-the-ch
... (truncated, 98 KB total)
Resource ID: 8fe85c0841f301a8 | Stable ID: sid_fNF0adi83g