The Urgency of Interpretability - Dario Amodei

web

darioamodei.com·darioamodei.com/post/the-urgency-of-interpretability

Metadata

Cited by 3 pages

Page	Type	Quality
Safety-Capability Tradeoff Model	Analysis	64.0
Interpretability	Research Area	66.0
Mechanistic Interpretability	Research Area	59.0

Cached Content Preview

HTTP 200 · Fetched May 17, 2026 · 40 KB

# The Urgency of Interpretability

April 2025

In the decade that I have been working on AI, I’ve watched it grow from a tiny academic field to arguably the most important economic and geopolitical issue in the world.  In all that time, perhaps the most important lesson I’ve learned is this: the progress of the underlying technology is inexorable, driven by forces too powerful to stop, but the _way_ in which it happens (the order in which things are built, the applications we choose, and the details of how it is rolled out to society) is eminently possible to change, and it’s possible to have great positive impact by doing so.  We can’t _stop_ the bus, but we can _steer_ it.  In the past I’ve written about the importance of deploying AI in a way that is [positive for the world](https://darioamodei.com/essay/machines-of-loving-grace), and of ensuring that democracies build and wield the technology [before autocracies do](https://darioamodei.com/post/on-deepseek-and-export-controls).  **Over the last few months, I have become increasingly focused on an additional opportunity for steering the bus: the tantalizing possibility, opened up by some recent advances, that we could succeed at _interpretability_ (that is, in understanding the inner workings of AI systems) _before_ models reach an overwhelming level of power.**

People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work.  They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology.  For several years, we (both Anthropic and the field at large) have been trying to solve this problem, to create the analogue of a highly precise and accurate MRI that would fully reveal the inner workings of an AI model.  This goal has often felt very distant, but multiple [recent](https://www.anthropic.com/research/auditing-hidden-objectives) [breakthroughs](https://transformer-circuits.pub/2025/attribution-graphs/biology.html) have convinced me that we are now on the right track and have a real chance of success.

At the same time, the field of AI as a whole is further ahead than our efforts at interpretability, and is itself advancing very quickly.  We therefore must move fast if we want interpretability to mature in time to matter.  This post makes the case for interpretability: what it is, why AI will go better if we have it, and what all of us can do to help it win the race.

## The Dangers of Ignorance

Modern generative AI systems are opaque in a way that fundamentally differs from traditional software.  If an ordinary software program does something—for example, a character in a video game says a line of dialogue, or my food delivery app allows me to tip my driver—it does those things because a human specifically programmed them in.  Generative AI is _not like that at all_.  When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or p

... (truncated, 40 KB total)
