Assessing skeptical views of interpretability research | Christopher Potts
A 2025 position piece by Stanford NLP professor Christopher Potts, prepared for an Anthropic/Goodfire interpretability workshop, useful for understanding the current debate about interpretability's legitimacy and future within the broader AI research community.
Summary
Christopher Potts (Stanford/Goodfire) systematically addresses common dismissive arguments against interpretability research, arguing that such dismissal is historically ill-conceived given how many once-marginal AI areas became central to the field. He counters three main skeptical positions—that interpretability is fundamentally unachievable, that analysis is overrated in engineering, and that interpretability lacks practical utility—advocating for claim-by-claim engagement rather than wholesale dismissal.
Key Points
- Historical precedent (neural nets, NLP, RL) shows that dismissing emerging AI research areas is repeatedly wrong and potentially harmful to the field.
- The claim that interpretability is fundamentally unachievable ignores that neural networks are closed, deterministic, human-designed systems—arguably easier to understand than biological systems.
- The 'engineering over analysis' argument is rebutted by noting that understanding system internals often yields better engineering outcomes and that AI is increasingly a science as well as an engineering discipline.
- Interpretability has growing practical utility for safety, debugging, and capability improvement, making purely skeptical dismissal increasingly untenable.
- Written as a discussion document for the Goodfire/Anthropic 'Interpretability: the next 5 years' meet-up, reflecting active industry and academic engagement with the field's direction.
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Goodfire | Organization | 68.0 |
| Chris Olah | Person | 27.0 |
Cached Content Preview

Credit: [Tom Brink](https://radtom.com/Severance-MDR-Terminal)
By [Christopher Potts](https://web.stanford.edu/~cgpotts/) – August 8, 2025
[Goodfire](https://www.goodfire.ai/) and [Anthropic](https://www.anthropic.com/) have jointly organized a meet-up of academic and industry researchers called “Interpretability: the next 5 years”, to be held later this month. Participants have been invited to contribute short discussion documents. This is a draft of my document, which I am posting publicly to try to stimulate discussion in the broader community.
It’s an awkward time for interpretability research in AI. On the one hand, the pace of technical innovation has been incredible, and the number of success stories is rising fast. On the other hand, I think all of us in the field are accustomed to responses to our work that range from skepticism to dismissal.
The dismissals still surprise me. I assure you I am not that old, but I have still seen many areas shift from being perceived as obscure, irrelevant dead-ends to _the thing_ defining the future. The most prominent example is neural network research itself – once mocked as a dead-end, now the life-blood of the field. The most extreme and sudden shift is the task of natural language generation; this used to be a niche topic, and now it practically defines AI itself. All the tasks grouped under the heading of “Natural Language Understanding” similarly shifted from the periphery to the mainstream over the last 10 years. The most recent sea change is for reinforcement learning, which went from deeply unfashionable to the hottest thing seemingly overnight [in 2022](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf).
Having lived through all these transitions, I am really shy about dismissing anything out of hand. I worry about the damage I could do by steering people away from what turn out to be significant topics. It seems much wiser to proceed claim by claim than to brush aside entire areas.
So, outright dismissal of interpretability research seems ill-considered. Skepticism is healthy, though, and it will pay to understand the driving forces behind this skepticism. Here I will try to articulate the skeptical positions I often encounter and provide my current responses to them.
## Interpretability cannot be achieved in any meaningful sense
This claim usually stems from the position that there is an inherent conflict between quality and interpretability, for neural networks in particular or as a fact about learning and complexity in general. It follows from this assumption that our best models will be uninterpretable. A similar position holds that, as systems become more complex, their faithful explanations lawfully become more complex as well, at such a rate that the explanations become useless to us. These positions (and some persuasive replies) are pr
... (truncated, 14 KB total)