Goodfire and Training on Interpretability
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: LessWrong
Relevant to researchers exploring mechanistic interpretability as a training signal or safety intervention; raises foundational concerns about Goodhart's Law applied to interpretability-guided training pipelines.
Forum Post Details
Summary
This LessWrong post critically examines Goodfire's interpretability-in-the-loop training approach, raising concerns that optimization pressure on interpretability tools causes them to degrade—a known problem called 'The Most Forbidden Technique.' The author argues that decomposing gradients into semantic components may not prevent models from developing harmful behaviors through uninterpreted channels that selection pressure can exploit.
Key Points
- 'The Most Forbidden Technique' refers to the risk that using interpretability methods as training targets causes those methods to degrade as models learn to game them.
- Goodfire's approach decomposes training gradients into semantic components and selectively applies them, attempting to sidestep direct optimization on interpretability outputs (see the sketch after this list).
- Commenters raise concerns that uninterpreted semantic components in gradient decompositions still provide exploitable pathways for selection pressure.
- The post questions whether any current implementation of interpretability-in-the-loop training adequately prevents undetectable harmful behavior development.
- The discussion highlights a fundamental tension: using interpretability tools to guide training may systematically undermine the reliability of those same tools.
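The decomposition mentioned in the second key point can be pictured with a minimal sketch. It is hypothetical throughout: the post does not specify Goodfire's implementation, only that gradients are split into semantic components and selectively applied. The `residual` term makes the commenters' concern concrete, since whatever the semantic dictionary fails to explain still enters the update.

```python
# Toy sketch of interpretability-in-the-loop gradient filtering.
# All names and choices here are hypothetical; the post describes only
# the general idea of decomposing and selectively applying gradients.
import numpy as np

def filter_gradient(grad, semantic_dirs, allowed):
    """Decompose `grad` against orthonormal `semantic_dirs` (one per row),
    keep only the components whose index is in `allowed`, and pass the
    unexplained residual through unchanged."""
    coeffs = semantic_dirs @ grad             # one coefficient per semantic direction
    explained = semantic_dirs.T @ coeffs      # part of the gradient the dictionary labels
    residual = grad - explained               # uninterpreted channel
    kept = sum(coeffs[i] * semantic_dirs[i] for i in allowed)
    # The concern raised in the comments: `residual` (plus any mislabeled
    # component) is still applied, so selection pressure can route unwanted
    # behavior through exactly the parts no one has interpreted.
    return kept + residual

rng = np.random.default_rng(0)
grad = rng.normal(size=16)                    # stand-in for a parameter gradient
q, _ = np.linalg.qr(rng.normal(size=(16, 4))) # 4 orthonormal "semantic" directions
filtered = filter_gradient(grad, q.T, allowed={0, 2})  # drop components 1 and 3
```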
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Goodfire | Organization | 68.0 |
Cached Content Preview
Tags: Interpretability (ML & AI) · Optimization · AI · Personal Blog
[ Question ]
Goodfire and Training on Interpretability
by Satya Benson · 6th Feb 2026 · 1 min read
Goodfire wrote *Intentionally designing the future of AI*, a post about training on interpretability.
This seems like an instance of The Most Forbidden Technique, which has been warned against over and over: optimization pressure on interpretability technique [T] eventually degrades [T].
Goodfire claims they are aware of the associated risks and are managing them.
Are they properly managing those risks? I would love to get your thoughts on this.
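A minimal sketch of the [T]-degrades-[T] dynamic, assuming a hypothetical linear "deception probe" whose reading is used directly as a loss term: gradient descent suppresses the reading itself rather than whatever the probe was tracking.

```python
# Hypothetical illustration; nothing here comes from Goodfire or the post.
import numpy as np

rng = np.random.default_rng(1)
probe = rng.normal(size=8)
probe /= np.linalg.norm(probe)        # fixed unit probe direction: the tool [T]
h = rng.normal(size=8)                # activations standing in for model state

for _ in range(200):
    score = probe @ h                 # probe reading, used here as a training target
    h -= 0.1 * score * probe          # gradient step on loss = 0.5 * score**2

print(abs(probe @ h))                 # ~0: the probe now reports "clean"
# Only the component of `h` along `probe` changed. Everything orthogonal
# to the probe is untouched, so whatever the probe was meant to track can
# persist in directions [T] no longer sees.
```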
5 comments, sorted by top scoring

Steven Byrnes: EDITED TO ADD: I just took the non-Goodfire-specific part of this comment, and spun it out and expanded it into a new post: In (highly contingent!) defense of interpretability-in-the-loop ML training
~ ~ ~
No opinion on Goodfire, or even on LLMs more broadly, but I generically think that there exist some AI algorithms in which interpretability outputs are connected to a reward function in a way that might be very helpful for safe & beneficial ASI. See, for example, Reward Function Design: a starter pack, sections 1, 4, and 5.
E.g. consider compassion in the human brain. I claim that we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the reward can evidently get triggered by specific activations inside my inscrutable learned world-model. But human compassion is generally pretty robust—at least, it doesn’t go away as you age. It would be cool if we knew how to put something like that into an AGI.
…But yes, it’s also true that there are ways to connect interpretability outputs to a reward function that are bad, and make our problems all worse. I think there’s probably some elegant big-picture theoretical framework that sheds light on how to do interpretability-in-the-loop training in a way that’s good rather than terrible. Developing that theoretical framework would be good. (But probably quite different from what Goodfire is working on … it’s not “doing interpretability stuff” in the conventional sense but rather “thinking about AI algorithmic architectures in the big picture”, kinda more agent-foundations-y if anything.)
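One way to picture the compassion analogy, as a purely hypothetical sketch (Byrnes proposes no implementation): a reward function keyed to a learned belief feature inside the world model, rather than to raw observations.

```python
# Hypothetical sketch of a reward wired to an internal belief feature.
import numpy as np

def innate_reward(world_model_state, welfare_readout):
    """Reward triggered by an internal *belief* feature, not by direct
    observation: the analogue of compassion firing on the belief that a
    distant friend is happy or suffering."""
    believed_friend_welfare = welfare_readout @ world_model_state
    return believed_friend_welfare

rng = np.random.default_rng(2)
state = rng.normal(size=32)            # inscrutable learned world-model state
readout = np.zeros(32)
readout[7] = 1.0                       # hypothetical "friend welfare" feature index
reward = innate_reward(state, readout)
```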
Satya Benson: To elaborate a bit (but you should go read the posts!), the classic Most Forbidden Technique scenario goes like this: you detec
... (truncated, 5 KB total)