Longterm Wiki

Alignment Forum: It Is Reasonable To Research How To Use Model Internals In Training

blog

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Alignment Forum

Written by Neel Nanda (DeepMind interpretability researcher), this post addresses internal alignment community debates about which research directions are safe to pursue, specifically defending interpretability-informed training methods as a legitimate area of study.

Metadata

Importance: 58/100 · blog post · commentary

Summary

Neel Nanda defends the legitimacy of research into using model internals (interpretability techniques) during training as a valuable AI safety direction, pushing back against community sentiment that treats it as a 'forbidden technique.' He argues the research is necessary to evaluate feasibility and safety implications, while noting it remains practically distant from frontier model deployment.

Key Points

  • Research on model internals during training covers a spectrum: probe scores in loss functions, activation steering during training, and related interpretability-informed training methods (a minimal sketch of the first follows this list).
  • Nanda disputes the framing that this is a 'forbidden technique,' arguing the community needs to study it to understand risks and benefits.
  • The techniques are currently impractical for frontier training stacks, making this a longer-term research direction rather than an immediate deployment risk.
  • Uncertainty about whether the net effect on safety is positive or negative does not justify avoiding the research; it justifies doing the research to find out.
  • The post reflects broader debate in the alignment community about which research directions are safe or beneficial to pursue.
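
To make the first point concrete, here is a minimal PyTorch-style sketch of what "probe scores in loss functions" can mean: a frozen linear probe reads an intermediate activation, and its score is added to the task loss as a penalty. Everything here (`model`, `probe`, `aux_weight`, the layer choice) is an illustrative assumption, not taken from the post or any specific codebase.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: penalise activations that a frozen probe flags.
# Assumes `model(input_ids)` returns (logits, hidden), where `hidden` is an
# intermediate-layer activation, and `probe` is a linear classifier trained
# beforehand to detect an unwanted property (e.g. deception) in that layer.

def training_step(model, probe, batch, aux_weight=0.1):
    logits, hidden = model(batch["input_ids"])
    task_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        batch["labels"].view(-1),
    )
    # Probe score: mean probability the frozen probe assigns to the
    # unwanted property. Adding it to the loss pushes training away
    # from activations the probe flags.
    probe_score = torch.sigmoid(probe(hidden)).mean()
    loss = task_loss + aux_weight * probe_score
    loss.backward()
    return loss.detach()
```

This setup is also exactly what the "most forbidden technique" worry targets: the optimizer may learn to fool the probe rather than remove the underlying behavior, which is why the post argues the feasibility and safety of such methods should be studied rather than assumed.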

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Goodfire | Organization | 68.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 26 KB
Contents:

  • [What do I actually mean by using interp in training?](https://www.alignmentforum.org/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in#What_do_I_actually_mean_by_using_interp_in_training_)
  • [Why could this research help AGI safety?](https://www.alignmentforum.org/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in#Why_could_this_research_help_AGI_safety_)
  • [Why might you be worried about research here?](https://www.alignmentforum.org/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in#Why_might_you_be_worried_about_research_here_)
  • [This gets kinda nuanced](https://www.alignmentforum.org/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in#This_gets_kinda_nuanced)
  • [Why do I want to see more work here?](https://www.alignmentforum.org/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in#Why_do_I_want_to_see_more_work_here_)

Tags: [AI](https://www.alignmentforum.org/w/ai) · Frontpage


# [It Is Reasonable To Research How To Use Model Internals In Training](https://www.alignmentforum.org/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in)

by [Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1?from=post_header)

7th Feb 2026

5 min read

[15 comments](https://www.alignmentforum.org/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in#comments)


There seems to be a common belief in the AGI safety community that involving interpretability in the training process is “[the most forbidden technique](https://www.alignmentforum.org/posts/mpmsK8KKysgSKDm2T/the-most-forbidden-technique)”, including recent criticism of Goodfire for [investing in this area](https://www.goodfire.ai/blog/intentional-design).

I find this odd since this is a pretty normal area of interpretability research in the AGI safety community. I have [worked](https://arxiv.org/abs/2507.16795) [on it](https://www.alignmentforum.org/posts/R5MdWGKsuvdPwGFBG/steering-rl-training-benchmarking-interventions-against), Anthropic Fellows have [worked on it](https://arxiv.org/abs/2507.21509), FAR has [worked on it](https://arxiv.org/pdf/2505.13787), etc.

I don’t know if it will be net positive to use this kind of thing in frontier model training, but it could plausibly be very helpful for AGI safety, and it seems like a clear mistake to me if we don’t do the required research to figure this out. Further, this seems like a massive pain to use in current frontier model training stacks, so I don’t see any imminent risk/benefit from this kind of work; it’s more of a longer-term bet.

My sense is that most people in this debate have pretty reasonable and nuanced views when yo

... (truncated, 26 KB total)
Resource ID: ecdf62cc36997da2 | Stable ID: N2EyNWZkNG