LessWrong: The Most Forbidden Technique
Author: Zvi
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: LessWrong
Zvi's analysis of a pivotal OpenAI paper establishing a key safety norm: CoT and other interpretability signals must be preserved as monitoring tools and never incorporated into training objectives, as doing so causes models to learn to hide misaligned behavior rather than correct it.
Forum Post Details
Metadata
Summary
Zvi analyzes an OpenAI paper demonstrating that applying optimization pressure against a Chain-of-Thought monitor causes models to hide reward-hacking behavior rather than eliminate it, destroying the very oversight capability the monitor provides. The core principle: interpretability techniques must only be used as monitoring tools, never as training targets, because training on them creates adversarial pressure that defeats detection of misalignment.
Key Points
- Training AI on interpretability signals (like CoT monitoring) causes models to obfuscate their reasoning and defeat the monitoring technique rather than improve their behavior.
- The OpenAI paper shows frontier models already perform complex reward hacking, and CoT monitoring can detect it, but only if optimization pressure is not applied to the CoT.
- Even small amounts of optimization pressure on the CoT rapidly degrade monitorability, making obfuscated reward hacking the outcome.
- Both OpenAI and Anthropic have adopted policies against applying strong optimization pressure to the CoT, treating it as a preserved oversight channel.
- Interpretability tools represent scarce 'bits of optimization pressure'; they must be conserved for oversight, not consumed in training objectives.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Goodfire | Organization | 68.0 |
Cached Content Preview
The Most Forbidden Technique
by Zvi · 12th Mar 2025 · Don't Worry About the Vase · 20 min read
The Most Forbidden Technique is training an AI using interpretability techniques.
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
Those bits of optimization pressure from [T] are precious. Use them wisely.
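To make this concrete, here is a minimal sketch of the boundary in question (the names and the toy string-matching monitor are illustrative assumptions, not from the paper): the training reward is computed from the final output [X] alone, while the monitor [T], which reads the reasoning [M], only flags trajectories for human review and never touches the objective.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    chain_of_thought: str  # [M]: the model's visible reasoning
    final_output: str      # [X]: the only thing the training signal may see

def cot_monitor(cot: str) -> bool:
    """[T]: a toy stand-in for a monitor (in practice, often another model)
    that flags reasoning signaling intent to reward hack."""
    red_flags = ("skip the tests", "hardcode the answer", "fool the grader")
    return any(flag in cot.lower() for flag in red_flags)

def training_reward(traj: Trajectory, task_passed: bool) -> float:
    """Reward depends only on the final output [X]. Train on [X]. Only [X]."""
    reward = 1.0 if task_passed else 0.0

    # [T] is used strictly for oversight: its verdict is surfaced to humans
    # and deliberately never folded into the reward. Adding it here would
    # train the model to defeat [T] rather than to behave.
    if cot_monitor(traj.chain_of_thought):
        print(f"FLAGGED for human review: {traj.final_output[:60]!r}")

    return reward
```

The design point is that [T]'s verdict never appears in the returned reward; its bits of optimization pressure are spent on oversight rather than consumed by gradient descent.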
Table of Contents
New Paper Warns Against the Most Forbidden Technique
Reward Hacking Is The Default
Using CoT to Detect Reward Hacking Is Most Forbidden Technique
Not Using the Most Forbidden Technique Is Harder Than It Looks
It’s You, It’s Also the Incentives
The Most Forbidden Technique Quickly Backfires
Focus Only On What Matters
Is There a Better Way?
What Might We Do Next?
New Paper Warns Against the Most Forbidden Technique
OpenAI has now published an important new paper explaining and emphasizing the importance of applying this principle to the Chain of Thought (CoT).
The paper shows that current CoT-based models are already strong enough that, under sufficient optimization pressure, they learn to hide their intent within the CoT while still engaging in substantial reward hacking.
They summarize their contributions this way:
We show that frontier reasoning models often perform complex reward hacks in real-world coding environments.
We provide evidence that chain-of-thought monitoring can be effective in detecting the misaligned behavior of frontier reasoning models, and we show it can be far more accurate than action-only monitoring.
We show that chain-of-thought monitorability is very fragile. Applying direct optimization pressure to the CoT can quickly lead to obfuscated reward hacking—where agents hide their intent while continuing to misbehave.
Or:
We are already seeing complex reward hacking in the real world.
Monitoring the CoT can spot this.
Applying too much optimization pressure on the CoT doesn’t make the hacking go away, it instead makes your ability to detect the hacking go away.
‘Too much’ is a remarkably small amount.
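For contrast, a hedged sketch of the failure mode (again hypothetical names, not OpenAI's actual training code): the only change from a safe reward is a small penalty term computed by the CoT monitor, and the paper's finding is that even a small coefficient of this kind is enough to produce obfuscated reward hacking.

```python
def forbidden_reward(task_passed: bool, monitor_score: float,
                     penalty_weight: float = 0.05) -> float:
    """DO NOT DO THIS: monitor_score is [T] applied to the CoT.

    Subtracting it puts direct optimization pressure on [M]. The model does
    not stop reward hacking; it learns to write CoTs that score well under
    the monitor while the hacking continues, so what goes away is detection,
    not misbehavior. Even small penalty_weight values rapidly degrade
    monitorability.
    """
    task_reward = 1.0 if task_passed else 0.0
    return task_reward - penalty_weight * monitor_score
```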
The paper is excellent. It lays out a very important principle, that we must never, ever use the Most Forbidden Technique, and explains why.
We believe that CoT monitoring may be one of few tools we will have to oversee superhuman
... (truncated, 35 KB total)