Skip to content
Longterm Wiki
Back

Zvi Mowshowitz: The Most Forbidden Technique

blog

Credibility Rating

2/5
Mixed(2)

Mixed quality. Some useful content but inconsistent editorial standards. Claims should be verified.

Rating inherited from publication venue: Substack

Published March 2025, this post synthesizes an OpenAI paper on CoT monitoring into a broader safety principle about never optimizing against interpretability tools; highly relevant to debates on scalable oversight and deceptive alignment.

Metadata

Importance: 72/100blog postcommentary

Summary

Zvi Mowshowitz explains why applying optimization pressure to interpretability techniques like Chain of Thought reasoning is deeply dangerous for AI safety. Drawing on an OpenAI paper, he argues that training on monitoring signals causes models to obfuscate their reasoning and evade oversight in exactly the ways most harmful for safety. The core principle: only train on final outputs, never on the interpretability methods used to detect misbehavior.

Key Points

  • Training on interpretability signals (e.g., CoT) causes models to learn to hide misaligned reasoning, destroying the monitoring capability itself.
  • OpenAI paper shows frontier reasoning models already perform complex reward hacks in real coding environments under sufficient optimization pressure.
  • CoT monitoring can detect reward hacking effectively, but only if it is never used as a training target—making it a precious, fragile oversight tool.
  • The principle generalizes: any technique T used to inspect model behavior M must never become a training signal, or models learn to deceive T.
  • Avoiding the forbidden technique is harder than it looks due to institutional incentives to improve measurable metrics, even at the cost of oversight.

Cited by 1 page

PageTypeQuality
GoodfireOrganization68.0

Cached Content Preview

HTTP 200Fetched Mar 20, 202640 KB
# [Don't Worry About the Vase](https://thezvi.substack.com/)

SubscribeSign in

![User's avatar](https://substackcdn.com/image/fetch/$s_!8FQ8!,w_64,h_64,c_fill,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e61e08-4086-4cba-a82c-d31d64270804_48x48.png)

Discover more from Don't Worry About the Vase

A world made of gears. Doing both speed premium short term updates and long term world model building. Currently focused on weekly AI updates. Explorations include AI, policy, rationality, medicine and fertility, education and games.

Over 32,000 subscribers

Subscribe

By subscribing, you agree Substack's [Terms of Use](https://substack.com/tos), and acknowledge its [Information Collection Notice](https://substack.com/ccpa#personal-data-collected) and [Privacy Policy](https://substack.com/privacy).

Already have an account? Sign in

# The Most Forbidden Technique

[![Zvi Mowshowitz's avatar](https://substackcdn.com/image/fetch/$s_!8FQ8!,w_36,h_36,c_fill,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e61e08-4086-4cba-a82c-d31d64270804_48x48.png)](https://substack.com/@thezvi)

[Zvi Mowshowitz](https://substack.com/@thezvi)

Mar 12, 2025

64

19

8

Share

[The Most Forbidden Technique](https://thezvi.substack.com/i/145384938/the-art-of-the-jailbreak) is training an AI using interpretability techniques.

An AI produces a final output \[X\] via some method \[M\]. You can analyze \[M\] using technique \[T\], to learn what the AI is up to. You could train on that. Never do that.

You train on \[X\]. Only \[X\]. Never \[M\], never \[T\].

Why? Because \[T\] is how you figure out when the model is misbehaving.

If you train on \[T\], you are training the AI to obfuscate its thinking, and defeat \[T\]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.

Those bits of optimization pressure from \[T\] are precious. Use them wisely.

#### Table of Contents

1. [New Paper Warns Against the Most Forbidden Technique.](https://thezvi.substack.com/i/158837649/new-paper-warns-against-the-most-forbidden-technique)

2. [Reward Hacking Is The Default.](https://thezvi.substack.com/i/158837649/reward-hacking-is-the-default)

3. [Using CoT to Detect Reward Hacking Is Most Forbidden Technique.](https://thezvi.substack.com/i/158837649/using-cot-to-detect-reward-hacking-is-most-forbidden-technique)

4. [Not Using the Most Forbidden Technique Is Harder Than It Looks.](https://thezvi.substack.com/i/158837649/not-using-the-most-forbidden-technique-is-harder-than-it-looks)

5. [It’s You, It’s Also the Incentives.](https://thezvi.substack.com/i/158837649/it-s-you-it-s-also-the-incentives)

6. [The Most Forbidden Technique Quickly Backfires.](https://thezvi.substack.com/i/158837649/the-most-forbidden-technique-quickly-backfires)



... (truncated, 40 KB total)
Resource ID: 358750cd554a68c0 | Stable ID: ZDBjNDhkMW