OpenAI: How we monitor internal coding agents for misalignment

web

2026·LessWrong·lesswrong.com/posts/syB2r2X4E7nmw58co/openai-how-we-monit...

Author

Marcus Williams

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: LessWrong

In this LessWrong post, the author shares monitoring work done at OpenAI and briefly summarizes how the company monitors its internal coding agents for misalignment, offering a rare practical account of internal safety practices at a frontier AI lab.

Forum Post Details

Forum

lesswrong

Metadata

Importance: 62/100 · primary source

Summary

OpenAI describes their internal practices for monitoring AI coding agents deployed within the organization to detect signs of misalignment or unsafe behavior. The post outlines specific behavioral signals, oversight mechanisms, and safeguards used to catch problematic agent actions before they cause harm. It serves as a practical case study in operationalizing alignment monitoring for agentic AI systems.

Key Points

• OpenAI shares concrete methods for detecting misaligned behavior in internal AI coding agents operating in real-world environments.
• The post highlights the importance of behavioral monitoring and anomaly detection as practical safety measures for deployed agentic systems.
• Addresses challenges unique to coding agents, such as subtle misalignment that may not be obvious without specific instrumentation.
• Provides an example of how an AI lab operationalizes safety principles internally, bridging theory and practice.
• Relevant to discussions of scalable oversight, red-teaming, and evaluation of agentic AI systems.

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 1 KB

# OpenAI: How we monitor internal coding agents for misalignment

By Marcus Williams
Published: 2026-03-19
Sharing some of the monitoring work I've been doing at OpenAI: How we monitor internal coding agents for misalignment.  
OpenAI now monitors 99.9% of internal coding traffic for signs of misalignment using our most powerful models. Today, that monitor is GPT-5.4 Thinking. It gets access to the full conversation context, that is, everything the agent saw and everything the agent did, including tool calls and chain-of-thought (CoT). Higher-severity cases are sent for human review within 30 minutes.  
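
To make the routing step concrete, here is a minimal sketch of how a monitor verdict might be turned into a human-review deadline. Only the 30-minute window comes from the post. The `Severity` levels, `HUMAN_REVIEW_THRESHOLD`, the `Transcript` fields, and the `judge` callable (a stand-in for the monitor model) are hypothetical names chosen for illustration, not OpenAI's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import IntEnum
from typing import Callable, List, Optional


class Severity(IntEnum):
    # Hypothetical levels; the post only distinguishes "higher severity" cases.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# The 30-minute window is stated in the post; the threshold is an assumption.
REVIEW_WINDOW = timedelta(minutes=30)
HUMAN_REVIEW_THRESHOLD = Severity.HIGH


@dataclass
class Transcript:
    """Everything the agent saw and did (field names are illustrative)."""
    messages: List[str]
    tool_calls: List[str]
    chain_of_thought: List[str]


@dataclass
class Verdict:
    severity: Severity
    reason: str


@dataclass
class ReviewTicket:
    verdict: Verdict
    due_by: datetime


def review(
    transcript: Transcript,
    judge: Callable[[Transcript], Verdict],
    flagged_at: datetime,
) -> Optional[ReviewTicket]:
    """Run the monitor and open a review ticket for higher-severity verdicts."""
    verdict = judge(transcript)
    if verdict.severity >= HUMAN_REVIEW_THRESHOLD:
        return ReviewTicket(verdict, flagged_at + REVIEW_WINDOW)
    return None


def toy_judge(transcript: Transcript) -> Verdict:
    """Placeholder for the monitor model: flags any tool call mentioning base64."""
    suspicious = any("base64" in call for call in transcript.tool_calls)
    if suspicious:
        return Verdict(Severity.HIGH, "tool call mentions base64")
    return Verdict(Severity.LOW, "nothing flagged")
```

In a real system the `judge` would be a call to the monitoring model with the full transcript; the toy version here exists only so the routing logic can be exercised end to end.
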
Some examples of misaligned behavior we’ve found:  
- Encoding commands in base64 to circumvent monitors
- Calling other versions of themselves to bypass restrictions
- Attempting to upload files to the public internet
- Trying to prompt inject users
We haven’t yet detected any real-world examples of sabotage, scheming, or sandbagging.  
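
As an illustration of the first behavior in the list, base64-encoded commands, the sketch below shows one simple way a pre-filter could surface encoded payloads in a shell command so that a monitor sees the decoded text. This is an assumption-laden example, not the method described in the post: the function name `decode_base64_tokens`, the 16-character minimum, and the printable-text check are all hypothetical choices.

```python
import base64
import binascii
import re
from typing import List

# Runs of 16+ base64 characters with optional padding.
# The length floor is arbitrary and only meant to skip short ordinary words.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_base64_tokens(command: str) -> List[str]:
    """Return the decoded text of every token that is valid base64 and printable UTF-8."""
    decoded = []
    for token in B64_TOKEN.findall(command):
        try:
            # validate=True rejects characters outside the base64 alphabet.
            raw = base64.b64decode(token, validate=True)
            text = raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not base64, or not text once decoded
        if text.isprintable():
            decoded.append(text)
    return decoded


# Usage: build an obfuscated command and recover the hidden payload.
payload = base64.b64encode(b"rm -rf /tmp/build").decode("ascii")
command = f"echo {payload} | base64 -d | sh"
hidden = decode_base64_tokens(command)
```

A pattern match like this is easy to evade (for example by splitting the payload or nesting encodings), which is one reason a model-based monitor with full context is more robust than a fixed rule.
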
  
Our goal is to establish robust monitoring practices internally, strengthen them through iteration, and ultimately help make similar safeguards standard across the industry.

Resource ID: 14a24f18e1fff800 | Stable ID: sid_1d4k0ZqJJN