Longterm Wiki

Recent multi-lab research

paper

Authors

Tomek Korbak·Mikita Balesni·Elizabeth Barnes·Yoshua Bengio·Joe Benton·Joseph Bloom·Mark Chen·Alan Cooney·Allan Dafoe·Anca Dragan·Scott Emmons·Owain Evans·David Farhi·Ryan Greenblatt·Dan Hendrycks·Marius Hobbhahn·Evan Hubinger·Geoffrey Irving·Erik Jenner·Daniel Kokotajlo·Victoria Krakovna·Shane Legg·David Lindner·David Luan·Aleksander Mądry·Julian Michael·Neel Nanda·Dave Orr·Jakub Pachocki·Ethan Perez·Mary Phuong·Fabien Roger·Joshua Saxe·Buck Shlegeris·Martín Soto·Eric Steinberger·Jasmine Wang·Wojciech Zaremba·Bowen Baker·Rohin Shah·Vlad Mikulik

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A multi-lab collaborative paper offering practical safety recommendations on chain-of-thought monitoring; relevant to ongoing debates about scalable oversight and the reliability of reasoning transparency as a safety tool in frontier models.

Paper Details

Citations
117
6 influential
Year
2025

Metadata

Importance: 72/100 · arXiv preprint · primary source

Abstract

AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

Summary

This multi-lab paper argues that monitoring AI systems' chain-of-thought reasoning in human language is a promising but imperfect safety mechanism for detecting intent to misbehave. The authors recommend increased research investment in CoT monitorability and urge frontier model developers to consider how development decisions may degrade it, since CoT monitorability may be fragile.

Key Points

  • CoT monitoring leverages the unique property of human-language reasoning to detect AI misbehavior intent before it manifests in actions.
  • Like all known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go undetected.
  • CoT monitorability may be fragile: development choices such as training procedures could inadvertently degrade its effectiveness.
  • The authors recommend CoT monitoring be pursued alongside, not instead of, existing safety methods as a complementary layer.
  • Frontier model developers are specifically urged to evaluate how their decisions impact CoT monitorability as a safety property.
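To make the idea concrete, the monitoring loop the key points describe can be sketched as a check that runs over a model's reasoning trace before its action is executed. This is a hypothetical illustration, not the paper's method: the paper envisions monitors (typically another model) scoring CoT traces for intent to misbehave, whereas the pattern list and function names below (`monitor_cot`, `review_action`, `SUSPICIOUS_PATTERNS`) are invented stand-ins for such a classifier.

```python
# Hypothetical sketch of a chain-of-thought (CoT) monitor: scan a model's
# intermediate reasoning for phrases suggestive of intent to misbehave,
# and gate the resulting action on the monitor's verdict. A real monitor
# would use a learned classifier; a keyword filter stands in here purely
# for illustration.

SUSPICIOUS_PATTERNS = [
    "hide this from",
    "the user won't notice",
    "disable the check",
    "pretend to comply",
]


def monitor_cot(chain_of_thought: str) -> list[str]:
    """Return the suspicious patterns found in a reasoning trace (may be empty)."""
    text = chain_of_thought.lower()
    return [p for p in SUSPICIOUS_PATTERNS if p in text]


def review_action(chain_of_thought: str, action: str) -> str:
    """Escalate to human review if the CoT trips the monitor; otherwise allow."""
    flags = monitor_cot(chain_of_thought)
    if flags:
        return f"escalated: {flags}"
    return f"allowed: {action}"
```

Note that this layer is complementary, as the key points stress: a monitor like this catches only misbehavior that surfaces in the trace, which is exactly why the paper warns that training choices eroding CoT faithfulness would erode the monitor's value.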

Cited by 2 pages

Page                      Type        Quality
Reasoning and Planning    Capability  65.0
Scalable Eval Approaches  Approach    65.0

Cached Content Preview

HTTP 200 · Fetched Mar 8, 2026 · 40 KB
[2507.11473] Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak∗, UK AI Security Institute
Mikita Balesni∗, Apollo Research
Elizabeth Barnes, METR
Yoshua Bengio, University of Montreal & Mila
Joe Benton, Anthropic
Joseph Bloom, UK AI Security Institute
Mark Chen, OpenAI
Alan Cooney, UK AI Security Institute
Allan Dafoe, Google DeepMind
Anca Dragan, Google DeepMind
Scott Emmons, Google DeepMind
Owain Evans, Truthful AI & UC Berkeley
David Farhi, OpenAI
Ryan Greenblatt, Redwood Research
Dan Hendrycks, Center for AI Safety
Marius Hobbhahn, Apollo Research
Evan Hubinger, Anthropic
Geoffrey Irving, UK AI Security Institute
Erik Jenner, Google DeepMind
Daniel Kokotajlo, AI Futures Project
Victoria Krakovna, Google DeepMind
Shane Legg, Google DeepMind
David Lindner, Google DeepMind
David Luan, Amazon
Aleksander Mądry, OpenAI
Julian Michael, Scale AI
Neel Nanda, Google DeepMind
Dave Orr, Google DeepMind
Jakub Pachocki, OpenAI
Ethan Perez, Anthropic
Mary Phuong, Google DeepMind
Fabien Roger, Anthropic
Joshua Saxe, Meta
Buck Shlegeris, Redwood Research
Martín Soto, UK AI Security Institute
Eric Steinberger, Magic
Jasmine Wang, UK AI Security Institute
Wojciech Zaremba, OpenAI
Bowen Baker†, OpenAI
Rohin Shah†, Google DeepMind
Vlad Mikulik†, Anthropic

 
 Abstract

 AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

 
∗ Equal first authors. † Equal senior authors. Correspondence: tomek.korbak@dsit.gov.uk and mikita@apolloresearch.ai. The paper represents the views of the individual authors and not necessarily of their affiliated institutions.

Expert endorsers:

Samuel R. Bowman, Anthropic
John Schulman, Thinking Machines
Geoffrey Hinton, University of Toronto
Ilya Sutskever, Safe Superintelligence Inc
 1 Chain of Thought Offers a Unique Safety Opportunity

 
 The opacity of advanced AI agents underlies many of their potential risks—risks that would become more tractable if AI developers could interpret these systems. Because LLMs natively process and act through human language, one might hope they are easier to understand than other approaches to AI. The discovery of Ch

... (truncated, 40 KB total)
Resource ID: e2a66d86361bb628 | Stable ID: MWU1MzlkZD