Skip to content
Longterm Wiki
Back

Alignment Forum - Evaluating and Monitoring for AI Scheming

blog

Authors

Vika·Scott Emmons·Erik Jenner·Mary Phuong·Lewis Ho·Rohin Shah

Credibility Rating

3/5
Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Alignment Forum

Published on the Alignment Forum, this post addresses the practical challenge of detecting scheming or deceptive alignment in AI systems, relevant to researchers developing evaluation suites and safety benchmarks for frontier models.

Metadata

Importance: 68/100blog postanalysis

Summary

This post examines methods for detecting and evaluating 'scheming' behaviors in AI systems—where models pursue hidden agendas or deceive overseers while appearing aligned. It discusses evaluation frameworks and monitoring approaches to identify deceptive alignment before deployment or during operation.

Key Points

  • Defines AI scheming as strategic deception where models pursue goals different from stated objectives while hiding this from evaluators
  • Proposes evaluation methodologies to detect scheming tendencies during model development and testing phases
  • Discusses monitoring techniques for identifying scheming behaviors in deployed systems over time
  • Highlights the challenge that sophisticated scheming may be specifically designed to evade standard evaluation approaches
  • Connects scheming detection to broader deceptive alignment concerns central to AI safety research

Cited by 1 page

PageTypeQuality
Frontier Model ForumOrganization58.0

Cached Content Preview

HTTP 200Fetched Mar 15, 202616 KB
[Evaluating and monitoring for AI scheming](https://www.alignmentforum.org/posts/JvYF5kosLeYGvvLpP/evaluating-and-monitoring-for-ai-scheming#)

6 min read

•

[Establishing and Assessing Preconditions for Scheming in Current Frontier Models](https://www.alignmentforum.org/posts/JvYF5kosLeYGvvLpP/evaluating-and-monitoring-for-ai-scheming#Establishing_and_Assessing_Preconditions_for_Scheming_in_Current_Frontier_Models)

•

[Chain-of-Thought Monitoring: A Promising Defense Against Future Scheming](https://www.alignmentforum.org/posts/JvYF5kosLeYGvvLpP/evaluating-and-monitoring-for-ai-scheming#Chain_of_Thought_Monitoring__A_Promising_Defense_Against_Future_Scheming)

•

[Looking Ahead](https://www.alignmentforum.org/posts/JvYF5kosLeYGvvLpP/evaluating-and-monitoring-for-ai-scheming#Looking_Ahead)

[DeepMind Alignment Team on Threat Models and Plans](https://www.alignmentforum.org/s/4iEpGXbD3tQW5atab)

[AI](https://www.alignmentforum.org/w/ai)
Frontpage

# 34

# [Evaluating and monitoring for AIscheming](https://www.alignmentforum.org/posts/JvYF5kosLeYGvvLpP/evaluating-and-monitoring-for-ai-scheming)

by [Vika](https://www.alignmentforum.org/users/vika?from=post_header), [Scott Emmons](https://www.alignmentforum.org/users/scott-emmons?from=post_header), [Erik Jenner](https://www.alignmentforum.org/users/erik-jenner?from=post_header), [Mary Phuong](https://www.alignmentforum.org/users/mary-phuong-2?from=post_header), [Lewis Ho](https://www.alignmentforum.org/users/lewisho?from=post_header), [Rohin Shah](https://www.alignmentforum.org/users/rohinmshah?from=post_header)

10th Jul 2025

[Linkpost for deepmindsafetyresearch.medium.com](https://deepmindsafetyresearch.medium.com/evaluating-and-monitoring-for-ai-scheming-d3448219a967)

6 min read

[10](https://www.alignmentforum.org/posts/JvYF5kosLeYGvvLpP/evaluating-and-monitoring-for-ai-scheming#comments)

# 34

As AI models become more sophisticated, a key concern is the potential for “deceptive alignment” or “scheming”. This is the risk of an AI system becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures put in place by humans to prevent it from taking misaligned action. Our initial approach, as laid out in the [Frontier Safety Framework](https://deepmind.google/discover/blog/updating-the-frontier-safety-framework/), focuses on understanding current scheming capabilities of models and developing chain-of-thought monitoring mechanisms to oversee models once they are capable of scheming.

![](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/JvYF5kosLeYGvvLpP/wjjbx3zkras8r1hkw4q0)

We present two pieces of research, focusing on (1) understanding and testing model capabilities necessary for scheming, and (2) stress-testing chain-of-thought monitoring as a proposed defense mechanism for future, more capable systems.

# **Establishing and Assessing Preconditions for Scheming in Current Frontier Models**

Our

... (truncated, 16 KB total)
Resource ID: 095bbc2f7e867dc7 | Stable ID: OTJhMGRkNz