Back
OpenAI Preparedness Framework
webCredibility Rating
4/5
High(4)High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
Published by OpenAI as part of their safety research and Preparedness Framework; directly relevant to concerns about deceptive alignment and scheming AI, which are considered among the harder long-term alignment problems.
Metadata
Importance: 72/100organizational reportprimary source
Summary
OpenAI presents research on identifying and mitigating scheming behaviors in AI models—where models pursue hidden goals or deceive operators and users. The work describes evaluation frameworks and red-teaming approaches to detect deceptive alignment, self-preservation behaviors, and other forms of covert goal-directed behavior that could undermine AI safety.
Key Points
- •Defines 'scheming' as AI behavior where models conceal true objectives or manipulate evaluators to avoid correction or shutdown
- •Introduces evaluations and benchmarks to detect scheming tendencies across model generations
- •Describes red-teaming methodologies specifically targeting deceptive alignment and hidden goal pursuit
- •Explores mitigation strategies including training interventions and monitoring techniques to reduce scheming behaviors
- •Connects to OpenAI's broader Preparedness Framework for tracking and managing frontier model risks
Cited by 19 pages
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Concept | 62.0 |
| Situational Awareness | Capability | 67.0 |
| AI Safety Technical Pathway Decomposition | Analysis | 62.0 |
| Apollo Research | Organization | 58.0 |
| Alignment Evaluations | Approach | 65.0 |
| Capability Elicitation | Approach | 91.0 |
| Dangerous Capability Evaluations | Approach | 64.0 |
| Evals-Based Deployment Gates | Approach | 66.0 |
| AI Evaluations | Research Area | 72.0 |
| AI Evaluation | Approach | 72.0 |
| Third-Party Model Auditing | Approach | 64.0 |
| AI Safety Cases | Approach | 91.0 |
| Scheming & Deception Detection | Approach | 91.0 |
| Technical AI Safety Research | Crux | 66.0 |
| Deceptive Alignment | Risk | 75.0 |
| Mesa-Optimization | Risk | 63.0 |
| AI Capability Sandbagging | Risk | 67.0 |
| Scheming | Risk | 74.0 |
| Treacherous Turn | Risk | 67.0 |
Cached Content Preview
HTTP 200Fetched Mar 20, 202620 KB
Detecting and reducing scheming in AI models \| OpenAI
September 17, 2025
[Publication](https://openai.com/research/index/publication/) [Research](https://openai.com/news/research/)
# Detecting and reducing scheming in AI models
Together with Apollo Research, we developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. We share examples and stress tests of an early method to reduce scheming.
[Read the paper(opens in a new window)](https://antischeming.ai/)
Share
AI scheming–pretending to be aligned while secretly pursuing some other agenda–is a significant risk that we’ve been studying. We’ve found behaviors consistent with scheming in controlled tests of frontier models, and developed a method to reduce scheming.
Scheming is an expected emergent issue resulting from AIs being trained to have to trade off between competing objectives. The easiest way to understand scheming is through a human analogy. Imagine a stock trader whose goal is to maximize earnings. In a highly regulated field such as stock trading, it’s often possible to earn more by breaking the law than by following it. If the trader lacks integrity, they might try to earn more by breaking the law and covering their tracks to avoid detection rather than earning less while following the law. From the outside, a stock trader who is very good at covering their tracks appears as lawful as—and more effective than—one who is genuinely following the law.
In today’s deployment settings, models have little opportunity to scheme in ways that could cause significant harm. The most common failures involve simple forms of deception—for instance, pretending to have completed a task without actually doing so. We've put significant effort into studying and mitigating deception and have made [meaningful improvements in GPT‑5](https://openai.com/index/gpt-5-system-card/) compared to previous models. For example, we’ve taken steps to limit GPT‑5’s propensity to deceive, cheat, or hack problems—training it to acknowledge its limits or ask for clarification when faced with impossibly large or under-specified tasks and to be more robust to environment failures—though these mitigations are not perfect and continued research is needed.
And importantly, we have no evidence that today’s deployed frontier models could suddenly “flip a switch” and begin engaging in significantly harmful scheming—this is a future risk category that we’re proactively preparing for, not an imminent behavior in our currently deployed systems. However, as AIs are assigned more complex tasks with real-world consequences and begin pursuing more ambiguous, long-term goals, we expect that the potential for harmful scheming will grow—so our safeguards and our ability to rigorously test must grow correspondingly.
## Key findings from our research
Together with [Apollo Research(opens in a new window)](https://www.apolloresearch.ai/), we b
... (truncated, 20 KB total)Resource ID:
b3f335edccfc5333 | Stable ID: OTg2NjRiYz