Apollo Research — Research Overview
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Apollo Research
Apollo Research is a dedicated AI safety organization; this page indexes its published work and is a useful starting point for tracking its contributions to scheming evaluations, interpretability, and AI governance.
Metadata
Summary
Apollo Research's research page aggregates their publications across evaluations, interpretability, and governance, with a focus on detecting and understanding AI scheming, deceptive alignment, and loss of control risks. Key featured works include a taxonomy and preparedness framework for Loss of Control and a study, conducted in partnership with OpenAI, that stress-tests anti-scheming training methods. The page serves as a central index for their contributions to AI safety science and policy.
Key Points
- Features a Loss of Control taxonomy and preparedness framework for policymakers, covering the degrees and dynamics of LoC scenarios.
- Includes evaluations of frontier models for in-context scheming behavior, including work done in collaboration with OpenAI on anti-scheming training.
- Interpretability research covers sparse dictionary learning, linear probes for detecting strategic deception, and mechanistic description methods (a toy probe sketch follows this list).
- Governance work addresses EU AI Act compliance, national security AI assurance, and frameworks for AI incident reporting regimes.
- Research spans both technical safety (evaluations, interpretability) and policy-facing outputs, making Apollo a cross-domain AI safety lab.
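As a rough illustration of the linear-probe technique referenced above, the sketch below fits a logistic-regression probe on frozen model activations to separate honest from deceptive outputs. Everything here (the activation arrays, dimensions, and labels) is a hypothetical placeholder rather than Apollo Research's actual data or code; their paper "Detecting Strategic Deception Using Linear Probes" describes the real method.

```python
# Toy sketch of a linear probe for deception detection, assuming we already
# have one activation vector per model response (e.g. a mean residual-stream
# activation at some layer). All data below is synthetic placeholder material.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512  # hypothetical hidden dimension

# Synthetic stand-ins for activations gathered under honest vs. deceptive
# conditions; the small mean shift plays the role of a real deception signal.
acts_honest = rng.normal(0.0, 1.0, size=(200, d_model))
acts_deceptive = rng.normal(0.3, 1.0, size=(200, d_model))

X = np.vstack([acts_honest, acts_deceptive])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 1 = deceptive

# A linear probe is just a linear classifier trained on frozen activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# At detection time the probe emits a scalar deception score per response;
# thresholding that score yields a simple monitor.
scores = probe.predict_proba(X)[:, 1]
print(f"mean score on deceptive half: {scores[200:].mean():.2f}")
print(f"train accuracy: {probe.score(X, y):.2f}")
```

In practice the hard questions are where to read activations from, how labels are obtained, and how well the probe generalizes off-distribution; the toy example only shows the mechanics.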
Cited by 9 pages
| Page | Type | Quality |
|---|---|---|
| Apollo Research | Organization | 58.0 |
| Alignment Evaluations | Approach | 65.0 |
| AI Evaluation | Approach | 72.0 |
| AI Safety Cases | Approach | 91.0 |
| Scheming & Deception Detection | Approach | 91.0 |
| Sleeper Agent Detection | Approach | 66.0 |
| Technical AI Safety Research | Crux | 66.0 |
| Mesa-Optimization | Risk | 63.0 |
| Scheming | Risk | 74.0 |
Cached Content Preview
# Research
Featured

- Governance
### The Loss of Control Playbook: Degrees, Dynamics, and Preparedness
Despite increasing policy and research attention to Loss of Control (LoC), decision- and policymakers are still operating in the absence of a uniform conceptualization and definition of LoC. Today, we bridge this gap through a novel taxonomy and preparedness framework for LoC that explores the degrees and dynamics of LoC through a comprehensive best-in-class literature review and presents actionable tools to counter relevant threats to national security and humanity.

- Evaluations
### Stress Testing Deliberative Alignment for Anti-Scheming Training
We partnered with OpenAI to assess frontier language models for early signs of scheming — covertly pursuing misaligned goals — in controlled stress-tests (non-typical environments), and studied a training method that can significantly reduce (but not eliminate) these behaviors. Our results are complicated by models’ increasing ability to recognize our evaluation environments as tests of their alignment.
Our Research
- Governance
#### Internal Deployment of AI Models and Systems in the EU AI Act

- Governance
#### Assurance of Frontier AI Built for National Security

- Governance
#### AI Behind Closed Doors: a Primer on The Governance of Internal Deployment

- Governance
#### Capturing and Countering Threats to National Security: a Blueprint for an Agile AI Incident Regime

- Interpretability
#### Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

- Interpretability
#### Detecting Strategic Deception Using Linear Probes

- Governance
#### Precursory Capabilities: A Refinement to Pre-deployment Information Sharing and Tripwire Capabilities

- Evaluations
#### Frontier Models are Capable of In-Context Scheming

- Evaluations
#### Towards Safety Cases For AI Scheming

- Interpretability
#### Identifying
... (truncated, 4 KB total)