MATS Stream: Black-Box Scheming Monitors – Marius Hobbhahn (Apollo Research)

web

matsprogram.org·matsprogram.org/stream/hobbhahn

This is a MATS (ML Alignment Theory Scholars) program page for Marius Hobbhahn's research stream focused on building black-box monitors to detect AI scheming in agentic settings, relevant to AI safety through technical monitoring and control research.

Metadata

Importance: 52/100 | other | educational

Summary

This MATS program stream, led by Apollo Research CEO Marius Hobbhahn, focuses on developing high-quality black-box monitors to detect scheming behavior in complex agentic AI settings. Building on prior successful work, the stream explores fine-tuning monitors, adversarial training, and generating harder datasets for scheming detection. Scholars are expected to produce a paper suitable for a top ML conference.

Key Points

• Stream focuses on black-box scheming monitors for agentic AI settings, continuing prior successful MATS work.
• Potential research directions include fine-tuning monitors, adversarial training, 'science of monitoring', and harder scheming datasets.
• Mentored by Marius Hobbhahn, CEO of Apollo Research, an organization specializing in evals, scheming, and AI control.
• Scholars are expected to submit a paper to a top ML conference and have access to large compute budgets.
• Key reference papers include 'Frontier Models are Capable of In-context Scheming' and 'Alignment faking in large language models'.

Cached Content Preview

HTTP 200 | Fetched Apr 27, 2026 | 5 KB

Marius Hobbhahn at MATS: Summer 2026 
 

 Marius Hobbhahn

 We will continue working on black-box monitors for scheming in complex agentic settings, building on the success of the previous stream.

 See here for details.

 Stream overview

 The entire stream will be dedicated to building high-quality black-box scheming monitors. 

 In a previous stream, we had great initial success designing black-box monitors for scheming, and we've made substantial progress since then.

 We will continue to work on black-box scheming monitors. Potential directions for this version of the stream include: a) fine-tuning monitors, b) improving monitors through adversarial training, c) more "science of monitoring", and d) harder and more complex scheming datasets to improve monitor training, selection, and testing.

 Mentors

 Marius Hobbhahn, CEO, Apollo Research
 
 
 
 
 
 
 
 
 London | Control | Scheming and Deception | Dangerous Capability Evals | Monitoring

 Marius Hobbhahn is the CEO of Apollo Research, where he also leads the evals team. Apollo is an evals research organization focused on scheming, evals and control. Prior to starting Apollo, he did a PhD in Bayesian ML and worked on AI forecasting at Epoch.

 I want you to grow as a researcher/engineer as fast as possible. Therefore, the goal is always for scholars to write a paper and submit it to a top ML conference before the end of the extension. We will also try to produce other outputs, such as intermediate blog posts.

 We have a large compute budget available for the stream, allowing us to run numerous experiments for fine-tuning or generating big data pipelines.

 Mentorship style

 We have two weekly 60-minute calls by default. Since everyone will work on the same project, these calls will include all participants of the stream. I respond to asynchronous messages on Slack daily. Scholars will have a lot of freedom in day-to-day decisions and direction setting. In the best case, you will understand the project better than I do after a few weeks and have a clear vision for where it should be heading. I recommend that scholars focus 100% of their work time on the project and not pursue anything on the side. I think this is how people learn the most in MATS.

 Representative papers

 AI Control: Improving Safety Despite Intentional Subversion: as a general intro to control and a reference for which techniques have already been tried.
 Shell Games: Control Protocols for Adversarial AI Agents: as control work in an agentic setting (note: this is an old version).
 Towards evaluations-based safety cases for AI scheming: for general context on why we want to build monitors to begin with.
 Frontier Models are Capable of In-context Scheming : for

... (truncated, 5 KB total)

Resource ID: e56951e0e87f9174 | Stable ID: sid_kIF6pA5T6A