Skip to content
Longterm Wiki

Untrusted Monitoring is Default; Trusted Monitoring is not

web

Author

J Bostock

Credibility Rating

3/5
Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: LessWrong

A LessWrong post contributing to the AI control literature, particularly relevant to debates about how to structure AI oversight hierarchies and what assumptions are warranted when using AI systems to monitor other AI systems.

Forum Post Details

Karma
30
Comments
0
Forum
lesswrong
Forum Tags
AI

Metadata

Importance: 62/100analysis

Summary

This post argues that in AI control frameworks, untrusted monitoring (using potentially misaligned AI to oversee other AI) should be treated as the baseline assumption, while trusted monitoring requires additional justification. It explores the implications of this framing for how we design AI oversight systems and what safety guarantees we can realistically claim.

Key Points

  • Untrusted monitoring—where the monitoring AI may itself be misaligned—is the natural default state, not a degraded fallback.
  • Trusted monitoring requires strong additional justifications and cannot simply be assumed as the starting point for safety arguments.
  • This reframing has significant implications for how robust AI control protocols need to be designed.
  • The post challenges assumptions in AI control literature that treat trusted oversight as an accessible or easy-to-achieve baseline.
  • Designing safety systems with untrusted monitoring as the default leads to more conservative and robust safety guarantees.

Cached Content Preview

HTTP 200Fetched Apr 7, 20268 KB
# Untrusted Monitoring is Default; Trusted Monitoring is not
By J Bostock
Published: 2026-03-20
***These views are my own and not necessarily representative of those of any colleagues with whom I have worked on AI control.***

*TL;DR: It's much cheaper and quicker to just throw some honeypots at your monitor models than to robustly prove trustedness for every model you want to use. Therefore I think the most likely future involves untrusted monitoring with some monitor validation as a default path.*

This post talks about two different ways of monitoring AI systems, trusted monitoring, and untrusted monitoring. You can find papers distinguishing between them [here](https://arxiv.org/pdf/2312.06942), [here](https://arxiv.org/pdf/2504.10374), and [here](https://arxiv.org/pdf/2602.20628), a video [here](https://www.youtube.com/watch?v=0pgEMWy70Qk), or you can look at this graphic which didn't make it into Gardner-Challis et. al. (2026):

![Abstract.png](https://res.cloudinary.com/lesswrong-2-0/image/upload/v1774000302/lexical_client_uploads/vemn0vs51gl80u8qqqdy.png)

Context
-------

From when I started working at untrusted monitoring until very recently[^8gw6vkx8dvp], I assumed that trusted monitoring was the default policy that AI companies would use, and that untrusted monitoring was an exotic, high-effort policy which would be much less likely to be used.

Now I think it's the other way round.

Full Trustedness is Hard
------------------------

The distinction between an untrusted model and a trusted one exists in the map. A trusted model is normally described as one for which we have **strong** evidence that it isn't coherently pursuing long-term misaligned goals. This kind of evidence is very difficult to find; mostly it just boils down to arguments from the model's inability to scheme very well.

[Buck has written about](https://www.lesswrong.com/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models) how this means we'll hit a ceiling on our ability to trust models: below that, the inability arguments hold and the models can be trusted; above it, we don't know.

In that scenario, somebody (AI companies?) is trying their best to assess and report trustedness. They are empowered to use a large amount of resources to get a good picture of the problem, and are empowered to throw their hands up and say "Yep, this model is beyond our evaluations, we cannot trust it." Their reports give a picture with a fairly clean divide, like the Agent-3/Agent-4 divide in [AI 2027](https://ai-2027.com/).

There are a few reasons I don't expect this to happen.

*   It's a lot of effort to run a high-quality eval regularly. OpenAI, Anthropic, and GDM have started to release models in units of 0.1, and have scaled back their capabilities evals to match. Claude has maxed out several rule-out evals, so its AI R&D capabilities were assessed by interviewing some Anthropic employees. I expect trustedness evals to suffer the same fate, giving blurry resul

... (truncated, 8 KB total)
Resource ID: 1e17a7ce0cf54e8e | Stable ID: sid_eX4O5A4wqb