Measuring and improving coding audit realism with deployment resources

web

2026·LessWrong·lesswrong.com/posts/EjxzHh5Guhxc9DLW2/measuring-and-impro...

Authors

Connor Kissane·Monte M·Fabien Roger

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: LessWrong

Relevant to researchers designing AI safety evaluations and audits, particularly those concerned with ensuring that safety testing reflects real deployment conditions and can detect subtle misalignment or deceptive behavior.

Forum Post Details

Forum

lesswrong

Forum Tags

Language Models (LLMs) · AI

Metadata

Importance: 62/100 · analysis

Summary

This post investigates the realism of agentic coding audits used in AI safety evaluations. It examines how giving the auditor real deployment resources (system prompts, tool definitions, and codebases) affects how distinguishable audit transcripts are from real deployment interactions, and it proposes a metric, realism win rate, for measuring and improving audit realism.

Key Points

• Coding audits are a key tool for evaluating AI safety, but their realism compared to actual deployment conditions is often unclear.
• Giving the auditor real deployment resources (system prompts, tool definitions, and codebases) makes audit transcripts harder to distinguish from real deployment interactions.
• The post proposes realism win rate, a metric that quantifies how often an LLM judge mistakes an audit transcript for a real deployment transcript.
• Improving audit realism is important for catching deceptive alignment and subtle misalignment that only manifests under real-world conditions.
• Practical recommendations are offered for calibrating audit procedures to better approximate deployment-time threat models.

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 58 KB

# Measuring and improving coding audit realism with deployment resources

By Connor Kissane, Monte M, Fabien Roger

Published: 2026-03-23

**TL;DR** We study *realism win rate*, a metric for measuring how distinguishable [Petri](https://alignment.anthropic.com/2025/petri/) audit transcripts are from real deployment interactions. We use it to evaluate the effect of giving the auditor real deployment resources (system prompts, tool definitions, and codebases). Providing these resources to the auditor increases the average realism win rate from 4.6% to 32.8% for reward hacking audits, and doesn’t significantly change hack rate.

💻[Code](https://github.com/ckkissane/petri-realism-win-rate)

Research done as part of the [Anthropic Fellows Program](https://alignment.anthropic.com/2024/anthropic-fellows-program/).

Introduction
============

Automated auditing agents like [Petri](https://alignment.anthropic.com/2025/petri/) are increasingly used to evaluate the propensities of LLMs. One central challenge is *eval-awareness*: models recognizing when they are being tested and then adjusting their behavior. In the worst case, a scheming model could fake alignment whenever it suspects being in an evaluation, causing auditors to systematically underestimate risk. Currently the Petri auditor must construct system prompts, tool definitions, and codebases from scratch, and these synthetic artifacts often contain tells that a capable model can detect. This motivates the need to measure and improve the realism of audits.

Previous work primarily measured realism in one of two ways. The first checks whether models verbalize situational awareness ([Goldowsky-Dill et al.](https://www.lesswrong.com/posts/E3daBewppAiECN3Ao/claude-sonnet-3-7-often-knows-when-it-s-in-alignment), [Sonnet 4.5 System Card](https://www-cdn.anthropic.com/963373e433e489a87a10c823c52a0a013e9172dd.pdf)), which fails to catch unverbalized situational awareness. The second asks the model post hoc whether a transcript was realistic ([Needham et al.](https://arxiv.org/abs/2505.23836?), [Ivanov and Kadochnikov](https://www.lesswrong.com/posts/TBk2dbWkg2F7dB3jb/it-s-hard-to-make-scheming-evals-look-realistic-for-llms), [Williams et al.](https://alignment.openai.com/prod-evals/)), which is hard to interpret because it’s unclear how well calibrated models are at answering this kind of question.

In this work, we:

1.  Measure the realism of agentic coding audits with *realism win rate* evals, a metric based on how often an evaluation transcript, paired alongside a real deployment transcript, is falsely deemed to be the 'real' transcript by an LLM judge (building on [Souly et al.](https://www.aisi.gov.uk/blog/investigating-models-for-misalignment)). We find this metric is smoother and more reliable than verbalized eval-awareness rates or post-hoc realism ratings.
2.  Show that grounding the auditor in real deployment resources improves realism in agentic coding settings. Giving the Petri auditor access to real system prompts, tool se

... (truncated, 58 KB total)
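
The realism win rate described in item 1 of the introduction can be read as a simple proportion: out of all (audit transcript, real deployment transcript) pairs shown to the judge, the fraction in which the judge picks the audit transcript as the 'real' one. The sketch below is a minimal illustration of that arithmetic only, not the authors' implementation. The `PairJudgment` type, its fields, and the example data are hypothetical names introduced here, and the judge call itself is left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairJudgment:
    """One judge verdict on an (audit, real deployment) transcript pair.

    Hypothetical record type for illustration; not taken from the post's code.
    """
    audit_id: str              # made-up identifier for the audit transcript
    judge_picked_audit: bool   # True if the judge deemed the audit transcript 'real'


def realism_win_rate(judgments: Sequence[PairJudgment]) -> float:
    """Fraction of pairs in which the audit transcript was judged to be the real one."""
    if not judgments:
        raise ValueError("need at least one judgment to compute a win rate")
    wins = sum(1 for j in judgments if j.judge_picked_audit)
    return wins / len(judgments)


# Made-up verdicts: the judge is fooled on one pair out of four.
example = [
    PairJudgment("audit-1", True),
    PairJudgment("audit-2", False),
    PairJudgment("audit-3", False),
    PairJudgment("audit-4", False),
]
print(f"{realism_win_rate(example):.1%}")  # prints 25.0%
```

Under this reading, a judge that cannot tell audit transcripts from real ones and guesses at random would land near 50%, so values well below that indicate the audit transcripts are still recognizable.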

Resource ID: 0d9a61c706a2e662 | Stable ID: sid_crbCJPRgMV