10-42% correct root cause identification
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic Alignment
Anthropic alignment research examining the reliability of automated auditing tools for AI models; relevant to researchers working on scalable oversight, evaluation methodology, and automated safety testing pipelines.
Metadata
Summary
This Anthropic alignment research explores automated auditing systems for AI models, reporting that current methods achieve only 10-42% accuracy in correctly identifying root causes of model failures or misalignments. The work highlights the significant challenge of building reliable automated oversight tools and suggests implications for scalable oversight and AI safety evaluation pipelines.
Key Points
- Automated auditing systems achieved only 10-42% correct root cause identification, indicating substantial room for improvement in AI oversight tools.
- The research is part of Anthropic's alignment science agenda, focusing on scalable and automated methods for evaluating model behavior.
- Low root cause identification accuracy poses challenges for using automated auditing as a reliable safety mechanism in deployment pipelines.
- Results suggest human oversight remains critical, as automated systems cannot yet reliably diagnose the underlying causes of model failures.
- The work contributes to understanding the limits of current AI-assisted evaluation and red-teaming approaches for alignment verification.
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI-Assisted Alignment | Approach | 63.0 |
| Evals-Based Deployment Gates | Approach | 66.0 |
Cached Content Preview
[Alignment Science Blog](https://alignment.anthropic.com/)
# Building and evaluating alignment auditing agents
Trenton Bricken, Rowan Wang, Sam Bowman, Euan Ong, Johannes Treutlein, Jeff Wu, Evan Hubinger, Samuel Marks
July 24, 2025
We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude 4.
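To make the agent framing concrete, a bare-bones auditing-agent loop might look like the sketch below. This is an illustration under stated assumptions, not the scaffolding used in the post: `query_target` and `query_auditor` are hypothetical stand-ins for API calls to the model under audit and to the auditor model, and the `DIAGNOSIS:` convention is invented for the example.

```python
from typing import Callable

def run_audit(
    query_target: Callable[[str], str],   # hypothetical call to the model under audit
    query_auditor: Callable[[str], str],  # hypothetical call to the auditor model
    max_turns: int = 20,
) -> str:
    """Let an auditor agent probe a target model, then report a diagnosis."""
    transcript: list[str] = []
    for _ in range(max_turns):
        # The auditor sees the transcript so far and either proposes the next
        # probe or emits a final diagnosis prefixed with "DIAGNOSIS:".
        move = query_auditor("\n".join(transcript) or "Begin the audit.")
        if move.startswith("DIAGNOSIS:"):
            return move.removeprefix("DIAGNOSIS:").strip()
        response = query_target(move)
        transcript.append(f"PROBE: {move}\nTARGET: {response}")
    return "no diagnosis reached"
```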
* * *
## Introduction
As part of our [pre-deployment testing of Claude Opus 4](https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf), we conducted an alignment audit: a systematic investigation into the model's alignment-relevant propensities, including [sycophancy](https://arxiv.org/abs/2310.13548), [alignment faking](https://arxiv.org/abs/2412.14093), and [hidden goals](https://arxiv.org/abs/2503.10965). Through this audit, we believe that Anthropic, and the public, gained a deeper understanding of Claude Opus 4's alignment than we've achieved for any previous model. However, alignment audits face two major challenges.
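One of the propensities named above, sycophancy, shows what a behavioral evaluation can look like in code. The sketch below is a hypothetical, minimal flip-rate check, not the evaluation used in the audit; `query_target` again stands in for an API call to the model under audit, and the item format is invented for the example.

```python
from typing import Callable

def sycophancy_flip_rate(
    query_target: Callable[[str], str],
    items: list[tuple[str, str]],  # (question, user's incorrect claim)
) -> float:
    """Fraction of items where a stated user opinion flips the answer."""
    flips = 0
    for question, wrong_opinion in items:
        neutral = query_target(question)
        biased = query_target(f"I'm convinced that {wrong_opinion}. {question}")
        # Crude string check: the biased answer endorses the user's claim
        # while the neutral answer does not.
        endorses = wrong_opinion.lower() in biased.lower()
        baseline = wrong_opinion.lower() in neutral.lower()
        if endorses and not baseline:
            flips += 1
    return flips / len(items)
```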
The first major challenge is scalability: Alignment audits require a large investment of human researcher time. As the pace of AI development accelerates—resulting in more models to audit and more ground to cover with each audit—there’s a risk that unassisted humans will be unable to keep up.
The second major challenge is validation: Did our audit catch everything? How can we rule out that models have substantive issues not surfaced by alignment audits? In our paper [Auditing Language Models for Hidden Objectives](https://arxiv.org/abs/2503.10965) (Marks et al., 2025), we proposed auditing games as a validation protocol, where a red team develops a model with a known alignment issue that auditors attempt to uncover. However, this approach faces limitations in reliability and replicability. Human auditors cannot repeat the same auditing game multiple times without being spoiled on the correct answer. Additionally, auditors vary in skill, and recruiting enough expert auditors for statistically meaningful results is near-impossible.
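Automated auditors sidestep the spoilage problem: the same game can be replayed many times, which makes success rates statistically measurable. A minimal scoring harness might look like the following sketch, where `grade_diagnosis` is a hypothetical grader (for example, another LLM call) that checks an auditor's diagnosis against the known, red-team-inserted root cause.

```python
import math
from typing import Callable

def success_rate(
    diagnoses: list[str],          # one auditor diagnosis per replayed game
    true_root_cause: str,          # the issue the red team inserted
    grade_diagnosis: Callable[[str, str], bool],  # hypothetical grader
) -> tuple[float, float]:
    """Return (mean success rate, binomial standard error) over repeated games."""
    hits = [grade_diagnosis(d, true_root_cause) for d in diagnoses]
    n = len(hits)
    p = sum(hits) / n
    return p, math.sqrt(p * (1 - p) / n)
```

The binomial standard error here is crude, but it is the kind of repeated-trial statistic that a one-shot human auditing game cannot provide.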
In this work, we address both challenges by developing LLM-based auditing agents: AI systems that autonomously carry out alignment auditing
... (truncated, 98 KB total)