Petri: An Open-Source Auditing Tool to Accelerate AI Safety Research

web

Anthropic·anthropic.com/research/petri-open-source-auditing

Credibility Rating

4/5

High(4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic

Released October 2025 by Anthropic, Petri is a practical open-source tool that lowers the barrier to systematic AI behavioral auditing, relevant to researchers building safety evaluations or studying model behavior at scale.

Metadata

Importance: 72/100tool pagetool

Summary

Petri (Parallel Exploration Tool for Risky Interactions) is Anthropic's open-source automated auditing framework that deploys AI agents to test target models through diverse multi-turn conversations, then scores and summarizes behaviors. It addresses the scaling problem of manual model auditing by automating hypothesis testing across behaviors like deception, sycophancy, and self-preservation. The tool was used in Claude 4 system cards and by the UK AI Security Institute for evaluations.

Key Points

•Automates multi-turn conversation testing of AI models using an auditor agent that plans, interacts, and scores target model behavior across safety-relevant dimensions.
•Researchers provide natural language 'seed instructions' describing scenarios to investigate; Petri runs them in parallel and surfaces the most concerning transcripts for human review.
•Piloted across 14 frontier models using 111 diverse seed instructions covering deception, sycophancy, situational awareness, whistleblowing, and self-preservation behaviors.
•Used in Claude 4 and Claude Sonnet 4.5 system cards, cross-lab comparisons with OpenAI, and adopted by the UK AI Security Institute for pre-release testing.
•Open-source release enables external researchers and auditors to build both one-off exploratory evaluations and systematic benchmarks with minimal manual effort.

Cited by 1 page

Page	Type	Quality
AI Accident Risk Cruxes	Crux	67.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202610 KB

Alignment Petri: An open-source auditing tool to accelerate AI safety research

 Oct 6, 2025 Read the technical report Petri (Parallel Exploration Tool for Risky Interactions) is our new open-source tool that enables researchers to explore hypotheses about model behavior with ease. Petri deploys an automated agent to test a target AI system through diverse multi-turn conversations involving simulated users and tools; Petri then scores and summarizes the target’s behavior. 

 This automation handles a significant part of the work that one needs to do to build a broad understanding of a new model, and makes it possible to test many individual hypotheses about how a model might behave in some new circumstance with only minutes of hands-on effort.

 As AI becomes more capable and is deployed across more domains and with wide-ranging affordances, we need to evaluate a broader range of behaviors. This makes it increasingly difficult for humans to properly audit each model—the sheer volume and complexity of potential behaviors far exceeds what researchers can manually test.

 We’ve found it valuable to turn to automated auditing agents to help address this challenge. We used them in the Claude 4 and Claude Sonnet 4.5 System Cards to better understand behaviors such as situational awareness, whistleblowing, and self-preservation, and adapted them for head-to-head comparisons between heterogeneous models as part of a recent exercise with OpenAI . Our recent research release on alignment-auditing agents found these methods can reliably flag concerning behaviors in many settings. The UK AI Security Institute also used a pre-release version of Petri to build evaluations that they used in their testing of Sonnet 4.5.

 Manually building alignment evaluations often involves constructing environments, running models, reading transcripts, and aggregating the results. Petri automates much of this process. Researchers give Petri a list of seed instructions targeting scenarios and behaviors they want to test. Petri then operates on each seed instruction in parallel. For each seed instruction, an auditor agent makes a plan and interacts with the target model in a tool use loop. At the end, a judge scores each of the resulting transcripts across multiple dimensions so researchers can quickly search and filter for the most interesting transcripts. Researchers provide seed instructions in natural language describing what they want to investigate, and Petri handles the rest in parallel. The system simulates realistic environments and conducts multi-turn conversations with target models. At the end, LLM judges score each conversation across multiple safety-relevant dimensions and surface the most concerning transcripts for human review.

 Broad-coverage pilot-alignment evaluations

 Petri is a tool designed to support others in building evaluations, both for one-off exploration and more systematic benchmarking. As a pilot demonstration of its capabilities, we tested Petr

... (truncated, 10 KB total)

Resource ID: 62c583fb4c6af13a | Stable ID: sid_g69C9ODv7f