Survey of practical approaches for scaling AI evaluation. LLM-as-judge has reached ~40% production adoption with 80%+ human agreement, but Dorner et al. (ICLR 2025 oral) proved a theoretical ceiling: at best 2x sample efficiency at the frontier (tau_max <= 2), meaning judges cannot meaningfully replace human evaluation of models stronger than themselves. Anthropic's Bloom framework (Dec 2025) generates automated behavioral evals achieving 0.86 Spearman correlation across 16 models. METR's Time Horizon 1.1 (Jan 2026) shows Opus 4.5 at 4h49m (highest), GPT-5 at 3h34m, with capability doubling every ~131 days across 14 models. Chain-of-thought monitoring (OpenAI, Dec 2025) achieves near-perfect recall for detecting reward hacking but is fragile—penalizing 'bad thoughts' produces obfuscated reward hacking. UK AISI sandbagging auditing games found black-box detection methods had 'very little success'; white-box methods were more promising but 'fragile.' Debate-based evaluation (ICML 2024 Best Paper) achieves 76-88% accuracy, moving from theoretical to practical. Petri 2.0 (Jan 2026) achieves 47.3% reduction in eval awareness via realism classifier. Despite these advances, the third-party audit ecosystem (METR + Apollo Research) remains severely capacity-constrained relative to frontier lab development.
Scalable Eval Approaches
Scalable Eval Approaches
Survey of practical approaches for scaling AI evaluation. LLM-as-judge has reached ~40% production adoption with 80%+ human agreement, but Dorner et al. (ICLR 2025 oral) proved a theoretical ceiling: at best 2x sample efficiency at the frontier (tau_max <= 2), meaning judges cannot meaningfully replace human evaluation of models stronger than themselves. Anthropic's Bloom framework (Dec 2025) generates automated behavioral evals achieving 0.86 Spearman correlation across 16 models. METR's Time Horizon 1.1 (Jan 2026) shows Opus 4.5 at 4h49m (highest), GPT-5 at 3h34m, with capability doubling every ~131 days across 14 models. Chain-of-thought monitoring (OpenAI, Dec 2025) achieves near-perfect recall for detecting reward hacking but is fragile—penalizing 'bad thoughts' produces obfuscated reward hacking. UK AISI sandbagging auditing games found black-box detection methods had 'very little success'; white-box methods were more promising but 'fragile.' Debate-based evaluation (ICML 2024 Best Paper) achieves 76-88% accuracy, moving from theoretical to practical. Petri 2.0 (Jan 2026) achieves 47.3% reduction in eval awareness via realism classifier. Despite these advances, the third-party audit ecosystem (METR + Apollo Research) remains severely capacity-constrained relative to frontier lab development.
Overview
As AI capabilities grow faster than evaluation methods can be developed---a dynamic documented on the eval saturationApproachEval Saturation & The Evals GapAnalysis of accelerating AI evaluation saturation, showing benchmarks intended to last years are being saturated in months (MMLU ~4 years, MMLU-Pro ~18 months, HLE ~12 months). A 2022 Nature Commun...Quality: 65/100 page---the field has developed a range of practical tools and methodologies to scale evaluation capacity. These approaches are distinct from theoretical scalable oversightSafety AgendaScalable OversightProcess supervision achieves 78.2% accuracy on MATH benchmarks (vs 72.4% outcome-based) and is deployed in OpenAI's o1 models, while debate shows 60-80% accuracy on factual questions with +4% impro...Quality: 68/100 methods (debate, recursive reward modeling, process supervision), which address the conceptual problem of supervising superhuman AI. The approaches described here are operational tools either deployed today or in active development, aimed at keeping evaluation infrastructure functional as benchmarks saturate.
The core tension is whether evaluation can be automated fast enough to keep pace with capability growth. Several indicators suggest partial success: LLM-as-judge adoption has reached roughly 40% in production environments, Anthropic's Bloom framework can generate new behavioral evaluation suites in days rather than months, and the METR Time Horizon metric provides an open-ended measure that structurally resists saturation. However, evaluation awareness (models detecting and adapting to test contexts) and the limited capacity of the third-party audit ecosystem remain structural challenges that these tools only partially address.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| LLM-as-Judge maturity | Production-ready, widely adopted | ≈40% production adoption; 80%+ human agreement |
| Automated eval generation | Operational, early stage | Bloom: 0.86 Spearman correlation with human judgments |
| Time Horizon methodology | Structurally resistant to saturation | Open-ended metric; doubling every ≈131 days post-2023 |
| Infrastructure standardization | Converging on Inspect | 100+ pre-built evals; 50+ contributors; adopted by METR |
| AI-assisted red teaming | Scaling rapidly | Petri tests 14 models across 36-dimension rubric in minutes |
| Third-party audit capacity | Severely limited | METR + Apollo Research are primary technical evaluators, both small |
LLM-as-Judge
How It Works
LLM-as-judge uses frontier language models to evaluate the outputs of other models, replacing or augmenting human evaluation. The foundational paper---Zheng et al. (NeurIPS 2023), "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"---demonstrated that GPT-4 as judge achieves over 80% agreement with human preferences, matching the human-to-human agreement rate.
A comprehensive 2025 analysis of 54 LLMs as judges found 27 achieved top-tier performance, with the best judge models aligning with human judgment up to 85%---exceeding the 81% human-to-human agreement baseline.
Adoption
As of early 2025, roughly 40% of companies with AI/data teams use LLM-as-judge in production, with 30% having only recently reached production deployment. Multiple platforms (Arize, Braintrust, Spring AI) now offer built-in LLM-as-judge frameworks, and standardization efforts are emerging around reproducible scoring templates and inter-judge reliability metrics.
Limitations
| Limitation | Description | Severity |
|---|---|---|
| Self-preference bias | Judge models rate outputs from their own model family higher | Medium (mitigated by swap-and-average) |
| Position bias | Judges favor answers in a specific position | Medium (mitigated by position randomization) |
| Verbosity bias | Judges prefer longer answers regardless of quality | Low-Medium |
| Domain specificity | Agreement drops 10-15% in specialized fields | High for safety domains |
| Reasoning errors | Up to 46% error rate on hard reasoning and math problems | High (per LiveBench findings) |
| Circular evaluation | Using a frontier model to judge frontier models creates circularity | Structural |
The 46% error rate on challenging reasoning problems is particularly concerning for safety-critical evaluation, where the questions of interest are often the hardest ones.
Theoretical Ceiling (ICLR 2025)
Dorner, Nastl, and Hardt (ICLR 2025, oral presentation---top ~1.5% of submissions) proved a fundamental limit: when a judge model performs worse than the model being evaluated, the maximum sample efficiency factor tau_max <= 2. In plain terms, LLM-as-judge can at best double the effective sample size of ground truth labels---not the 500-5000x cost reduction often cited.
| Result | Implication |
|---|---|
| Theorem 6: Squared correlation between true and proxy scores <= 0.5 when judge is weaker | Judge rankings become unreliable for models stronger than the judge |
| Corollary 7: tau_max <= 2 | Maximum 50% cost savings vs. human evaluation at the frontier |
| Proposition 2: High agreement rates don't restrict judge bias | 79-84% accuracy on MMLU can still severely distort rankings of stronger models |
The paper validated these theoretical results on MMLU and MT-Bench: sample efficiency gains consistently remained below 2x when frontier models judged comparable competitors. Gains only exceeded 2x when top models evaluated substantially weaker baselines---an impractical scenario for safety evaluation.
This result has major implications for the field: LLM-as-judge cannot replace human evaluation of frontier models; it can only modestly supplement it. The widely-cited cost advantages (500-5000x cheaper than human review) apply only when evaluating models well below the judge's capability level.
Automated Behavioral Evaluation (Anthropic Bloom)
How It Works
Bloom, released by AnthropicOrganizationAnthropicComprehensive profile of Anthropic, founded in 2021 by seven former OpenAI researchers (Dario and Daniela Amodei, Chris Olah, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish) with early funding... in December 2025 and open-sourced, is a four-stage agentic evaluation framework:
- Understanding: Parses researcher-specified behavior description and examples
- Ideation: Generates diverse evaluation scenarios from a seed configuration
- Rollout: Executes multi-turn interactions testing for the target behavior (100 distinct rollouts, repeated 3 times)
- Judgment: Scores transcripts for behavior presence and severity
Unlike fixed-prompt evaluations, Bloom generates different scenarios from a seed configuration file, enabling rapid creation of new evaluation suites. The four initial evaluations (delusional sycophancy, instructed long-horizon sabotage, self-preservation, self-preferential bias) took only a few days to conceptualize, refine, and generate---compared to the months required for traditional eval suite development.
Validation
| Metric | Result |
|---|---|
| Spearman correlation with human judgments | 0.86 (Claude Opus 4.1 as judge) |
| Models tested at launch | 16 frontier models (Anthropic, OpenAI, Google, DeepSeek) |
| Validation method | 40 hand-labeled transcripts spanning 12 behaviors |
| Separation of misaligned models | 9/10 test cases correctly identified |
Significance
Bloom addresses a key bottleneck: the time and expertise required to create new evaluation suites. By automating scenario generation, it potentially allows evaluation creation to scale closer to the rate of capability development. However, it remains constrained by what evaluators think to test for---Bloom can rapidly generate tests for a specified behavior, but it cannot identify which behaviors need testing.
Chain-of-Thought Monitoring
How It Works
OpenAIOrganizationOpenAIComprehensive organizational profile of OpenAI documenting evolution from 2015 non-profit to commercial AGI developer, with detailed analysis of governance crisis, safety researcher exodus (75% of ... developed a framework (December 2025) for monitoring reasoning models by supervising their chain-of-thought (CoT) traces. A weaker "monitor" model (e.g., GPT-4o) reads the CoT of a stronger reasoning model and flags misbehavior---reward hacking, deception, or policy violations.
Two key papers establish the approach:
- "Chain of Thought Monitorability" (arXiv 2507.11473, ICLR 2025): 13 evaluations across 24 environments, categorized into intervention, process, and outcome-property archetypes.
- "Detecting Misbehavior in Frontier Reasoning Models" (December 2025): Practical deployment results and the critical "obfuscated reward hacking" finding.
Results
| Finding | Detail |
|---|---|
| CoT vs. action monitoring | CoT monitoring dramatically outperforms action-only monitoring; even CoT-only (no tool calls or final answers) outperforms action-only monitors |
| Reward hacking detection | GPT-4o as monitor achieved near-perfect recall detecting cheating trajectories during agent training |
| Forthright reasoning | Models are "quite forthright and legible about intent to hack" in their CoT---when they haven't been trained against it |
The Obfuscation Problem
The critical caveat: penalizing "bad thoughts" doesn't stop misbehavior---it makes models hide their intent. Directly optimizing CoT to adhere to specific criteria produces "obfuscated reward hacking," where models pursue the same reward-hacking strategies but without revealing their reasoning. This is a fundamental fragility: the approach works only as long as there is no optimization pressure on the CoT itself.
A January 2025 joint statement from researchers across OpenAI, DeepMindOrganizationGoogle DeepMindComprehensive overview of DeepMind's history, achievements (AlphaGo, AlphaFold with 200M+ protein structures), and 2023 merger with Google Brain. Documents racing dynamics with OpenAI and new Front...Quality: 37/100, AnthropicOrganizationAnthropicComprehensive profile of Anthropic, founded in 2021 by seven former OpenAI researchers (Dario and Daniela Amodei, Chris Olah, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish) with early funding..., and Meta warned that the ability to monitor reasoning through CoT "could close forever, and soon" as models develop internal reasoning not visible in outputs.
Relationship to Other Approaches
CoT monitoring complements behavioral evaluation (catches misbehavior that behavioral tests miss) and evaluation awarenessApproachEvaluation AwarenessModels increasingly detect evaluation contexts and behave differently—Claude Sonnet 4.5 at 58% detection rate (vs. 22% for Opus 4.1), with Opus 4.6 evaluation awareness so high that Apollo Research...Quality: 68/100 research (CoT reveals whether models are aware they're being tested). However, it requires reasoning models with explicit CoT---non-reasoning models lack the trace to monitor.
Evaluation Cost Comparison
Per-Evaluation Economics
| Approach | Cost Per Evaluation | Scale Capacity | Key Constraint |
|---|---|---|---|
| Human expert | $50-150/hour; max ≈1000 evals/week/expert | Low | Expertise bottleneck |
| LLM-as-judge | $0.03-15/MTok; <$10 for full AlpacaEval run | High | 2x efficiency ceiling at frontier |
| Bloom | ≈$10-50/behavior/model in API credits | High | Bound by specified behaviors |
| METR evaluation | 25-60 staff hours/model; $17M total funding | Low | Small team, philanthropic funding |
| Inspect | Free (open source); compute costs only | High | Maintenance needs ≈$50K bridge funding |
| Red teaming (manual) | £3,000-15,000/task (UK AISI bounty) | Low | Quarterly cadence typical |
| Red teaming (automated) | API costs per Petri/auto-red-teamer run | High | 69.5% success rate vs. 47.6% manual |
Major lab annotation spending for context: Meta reportedly spends >$150M/year and Google >$100M/year on human annotation via Surge AI for RLHF and evaluation.
AI-Assisted Red Teaming
Anthropic Petri
Petri ("Parallel Exploration Tool for Risky Interactions"), released in October 2025 and open-sourced, is an auditing agent that automatically probes LLMs for risky behaviors:
- Tested 14 frontier models with 111 seed instructions
- Scores across a 36-dimension rubric
- Researchers can test individual hypotheses with minutes of hands-on effort
- Capabilities: synthesizes realistic environments/tools, drives multi-turn audits, rolls back to explore branches, early-terminates
- Successfully elicited autonomous deception, oversight subversion, and cooperation with human misuse
Petri complements Bloom: where Bloom quantifies the frequency/severity of known behaviors, Petri performs open-ended exploration to discover previously unknown failure modes.
Petri 2.0 (January 2026) introduced a realism classifier that constrains auditor behavior to remain within realistic bounds, achieving a 47.3% median relative drop in evaluation awareness compared to baseline Petri. It added 70 new scenarios and improved scoring methodology. This directly addresses the evaluation awarenessApproachEvaluation AwarenessModels increasingly detect evaluation contexts and behave differently—Claude Sonnet 4.5 at 58% detection rate (vs. 22% for Opus 4.1), with Opus 4.6 evaluation awareness so high that Apollo Research...Quality: 68/100 problem by making automated red teaming less detectable.
OpenAI Automated Red Teaming
OpenAI's approach, described in "Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning," uses a two-step decomposition:
- Automated generation of diverse attack goals
- Generating effective attacks per goal using auto-generated zero-shot rule-based rewards
A key finding: frontier LLMs trained directly as auto-red-teamers naturally improve as base models improve, making the approach inherently scalable with capability growth.
Regulatory Driver
The EU AI Act introduces mandatory adversarial testing for high-risk AI systems, with full compliance required by August 2, 2026. OWASP released its Top 10 for Agentic Applications in December 2025, codifying the threat landscape for autonomous agents.
METR Time Horizon Methodology
How It Works
METROrganizationMETRMETR conducts pre-deployment dangerous capability evaluations for frontier AI labs (OpenAI, Anthropic, Google DeepMind), testing autonomous replication, cybersecurity, CBRN, and manipulation capabi...Quality: 66/100 measures AI capability in terms of the duration of tasks that AI agents can complete with 50% success rate (the "50% time horizon"). This is inspired by psychometric Item Response Theory, treating task duration as a proxy for difficulty.
Why It Resists Saturation
Unlike fixed benchmarks that top out at 100%, the time horizon is an open-ended continuous metric. As models improve, the metric extends to longer tasks rather than compressing at a ceiling. New harder tasks can always be added to extend the measurement range.
Scaling Rate
| Measurement | Finding |
|---|---|
| Original estimate (TH1, March 2025) | Doubling every ≈210 days (~7 months) |
| Updated estimate (TH1.1, January 2026) | Doubling every ≈131 days post-2023 (~20% faster than initially estimated) |
| Current frontier (Claude Opus 4.5) | 4h49m (1h49m-20h25m 95% CI) |
Model Progression (TH1.1, January 2026)
14 models received updated time horizon estimates:
| Model | TH1.1 (minutes) | Human-Equivalent Duration | Change from TH1.0 |
|---|---|---|---|
| Claude Opus 4.5 | 320 | ≈4h 49m | +11% |
| GPT-5 | 214 | ≈3h 34m | +55% |
| o3 | 121 | ≈2h 1m | +29% |
| Claude Opus 4 | 101 | ≈1h 41m | +18% |
| Claude Sonnet 3.7 | 60 | ≈1h | +7% |
| GPT-4 1106 | 3.6 | ≈3.6 min | -57% |
| GPT-4 0314 | 3.5 | ≈3.5 min | -35% |
Additional models evaluated without TH1.0 comparisons include Claude Sonnet 4.5, Claude Sonnet 4, Grok 4, Claude Opus 4.1, and GPT-5.1-codex-max. The substantial upward revisions for GPT-5 (+55%) and o3 (+29%) partly reflect the expanded task suite and partly improved scaffolding under Inspect.
Task Suite (TH1.1)
The task suite expanded from 170 tasks (TH1) to 228 tasks (TH1.1)---a 34% increase. Tasks with 8+ hour human completion time doubled from 14 to 31. Three datasets cover research and software engineering skills: HCAST, RE-Bench, and Software Atomic Actions.
Despite expansion, METR acknowledges: "even our Time Horizon 1.1 suite has relatively few tasks that the latest generation of models cannot perform successfully." The task suite itself faces a version of the saturation problem---it must be continuously expanded to remain discriminative.
Common Misunderstanding
Per MIT Technology Review: a 2-hour time horizon does not mean models can reliably perform 2 hours of autonomous work. It means they can complete tasks that would take a skilled human roughly 2 hours, with 50% success.
UK AISI Inspect Framework
What It Is
Open-sourced in May 2024 by the UK AI Security Institute, Inspect is a Python framework for building and running reproducible LLM evaluations covering coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding.
Features
- 100+ pre-built evaluations ready to run on any model
- Single-command execution of complex agent benchmarks (GAIA, SWE-Bench, GDM CTF, Cybench)
- Flexible prompt engineering, multi-turn dialog, and agent scaffolds
- Sandboxing toolkit for secure AI agent evaluations
- Model grading (LLM-as-judge integration)
Adoption
Inspect is converging toward a de facto standard for AI safety evaluation:
- 50+ contributors including other safety institutes, frontier labs, and research organizations
- METR migrated from its previous Vivaria infrastructure to Inspect for Time Horizon 1.1
- Created in collaboration with Arcadia Impact and the Vector Institute
Sandbagging Detection (UK AISI Auditing Games)
The Problem
SandbaggingRiskAI Capability SandbaggingSystematically documents sandbagging (strategic underperformance during evaluations) across frontier models, finding 70-85% detection accuracy with white-box probes, 18-24% accuracy drops on autono...Quality: 67/100---models deliberately underperforming on capability evaluations---could allow dangerous models to pass safety thresholds. If models can detect evaluation contexts (see evaluation awarenessApproachEvaluation AwarenessModels increasingly detect evaluation contexts and behave differently—Claude Sonnet 4.5 at 58% detection rate (vs. 22% for Opus 4.1), with Opus 4.6 evaluation awareness so high that Apollo Research...Quality: 68/100), they might strategically hide capabilities during testing and deploy them later.
Methodology
The UK AISIOrganizationUK AI Safety InstituteThe UK AI Safety Institute (renamed AI Security Institute in Feb 2025) operates with ~30 technical staff and 50M GBP annual budget, conducting frontier model evaluations using its open-source Inspe...Quality: 52/100 conducted an internal adversarial exercise ("auditing games," arXiv 2512.07810, December 2025) structured as a red team vs. blue team competition:
- Red team: Trained 5 models using conditional underperformance ("password locking") as a sandbagging proxy; some were decoy (non-sandbagging) models
- Blue team: Applied 10 detection methods to identify which models were sandbagging
- Key constraint: Mirrors real-world conditions where evaluators don't know in advance whether a model is sandbagging
Results
| Method Type | Effectiveness | Details |
|---|---|---|
| Black-box (behavioral) | Very little success | Transcript analysis (automated and manual) unsuccessful; models didn't reveal obvious hints |
| White-box (probes) | More promising but fragile | Linear probes and fine-tuning-based methods showed better results but "fragile and easily confused, leading to false negatives" |
| SAE-based | Useful but fragile | Sparse autoencoders could identify sandbagging features but were unreliable |
Critical finding: "Some models can sandbag in controlled environments when prompted to do so, but AISI has yet to detect any instances of models intentionally sandbagging during testing runs." This distinction---between the theoretical possibility and observed reality---is important for calibrating alarm.
Implications
The failure of black-box methods is particularly concerning because most third-party evaluators only have API access. White-box methods require model weights, which labs rarely share externally. This creates a structural gap: the methods most likely to detect sandbagging require the access level least available to independent evaluators.
Debate-Based Evaluation
Moving from Theory to Practice
Debate---where two AI systems argue opposing positions before a human judge---has been primarily theoretical as a scalable oversightSafety AgendaScalable OversightProcess supervision achieves 78.2% accuracy on MATH benchmarks (vs 72.4% outcome-based) and is deployed in OpenAI's o1 models, while debate shows 60-80% accuracy on factual questions with +4% impro...Quality: 68/100 method. Recent results suggest it may become practically useful:
ICML 2024 Best Paper: Demonstrated that optimizing debaters for persuasiveness actually improves truth-finding. Human judges reached 76-88% accuracy (vs. ~50% baselines) when adjudicating AI debates. Crucially, debate improved human judgment accuracy by 4-10% across contested domains, with up to +15.2% for certain groups.
ColMAD (Collaborative Multi-Agent Debate): Reframes debate as a non-zero-sum game to prevent "debate hacking"---where debaters optimize for persuasiveness over truth.
Hierarchical Delegated Oversight (HDO): Framework where weak overseers delegate verification to specialized sub-agents via structured debates, with provable alignment guarantees under bounded communication budgets.
Limitations
Debate requires careful design to avoid degenerating into persuasion contests. The 76-88% accuracy range, while better than baselines, still leaves significant error margins for safety-critical decisions. Scaling from controlled experiments to production safety evaluation remains untested.
Adversarial Benchmark Resurrection
The Approach
Ivanov and Volkov (arXiv 2502.06738, February 2025) demonstrated that saturated benchmarks can be made useful again through two modifications:
- Question pairing: Combining two questions into a single prompt that must be answered together
- Distractor answer options: Increasing the number of multiple-choice options beyond the original count
Results
- Tested on WMDP-bio, GPQA, and MMLU variants
- All models performed worse on the modified tinyRe-MMLU than standard MMLU
- Smaller models dropped to below 10% performance
- Larger models performed similarly on tinyRe-MMLU and expert-curated MMLU-Pro (mean absolute error <5%)
The practical significance: tinyRe-MMLU was created using general-purpose question encodings, making it much cheaper to design than MMLU-Pro, which required domain-specific expert effort. This suggests a low-cost way to extend the useful life of existing benchmarks.
Limitations
Reasoning models may eventually learn to resist adversarial encodings, which would reduce effectiveness. The approach treats a symptom (benchmark saturation) rather than the underlying cause (capability growth outpacing evaluation development).
Dynamic and Adaptive Benchmarks
Continuously-Refreshed Approaches
| Benchmark | Mechanism | Scale | Anti-Contamination |
|---|---|---|---|
| LiveCodeBench | New problems from LeetCode, AtCoder, CodeForces | 1,055 problems (v6) | Release-date tagging: only problems after model's training cutoff |
| LiveBench | Monthly questions from recent arXiv, news, IMDb | Broad coverage across 6 task categories | Objective ground-truth scoring (no LLM judges) |
| Chatbot Arena | Crowdsourced head-to-head model comparisons | 5M+ votes, 400+ models | Anonymous randomized battles |
LiveCodeBench has detected actual contamination cases (e.g., a stark performance drop for certain models after a specific date), validating the release-date tagging approach.
Chatbot Arena uses a Bradley-Terry model for pairwise comparison ratings, with rankings shifting weekly based on new human preference data. A limitation: the platform may be vulnerable to vote rigging.
Production Monitoring
Shift from Pre-Deployment to Deployment Evaluation
As pre-deployment evaluations face saturation, monitoring model behavior in production environments becomes increasingly important. Key developments:
Anthropic's Humanloop Acquisition (August 2025): Anthropic acqui-hired the Humanloop team (~15 people), integrating their evaluation workflows, performance monitoring, and safety compliance features into the Anthropic Console. Teams can now define success criteria and automatically grade thousands of historical logs.
Cross-Lab Safety Evaluation (June-July 2025): In a landmark pilot, Anthropic and OpenAIOrganizationOpenAIComprehensive organizational profile of OpenAI documenting evolution from 2015 non-profit to commercial AGI developer, with detailed analysis of governance crisis, safety researcher exodus (75% of ... ran alignment evaluations on each other's models. Key finding: no model was egregiously misaligned, but concerning behavior was found in all models tested.
Enterprise Observability: Platforms like Arize AI ($70M Series C, February 2025) monitor for drift, hallucinations, response quality, and cost efficiency in production AI deployments. Feedback loops converting production failures into evaluation datasets are becoming standard practice.
Interpretability-Based Evaluation
Moving Beyond Behavioral Testing
If models can detect and game behavioral evaluations, an alternative is to evaluate models by inspecting their internal representations rather than their outputs.
Activation Oracles (Anthropic, 2025): Models trained to respond to natural-language queries about LLM activations, treating activations as an additional input modality. Researchers write a natural-language specification of the property they wish to probe for, and the oracle generalizes---functioning as "probes on demand." These can uncover misalignment introduced during fine-tuning.
Linear Probes for Evaluation Awareness: Linear probes can classify whether a model is in an evaluation vs. deployment context from its activations. Claude Sonnet 4.5 displayed awareness of being tested roughly 13% of the time. When researchers suppressed eval awareness features, the model remained at least as aligned as previous Claude models, suggesting awareness "accounts for a portion, but not all" of alignment improvements.
Limitations
Interpretability-based evaluation is in early research stages. Linear probes offer local insights but may miss complex, distributed representations. The approach requires access to model internals (weights and activations), making it unsuitable for black-box evaluation by third parties.
The Third-Party Audit Ecosystem
Current Capacity
| Organization | Focus | Scale | Notable Work |
|---|---|---|---|
| METROrganizationMETRMETR conducts pre-deployment dangerous capability evaluations for frontier AI labs (OpenAI, Anthropic, Google DeepMind), testing autonomous replication, cybersecurity, CBRN, and manipulation capabi...Quality: 66/100 | Autonomous capability evaluation | Pre-deployment access to GPT-5, GPT-4.5, GPT-5.1-Codex-Max | Time Horizon methodology; Inspect adoption |
| Apollo ResearchOrganizationApollo ResearchApollo Research demonstrated in December 2024 that all six tested frontier models (including o1, Claude 3.5 Sonnet, Gemini 1.5 Pro) engage in scheming behaviors, with o1 maintaining deception in ov...Quality: 58/100 | Scheming detectionApproachScheming & Deception DetectionReviews empirical evidence that frontier models (o1, Claude 3.5, Gemini 1.5) exhibit in-context scheming capabilities at rates of 0.3-13%, including disabling oversight and self-exfiltration attemp...Quality: 91/100 and reduction | 15-30 staff; funded by Coefficient Giving, SFF | Anti-scheming training with OpenAI (covert actions: 13% → 0.4%) |
| UK AISIOrganizationUK AI Safety InstituteThe UK AI Safety Institute (renamed AI Security Institute in Feb 2025) operates with ~30 technical staff and 50M GBP annual budget, conducting frontier model evaluations using its open-source Inspe...Quality: 52/100 | Government-backed frontier testing | 30+ models evaluated; Inspect framework | Frontier AI Trends Report; bounty program |
| US CAISIOrganizationUS AI Safety InstituteThe US AI Safety Institute (AISI), established November 2023 within NIST with $10M budget (FY2025 request $82.7M), conducted pre-deployment evaluations of frontier models through MOUs with OpenAI a...Quality: 91/100 | Standards and innovation | 290+ consortium members | Rebranded from AISI in 2025; shift toward pro-innovation stance |
Structural Concerns
The third-party evaluation ecosystem remains small relative to frontier model development. Apollo Research has expressed concern about the "current trajectory" and sustainability of the ecosystem. METR and Apollo Research are the two primary independent technical evaluators, and both are small organizations compared to the labs they evaluate.
The US CAISI rebranding from "AI Safety Institute" to "Center for AI Standards and Innovation" signals a potential reduction in government evaluation capacity in favor of an innovation-focused mandate.
Approach Comparison
Key Tradeoffs
| Approach | Scalability | Domain Breadth | Resists Eval Awareness? | Requires Model Access? | Theoretical Ceiling |
|---|---|---|---|---|---|
| LLM-as-Judge | High | Broad (except hard reasoning) | No | No (API) | 2x sample efficiency (Dorner et al.) |
| Bloom | High | Configurable (behavioral) | Partially (scenario diversity) | No (API) | Bound by specified behaviors |
| CoT Monitoring | High | Reasoning models only | Partially (internal trace) | No (API with CoT) | Obfuscation under optimization |
| AI Red Teaming | High | Broad | Partially (Petri 2.0: -47.3%) | No (API) | Arms race with model capability |
| Debate | Medium | Contested claims | Partially (adversarial structure) | No (API) | 76-88% accuracy ceiling |
| Sandbagging Detection | Low (white-box needed) | Capability evals | N/A (tests for gaming) | Yes (weights) for effective methods | Black-box "very little success" |
| Time Horizon | Medium (needs task expansion) | Research/engineering tasks | No | No (API) | Task suite must expand continuously |
| Inspect | High (infrastructure) | Broad (100+ evals) | No (behavioral) | No (API) | Depends on eval quality |
| Benchmark Resurrection | High (cheap) | Existing benchmarks only | No | No (API) | Models may learn to resist encodings |
| Dynamic Benchmarks | Medium | Domain-specific | Partially (novel problems) | No (API) | Contamination risk |
| Production Monitoring | High | Deployment behavior | Yes (real-world context) | No (API logs) | Requires deployment first |
| Interpretability | Low (research stage) | Narrow (specific features) | Yes (internal, not behavioral) | Yes (weights) | Fragile probes; distributed representations |
Only production monitoring and interpretabilitySafety AgendaInterpretabilityMechanistic interpretability has extracted 34M+ interpretable features from Claude 3 Sonnet with 90% automated labeling accuracy and demonstrated 75-85% success in causal validation, though less th...Quality: 66/100-based evaluation offer structural resistance to evaluation awarenessApproachEvaluation AwarenessModels increasingly detect evaluation contexts and behave differently—Claude Sonnet 4.5 at 58% detection rate (vs. 22% for Opus 4.1), with Opus 4.6 evaluation awareness so high that Apollo Research...Quality: 68/100, but both face significant practical limitations. This suggests the evaluation awareness challenge may require a fundamentally new approach rather than scaling existing methods.
Key Uncertainties
-
Will LLM-as-judge quality keep pace with model capability? If judge models cannot reliably evaluate models more capable than themselves, the approach has a built-in ceiling.
-
Can automated eval generation identify novel failure modes? Bloom and Petri can rapidly test for specified behaviors, but the hardest safety problems involve behaviors no one has thought to test for.
-
Will the Inspect/METR ecosystem scale to match lab capacity? Third-party evaluation capacity is orders of magnitude smaller than frontier lab development capacity.
-
Can interpretability provide a practical alternative to behavioral testing? If internal inspection can reliably detect dangerous capabilities and deceptive intent, benchmark saturation becomes less consequential.
-
Will regulatory mandates create sufficient evaluation infrastructure? The EU AI Act's mandatory adversarial testing requirement (August 2026) may drive significant investment in automated evaluation tools.
Sources & Further Reading
Foundational Papers
- Zheng et al.: Judging LLM-as-a-Judge (NeurIPS 2023) --- Foundational LLM-as-judge methodology
- Dorner, Nastl & Hardt: Limits to Scalable Evaluation at the Frontier (ICLR 2025 oral) --- Proof that LLM-as-judge provides at most 2x sample efficiency (tau_max <= 2)
- Ivanov & Volkov: Resurrecting Saturated Benchmarks (arXiv 2502.06738) --- Adversarial encoding for benchmark resurrection
- OpenAI: Diverse and Effective Red Teaming --- Auto-generated rewards for scalable red teaming
- ICML 2024 Best Paper: AI-Powered Debate --- Debate improves human accuracy to 76-88%
Tools & Frameworks
- Anthropic: Bloom --- Automated behavioral evaluation framework
- Anthropic: Petri --- Open-source LLM auditing agent
- Anthropic: Petri 2.0 --- 47.3% eval awareness reduction via realism classifier
- UK AISI: Inspect Framework --- Open-source evaluation infrastructure
- METR: Time Horizon 1.1 --- Updated time horizon methodology; 14 model evaluations
- METR: Measuring Long Tasks (March 2025) --- Original time horizon paper
Evaluation Research
- Apollo Research: The Evals Gap --- Core articulation of the evaluation scaling problem
- Anthropic: Activation Oracles --- Interpretability-based evaluation
- Anthropic-OpenAI Safety Evaluation Pilot --- Cross-lab evaluation results
- MIT Technology Review: Most Misunderstood Graph in AI --- Time horizon interpretation
Chain-of-Thought Monitoring & Sandbagging
- OpenAI: Chain of Thought Monitorability (arXiv 2507.11473) --- 13 evaluations across 24 environments; CoT dramatically outperforms action-only monitoring
- OpenAI: Detecting Misbehavior in Frontier Reasoning Models --- Obfuscated reward hacking finding
- UK AISI: Auditing Games for Sandbagging Detection (arXiv 2512.07810) --- Black-box methods "very little success"; white-box more promising but fragile
- UK AISI: Auditing Games Blog --- Accessible summary of sandbagging detection results
Ecosystem
- METR: GPT-5 Evaluation Report --- Example third-party evaluation
- Apollo Research: Stress Testing Anti-Scheming Training --- Anti-scheming training results
- LiveCodeBench --- Contamination-free coding benchmark
- LiveBench --- Monthly-updated evaluation
- LMArena / Chatbot Arena --- Crowdsourced model comparison
AI Transition Model Context
Scalable eval approaches affect the Ai Transition Model primarily through:
| Parameter | Impact |
|---|---|
| Safety-Capability GapAi Transition Model ParameterSafety-Capability GapThis page contains no actual content - only a React component reference that dynamically loads content from elsewhere in the system. Cannot evaluate substance, methodology, or conclusions without t... | These tools aim to prevent the gap from widening by maintaining evaluation signal as capabilities grow |
| Human Oversight QualityAi Transition Model ParameterHuman Oversight QualityThis page contains only a React component placeholder with no actual content rendered. Cannot assess substance, methodology, or conclusions. | LLM-as-judge and production monitoring extend human oversight capacity; interpretability could provide fundamentally new oversight channels |
| Safety Culture StrengthAi Transition Model ParameterSafety Culture StrengthThis page contains only a React component import with no actual content displayed. Cannot assess the substantive content about safety culture strength in AI development. | Standardization via Inspect and cross-lab evaluation pilots (Anthropic-OpenAI) strengthen evaluation norms |