Longterm Wiki

Tice et al. 2024

paper

Authors

Cameron Tice·Philipp Alexander Kreer·Nathan Helm-Burger·Prithviraj Singh Shahani·Fedor Ryzhenkov·Fabien Roger·Clement Neo·Jacob Haimes·Felix Hofstätter·Teun van der Weij

Credibility Rating

3/5 (Good)

Good quality: a reputable source with community review or editorial standards, though less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper addresses sandbagging—where AI models deliberately underperform during evaluations—and proposes detection methods, which is critical for ensuring reliable capability assessments of frontier AI systems.

Paper Details

Citations
9
0 influential
Year
2024

Metadata

arXiv preprint · primary source

Abstract

Capability evaluations play a crucial role in assessing and regulating frontier AI systems. The effectiveness of these evaluations faces a significant challenge: strategic underperformance, or ``sandbagging'', where models deliberately underperform during evaluation. Sandbagging can manifest either through explicit developer intervention or through unintended model behavior, presenting a fundamental obstacle to accurate capability assessment. We introduce a novel sandbagging detection method based on injecting noise of varying magnitudes into model weights. While non-sandbagging models show predictable performance degradation with increasing noise, we demonstrate that sandbagging models exhibit anomalous performance improvements, likely due to disruption of underperformance mechanisms while core capabilities remain partially intact. Through experiments across various model architectures, sizes, and sandbagging techniques, we establish this distinctive response pattern as a reliable, model-agnostic signal for detecting sandbagging behavior. Importantly, we find noise-injection is capable of eliciting the full performance of Mistral Large 120B in a setting where the model underperforms without being instructed to do so. Our findings provide a practical tool for AI evaluation and oversight, addressing a challenge in ensuring accurate capability assessment of frontier AI systems.

Summary

This paper addresses the problem of sandbagging—where AI models deliberately or unintentionally underperform during capability evaluations—by proposing a novel detection method based on weight noise injection. The authors demonstrate that sandbagging models exhibit anomalous performance improvements when subjected to increasing levels of noise, in contrast to non-sandbagging models which show predictable degradation. This distinctive response pattern serves as a model-agnostic signal for detecting sandbagging across various architectures and sizes. The method also shows promise in eliciting full performance from models that underperform without explicit instruction, providing a practical tool for AI evaluation and oversight.
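The core mechanism can be sketched in a few lines. The function names, the use of NumPy arrays as stand-in weight tensors, and the averaging over random seeds are illustrative assumptions, not the paper's exact implementation; a real run would perturb a transformer's parameter tensors and score it on a benchmark such as MMLU.

```python
import numpy as np

def perturb_weights(weights, sigma, rng):
    # Return a copy of each weight tensor with i.i.d. Gaussian noise
    # of standard deviation sigma added elementwise.
    return [w + rng.normal(0.0, sigma, size=w.shape) for w in weights]

def noise_sweep(evaluate, weights, sigmas, seeds=(0, 1, 2)):
    # Score the perturbed model at each noise magnitude, averaging
    # over a few seeds to smooth out noise-draw variance.
    scores = []
    for sigma in sigmas:
        trials = [
            evaluate(perturb_weights(weights, sigma, np.random.default_rng(s)))
            for s in seeds
        ]
        scores.append(sum(trials) / len(trials))
    return scores
```

On this view, a non-sandbagging model's score curve should only degrade as sigma grows, while a sandbagging model's curve rises above its sigma = 0 baseline at intermediate noise levels before eventually collapsing.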

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| AI Capability Sandbagging | Risk | 67.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 55 KB
# Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models

Cameron Tice
Philipp Alexander Kreer
Nathan Helm-Burger
Prithviraj Singh Shahani
Fedor Ryzhenkov
Jacob Haimes
Felix Hofstätter
Teun van der Weij

###### Abstract

Capability evaluations play a critical role in ensuring the safe deployment of frontier AI systems, but this role may be undermined by intentional underperformance or “sandbagging.” We present a novel model-agnostic method for detecting sandbagging behavior using noise injection. Our approach is founded on the observation that introducing Gaussian noise into the weights of models either prompted or fine-tuned to sandbag can considerably improve their performance. We test this technique across a range of model sizes and multiple-choice question benchmarks (MMLU, AI2, WMDP). Our results demonstrate that noise injected sandbagging models show performance improvements compared to standard models. Leveraging this effect, we develop a classifier that consistently identifies sandbagging behavior. Our unsupervised technique can be immediately implemented by frontier labs or regulatory bodies with access to weights to improve the trustworthiness of capability evaluations.
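The abstract describes a classifier built on this effect; its exact form appears later in the paper. As a minimal sketch of the decision rule the abstract implies (flag a model whose accuracy under noise beats its noiseless baseline), with the `margin` threshold being my own assumption rather than a value from the paper:

```python
def flags_sandbagging(baseline_acc, noisy_accs, margin=0.05):
    # Flag a model as a sandbagging suspect if any noise level lifts
    # its benchmark accuracy more than `margin` above the unperturbed
    # baseline. A genuinely weak model only degrades under weight
    # noise, so an improvement of this size is anomalous.
    return max(noisy_accs) > baseline_acc + margin
```

Because the rule only compares the model against its own baseline, it needs no labeled examples of sandbagging, which is what makes the technique unsupervised.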

AI Safety, AI Governance, Sandbagging, Capability Evaluations, Noise Injection

## 1 Introduction

Capability evaluations are central to ensuring the safe deployment and continued development of frontier AI systems. More precisely, capability evaluations are used by AI companies (DeepMind, [2024](https://ar5iv.labs.arxiv.org/html/2412.01784#bib.bib18 ""); OpenAI, [2023](https://ar5iv.labs.arxiv.org/html/2412.01784#bib.bib34 ""); Anthropic, [2023](https://ar5iv.labs.arxiv.org/html/2412.01784#bib.bib4 "")), and governments (NIST, [2024](https://ar5iv.labs.arxiv.org/html/2412.01784#bib.bib32 ""); UK AISI, [2024](https://ar5iv.labs.arxiv.org/html/2412.01784#bib.bib41 ""); European Parliament, [2024](https://ar5iv.labs.arxiv.org/html/2412.01784#bib.bib20 "")) to assess the risk level of models. These risk assessments guide the formulation of policies concerning the development, deployment, and monitoring of AI systems (Shenk, [2024](https://ar5iv.labs.arxiv.org/html/2412.01784#bib.bib40 "")).

Despite their importance, capability evaluations face a fundamental challenge: current incentive structures (UK Department for Science, [2024](https://ar5iv.labs.arxiv.org/html/2412.01784#bib.bib42 "")) may encourage suppressing capabilities to avoid deployment-restricting regulation, a behavior known as sandbagging. We define sandbagging as strategic underperformance during evaluation (van der Weij et al., [2024](https://ar5iv.labs.arxiv.org/html/2412.01784#bib.bib43 ""); Benton et al., [2024](https://ar5iv.labs.arxiv.org/html/2412.01784#bib.bib9 "")), where _strategic_ implies the underperformance was not accidental but initiated by developers of the AI system or the AI system itself, and _underperformance_ implies the capabilities exhibited during evaluation are

... (truncated, 55 KB total)
Resource ID: a0a238c7e56a938f | Stable ID: OTQ0ZWY5Mj