Skip to content
Longterm Wiki
Back

van der Weij et al. (2024)

paper

Authors

Teun van der Weij·Felix Hofstätter·Ollie Jaffe·Samuel F. Brown·Francis Rhys Ward

Credibility Rating

3/5
Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper investigates sandbagging—strategic underperformance on AI capability evaluations—in contemporary language models, addressing a critical safety concern where AI developers or systems may intentionally understate capabilities to bypass safety checks or regulations.

Paper Details

Citations
0
5 influential
Year
2024

Metadata

arxiv preprintprimary source

Abstract

Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI's actual capability. These conflicting interests lead to the problem of sandbagging, which we define as strategic underperformance on an evaluation. In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted or password-locked to target specific scores on a capability evaluation. We have mediocre success in password-locking a model to mimic the answers a weaker model would give. Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.

Summary

This paper investigates sandbagging—strategic underperformance on evaluations—in contemporary large language models like GPT-4 and Claude 3 Opus. The authors demonstrate that frontier models can be prompted to selectively underperform on dangerous capability evaluations while maintaining performance on benign tasks, and can be fine-tuned to hide specific capabilities unless given a password. These capabilities generalize to high-quality benchmarks like WMDP. The findings reveal significant vulnerabilities in capability evaluations that undermine their trustworthiness and could compromise safety decisions in AI development and deployment.

Cited by 1 page

PageTypeQuality
AI Capability SandbaggingRisk67.0

Cached Content Preview

HTTP 200Fetched Mar 20, 202698 KB
# AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij

MATS

&Felix Hofstätter∗

MATS†

&Oliver Jaffe∗

Independent
&Samuel F. Brown

Independent
&Francis Rhys Ward

Imperial College London
Core contributors, see contributions in [Section9](https://ar5iv.labs.arxiv.org/html/2406.07358#S9 "9 Contributions ‣ AI Sandbagging: Language Models can Strategically Underperform on Evaluations"). Correspondence to mailvanteun@gmail.com.ML Alignment Theory Scholars

(May 2024)

###### Abstract

Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI’s actual capability. These conflicting interests lead to the problem of sandbagging – which we define as strategic underperformance on an evaluation. In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted, or password-locked, to target specific scores on a capability evaluation. Even more, we found that a capable password-locked model (Llama 3 70b) is reasonably able to emulate a less capable model (Llama 2 7b). Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.111We publish our code and results at [https://github.com/TeunvdWeij/sandbagging](https://github.com/TeunvdWeij/sandbagging "")

## 1 Introduction

Trustworthy evaluations are needed to understand AI systems and their rapidly improving capabilities (Shevlane et al.,, [2023](https://ar5iv.labs.arxiv.org/html/2406.07358#bib.bib63 "")). As such, evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. Frontier labs rely on evaluations to reduce catastrophic risks, as stated in the Preparedness Framework by OpenAI, ( [2023](https://ar5iv.labs.arxiv.org/html/2406.07358#bib.bib49 "")), the Frontier Safety Framework by Google DeepMind, ( [2024](https://ar5iv.labs.arxiv.org/html/2406.07358#bib.bib23 "")), and the Responsible Scaling Policy by Anthropic, ( [2023](https://ar5iv.labs.arxiv.org/html/2406.07358#bib.bib4 "")). Governmental institutions such as the US and UK AI Safety Institutes (NIST,, [2024](https://ar5iv.labs.arxiv.org/html/240

... (truncated, 98 KB total)
Resource ID: 28f9d1d93970a72e | Stable ID: NTg1YTZmYT