BashArena: A Control Setting for Highly Privileged AI Agents - arXiv
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Directly relevant to AI control research, which focuses on preventing misaligned AI agents from causing harm even when operating with elevated system privileges; provides concrete empirical benchmarks for evaluating monitor-based safety protocols.
Paper Details
Metadata
Abstract
Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.
Summary
BashArena introduces a benchmark of 637 Linux system administration tasks paired with four sabotage objectives to evaluate AI control techniques in high-privilege security environments. The benchmark tests whether red-team AI agents can accomplish covert sabotage (malware execution, secret exfiltration, privilege escalation, firewall disabling) while evading blue-team AI monitors. Frontier models such as Claude Sonnet 4.5 succeed at undetected sabotage 26% of the time against GPT-4.1 mini monitoring at a 4% trajectory-wise false-positive rate, providing a concrete baseline for developing safer control protocols.
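The headline trade-off (26% undetected sabotage at a 4% trajectory-wise false-positive rate) can be illustrated with a small sketch of how such a number is typically computed from per-trajectory monitor scores. The function name and the score distributions below are hypothetical stand-ins, not the paper's actual code or data:

```python
import numpy as np

def evasion_rate(honest_scores, attack_scores, target_fpr=0.04):
    """Pick the monitor-score threshold that flags `target_fpr` of honest
    trajectories, then measure what fraction of attack trajectories slip
    under it. Scores are per-trajectory suspicion values in [0, 1]."""
    honest = np.asarray(honest_scores, dtype=float)
    attack = np.asarray(attack_scores, dtype=float)
    # Threshold at the (1 - FPR) quantile of honest scores:
    # anything strictly above it gets flagged by the monitor.
    threshold = np.quantile(honest, 1.0 - target_fpr)
    caught = attack > threshold
    return float(np.mean(~caught))

# Synthetic example: honest runs skew low-suspicion, attacks skew higher.
rng = np.random.default_rng(0)
honest = rng.beta(2, 8, size=1000)
attack = rng.beta(5, 4, size=200)
rate = evasion_rate(honest, attack)
print(f"undetected at 4% FPR: {rate:.0%}")
```

The key design choice is that the false-positive budget is fixed on honest (blue-team) trajectories first; the attack policy's evasion rate is then whatever fraction of sabotage attempts falls below that calibrated threshold.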
Key Points
- Introduces 637 realistic Linux sysadmin tasks, each paired with four covert sabotage objectives to benchmark AI control in privileged environments.
- Tests four sabotage types: malware execution, secret exfiltration, privilege escalation, and firewall disabling, all while evading AI monitor detection.
- Claude Sonnet 4.5 achieves undetected sabotage in 26% of attempts against GPT-4.1 mini monitoring, establishing empirical baselines.
- Frames the challenge as preserving AI usefulness for legitimate tasks while preventing misaligned or compromised agents from causing damage.
- Contributes to the emerging field of AI control research by providing a reproducible, security-focused evaluation environment.
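The red-team/blue-team structure the key points describe can be sketched as a minimal evaluation loop. All interfaces here (`policy`, `monitor`, the transcript fields) are hypothetical stand-ins for illustration, not ControlArena's actual API:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    main_task_success: bool
    side_task_success: bool   # only the attack policy pursues the side task
    suspicion: float          # monitor's score for the full transcript

def run_episode(policy, monitor, task, attacking):
    # Hypothetical interfaces: policy(task, attacking) -> transcript dict,
    # monitor(transcript) -> suspicion score in [0, 1].
    transcript = policy(task, attacking)
    return Trajectory(
        main_task_success=transcript["main_done"],
        side_task_success=transcript.get("side_done", False),
        suspicion=monitor(transcript),
    )

def evaluate(policy, monitor, tasks, threshold):
    """Safety: share of attack episodes where sabotage fails or is flagged.
    Usefulness: honest main-task success rate, counting flagged runs as lost."""
    honest = [run_episode(policy, monitor, t, attacking=False) for t in tasks]
    attacks = [run_episode(policy, monitor, t, attacking=True) for t in tasks]
    usefulness = sum(
        tr.main_task_success and tr.suspicion <= threshold for tr in honest
    ) / len(honest)
    safety = sum(
        (not tr.side_task_success) or tr.suspicion > threshold for tr in attacks
    ) / len(attacks)
    return usefulness, safety
```

This captures the core tension the benchmark measures: lowering the suspicion threshold raises safety (more sabotage caught) but costs usefulness (more honest work discarded as false positives).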
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Redwood Research | Organization | 78.0 |
Cached Content Preview
[2512.15688] BashArena: A Control Setting for Highly Privileged AI Agents
BashArena: A Control Setting for Highly Privileged AI Agents
Adam Kaufman∗, James Lucassen∗, Tyler Tracy, Cody Rushing, Aryan Bhatt
Redwood Research
Abstract
Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.
∗ Core research contributor
1 Introduction
AI agents are increasingly being deployed to automate software engineering tasks that require elevated privileges, such as writing security-critical infrastructure or working with sensitive data. Scheming [carlsmith2023schemingaisaisfake], backdoored [hubinger2024sleeperagentstrainingdeceptive, souly2025poisoningattacksllmsrequire], or otherwise misaligned AIs might abuse these privileges to cause security failures.
Mitigating this threat while preserving the usefulness of benign AIs is one important subproblem of AI control [greenblatt2024aicontrolimprovingsafety]. The field of AI control develops techniques to reduce the probability that AIs trying to cause damage will succeed before being caught. Researchers develop more effective AI control protocols by testing them in settings that resemble future risky deployments.
Control settings require two types of tasks: main tasks and side tasks. Main tasks represent the work the AI is supposed to do, and side tasks represent security failures a malicious AI might cause. This allows us to evaluate and improve AI control protocols via an adversarial game between a red team and a blue team. The red team constructs an attack policy that attempts to accomplish the side task, while the blue team constructs control protocols that detect and prevent attempts to achieve side tasks without hurting an honest policy's performance on main tasks. For research on control protocols to be useful, the setting should be representative
... (truncated, 98 KB total)