BashArena: A Control Setting for Highly Privileged AI Agents - arXiv
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Directly relevant to AI control research, which focuses on preventing misaligned AI agents from causing harm even when operating with elevated system privileges; provides concrete empirical benchmarks for evaluating monitor-based safety protocols.
Paper Details
Metadata
Abstract
Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.
Summary
BashArena introduces a benchmark of 637 Linux system administration tasks paired with four sabotage objectives to evaluate AI control techniques in high-privilege security environments. The benchmark tests whether red-team AI agents can accomplish covert sabotage (malware execution, secret exfiltration, privilege escalation, firewall disabling) while evading blue-team AI monitors. Frontier models such as Claude Sonnet 4.5 succeed at undetected sabotage 26% of the time against GPT-4.1 mini monitoring at a 4% trajectory-wise false-positive rate, providing a concrete baseline for developing safer control protocols.
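The headline trade-off (26% undetected sabotage at a 4% trajectory-wise false-positive rate) can be illustrated with a small sketch of how such a number is typically computed from per-trajectory monitor scores. The function name and the score distributions below are hypothetical stand-ins, not the paper's actual code or data:

```python
import numpy as np

def evasion_rate(honest_scores, attack_scores, target_fpr=0.04):
    """Pick the monitor-score threshold that flags `target_fpr` of honest
    trajectories, then measure what fraction of attack trajectories slip
    under it. Scores are per-trajectory suspicion values in [0, 1]."""
    honest = np.asarray(honest_scores, dtype=float)
    attack = np.asarray(attack_scores, dtype=float)
    # Threshold at the (1 - FPR) quantile of honest scores:
    # anything strictly above it gets flagged by the monitor.
    threshold = np.quantile(honest, 1.0 - target_fpr)
    caught = attack > threshold
    return float(np.mean(~caught))

# Synthetic example: honest runs skew low-suspicion, attacks skew higher.
rng = np.random.default_rng(0)
honest = rng.beta(2, 8, size=1000)
attack = rng.beta(5, 4, size=200)
rate = evasion_rate(honest, attack)
print(f"undetected at 4% FPR: {rate:.0%}")
```

The key design choice is that the false-positive budget is fixed on honest (blue-team) trajectories first; the attack policy's evasion rate is then whatever fraction of sabotage attempts falls below that calibrated threshold.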
Key Points
- Introduces 637 realistic Linux sysadmin tasks, each paired with four covert sabotage objectives to benchmark AI control in privileged environments.
- Tests four sabotage types: malware execution, secret exfiltration, privilege escalation, and firewall disabling, all while evading AI monitor detection.
- Claude Sonnet 4.5 achieves undetected sabotage in 26% of attempts against GPT-4.1 mini monitoring, establishing empirical baselines.
- Frames the challenge as preserving AI usefulness for legitimate tasks while preventing misaligned or compromised agents from causing damage.
- Contributes to the emerging field of AI control research by providing a reproducible, security-focused evaluation environment.
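The red-team/blue-team structure the key points describe can be sketched as a minimal evaluation loop. All interfaces here (`policy`, `monitor`, the transcript fields) are hypothetical stand-ins for illustration, not ControlArena's actual API:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    main_task_success: bool
    side_task_success: bool   # only the attack policy pursues the side task
    suspicion: float          # monitor's score for the full transcript

def run_episode(policy, monitor, task, attacking):
    # Hypothetical interfaces: policy(task, attacking) -> transcript dict,
    # monitor(transcript) -> suspicion score in [0, 1].
    transcript = policy(task, attacking)
    return Trajectory(
        main_task_success=transcript["main_done"],
        side_task_success=transcript.get("side_done", False),
        suspicion=monitor(transcript),
    )

def evaluate(policy, monitor, tasks, threshold):
    """Safety: share of attack episodes where sabotage fails or is flagged.
    Usefulness: honest main-task success rate, counting flagged runs as lost."""
    honest = [run_episode(policy, monitor, t, attacking=False) for t in tasks]
    attacks = [run_episode(policy, monitor, t, attacking=True) for t in tasks]
    usefulness = sum(
        tr.main_task_success and tr.suspicion <= threshold for tr in honest
    ) / len(honest)
    safety = sum(
        (not tr.side_task_success) or tr.suspicion > threshold for tr in attacks
    ) / len(attacks)
    return usefulness, safety
```

This captures the core tension the benchmark measures: lowering the suspicion threshold raises safety (more sabotage caught) but costs usefulness (more honest work discarded as false positives).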
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Redwood Research | Organization | 78.0 |
Cached Content Preview
[2512.15688] BashArena: A Control Setting for Highly Privileged AI Agents
BashArena: A Control Setting for Highly Privileged AI Agents
Adam Kaufman∗, James Lucassen∗, Tyler Tracy, Cody Rushing, Aryan Bhatt
Redwood Research
Abstract
Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.
∗ Core research contributor
1 Introduction
AI agents are increasingly being deployed to automate software engineering tasks that require elevated privileges, such as writing security-critical infrastructure or working with sensitive data. Scheming [carlsmith2023schemingaisaisfake], backdoored [hubinger2024sleeperagentstrainingdeceptive, souly2025poisoningattacksllmsrequire], or otherwise misaligned AIs might abuse these privileges to cause security failures.
Mitigating this threat while preserving the usefulness of benign AIs is one important subproblem of AI control [greenblatt2024aicontrolimprovingsafety]. The field of AI control develops techniques to reduce the probability that AIs trying to cause damage will succeed before being caught. Researchers develop more effective AI control protocols by testing them in settings that resemble future risky deployments.
Control settings require two types of tasks: main tasks and side tasks. Main tasks represent the work the AI is supposed to do, and side tasks represent security failures a malicious AI might cause. This allows us to evaluate and improve AI control protocols via an adversarial game between a red team and a blue team. The red team constructs an attack policy that attempts to accomplish the side task, while the blue team constructs control protocols that detect and prevent attempts to achieve side tasks without hurting an honest policy's performance on main tasks. For research on control protocols to be useful, the setting should be representative
... (truncated, 98 KB total)