Best-of-N Jailbreaking, NeurIPS 2025 Poster (https://neurips.cc/virtual/2025/poster/119576)
Web Credibility Rating
Gold standard. Rigorous peer review, high editorial standards, and strong institutional reputation.
Rating inherited from publication venue: NeurIPS
A NeurIPS 2025 paper demonstrating that frontier LLMs can be reliably jailbroken using a simple black-box sampling strategy, with significant implications for evaluating the robustness of current safety training methods.
Metadata
Summary
This NeurIPS 2025 paper introduces 'Best-of-N Jailbreaking,' a simple black-box attack method that repeatedly samples augmented versions of a prompt until a jailbreak succeeds, demonstrating surprisingly high attack success rates against frontier AI models including those with safety training. The work highlights a fundamental vulnerability in probabilistic safety mechanisms where repeated querying can circumvent alignment defenses.
Key Points
- Best-of-N jailbreaking works by randomly augmenting a prompt (e.g., shuffling or re-capitalizing characters) and resampling until a harmful response is elicited.
- The attack achieves high success rates against major frontier models within a budget of 10,000 augmented prompts: 89% on GPT-4o and 78% on Claude 3.5 Sonnet, with modality-specific variants reaching audio models like Gemini 1.5 Pro.
- Probabilistic safety guardrails can be systematically bypassed through simple repeated sampling, without access to model weights or gradients.
- The simplicity of the attack raises concerns about the robustness of current RLHF and safety fine-tuning approaches.
- Even small non-zero probabilities of generating harmful outputs can be exploited at scale, since success probability compounds over repeated queries.
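The sampling loop described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: `query_model` and `is_harmful` are stand-ins for a real model API call and a harm classifier, and the augmentations are a simplified version of the character-level shuffling and capitalization the paper describes.

```python
import random

def augment(prompt: str, shuffle_frac: float = 0.2) -> str:
    """Apply BoN-style text augmentations: random per-character
    capitalization plus random swaps of adjacent characters."""
    chars = [c.upper() if random.random() < 0.5 else c.lower() for c in prompt]
    for i in range(len(chars) - 1):
        if random.random() < shuffle_frac:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def best_of_n_jailbreak(prompt, query_model, is_harmful, n=10_000):
    """Resample augmented prompts until the judge flags a harmful
    response or the query budget n is exhausted."""
    for _ in range(n):
        response = query_model(augment(prompt))
        if is_harmful(response):
            return response  # attack succeeded
    return None  # attack failed within budget
```

Note that the attack needs only input/output access, which is why the paper classifies it as black-box: no gradients, logits, or weights are involved.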
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
Cached Content Preview
NeurIPS Poster: Best-of-N Jailbreaking — NeurIPS 2025, San Diego / Mexico City
Poster session: Fri, Dec 5, 2025, 11:00 AM – 2:00 PM PST, Exhibit Hall C,D,E #3913
Authors: John Hughes · Sara Price · Aengus Lynch · Rylan Schaeffer · Fazl Barez · Arushi Somani · Sanmi Koyejo · Henry Sleight · Erik Jones · Ethan Perez · Mrinank Sharma
Links: Project Page [ Poster ] [ OpenReview ]

Abstract: We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations---such as random shuffling or capitalization for textual prompts---until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers and reasoning models like o1. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks---combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.
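The abstract's observation that ASR grows reliably with N can be illustrated with a back-of-the-envelope model (an assumption of this note, not the paper's analysis): if each augmented prompt independently elicited a harmful response with probability p, the success rate after N samples would be 1 - (1-p)^N. The paper actually reports power-law-like ASR growth, which this i.i.d. model does not capture; the sketch only shows why any non-zero per-sample success probability becomes exploitable at scale.

```python
def asr_independent(p: float, n: int) -> float:
    """ASR under a naive model where each of n augmented prompts
    independently succeeds with probability p (not the paper's
    observed power law; an illustrative assumption only)."""
    return 1.0 - (1.0 - p) ** n

# A tiny per-sample success probability compounds quickly at the
# paper's budget of 10,000 samples:
print(asr_independent(1e-3, 10_000))  # ≈ 0.99995
```

Under this toy model, even p = 0.001 per sample yields near-certain success at N = 10,000, which is consistent with the intuition that safety training must drive harmful-output probabilities to effectively zero, not merely make them small.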