Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
A highly impactful empirical paper showing that current safety alignment in frontier LLMs remains fragile against adaptive adversarial attacks, with direct implications for the robustness of refusal training and safety evaluation methodologies.
Metadata
Summary
This ICLR 2025 paper demonstrates that safety-aligned LLMs, including GPT-4o, Claude, Llama-3, and Gemma, are vulnerable to simple adaptive jailbreaking attacks that achieve a 100% attack success rate using model-specific strategies, such as random search over adversarial suffixes (leveraging logprob access) or API-specific exploits like prefilling for Claude. The central insight is that adaptivity, i.e., tailoring attacks to each model's unique vulnerabilities, is the key to bypassing safety alignment. The same random-search technique also extends to trojan detection in poisoned models, winning first place at the SaTML'24 Trojan Detection Competition.
Key Points
- Achieves a 100% attack success rate on 11+ major LLMs, including GPT-4o, Claude, Llama-2/3, and Gemma, using simple adaptive jailbreak strategies.
- Key method: random search over adversarial suffixes that maximizes a target logprob (e.g., of the token 'Sure'), combined with model-specific prompt templates.
- Claude models (no logprob access) are broken via transfer attacks or API prefilling exploits, also at a 100% success rate.
- Adaptivity is crucial: different models have distinct vulnerabilities requiring tailored attack strategies rather than a one-size-fits-all approach.
- The same random search technique applies to trojan-string detection in poisoned models, winning first place in the SaTML'24 Trojan Detection Competition.
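The logprob-based attack in the key points above is plain random search over a suffix. Below is a minimal, runnable sketch of that loop; the `score` function here is a toy stand-in (it counts vowels) for the quantity the paper actually maximizes, namely the target model's logprob of the token 'Sure' as returned by the API, so treat it as an illustration of the search procedure only.

```python
import random
import string

def score(prompt: str) -> float:
    """Toy stand-in for the target model's logprob of the token 'Sure'.
    The real attack reads this value from the API's logprobs output;
    a vowel-counting objective keeps the sketch runnable offline."""
    return float(sum(ch in "aeiou" for ch in prompt))

def random_search_suffix(base_prompt: str, suffix_len: int = 25,
                         iters: int = 200, seed: int = 0):
    """Plain random search: mutate a few suffix characters per step and
    keep the mutation only if the objective improves (greedy accept)."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + "!@#$%&*"
    suffix = [rng.choice(alphabet) for _ in range(suffix_len)]
    best = score(base_prompt + "".join(suffix))
    for _ in range(iters):
        candidate = list(suffix)
        for pos in rng.sample(range(suffix_len), k=3):  # mutate 3 positions
            candidate[pos] = rng.choice(alphabet)
        s = score(base_prompt + "".join(candidate))
        if s > best:
            suffix, best = candidate, s
    return "".join(suffix), best

adv_suffix, final_score = random_search_suffix("<harmful request> ")
```

The paper additionally uses multiple restarts and model-specific prompt templates around the suffix; those are omitted here for brevity.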
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
Cached Content Preview
Published as a conference paper at ICLR 2025
JAILBREAKING LEADING SAFETY-ALIGNED LLMS
WITH SIMPLE ADAPTIVE ATTACKS
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion (EPFL)
ABSTRACT
We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize a target logprob (e.g., of the token "Sure"), potentially with multiple restarts. In this way, we achieve a 100% attack success rate, according to GPT-4 as a judge, on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2 from HarmBench, which was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models, which do not expose logprobs, via either a transfer or prefilling attack with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models (a task that shares many similarities with jailbreaking), which is the algorithm that brought us first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks.
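The prefilling attack mentioned in the abstract exploits an API feature rather than logprob access: the request ends with a partial assistant turn, so the model continues from the attacker-chosen opening instead of starting (and possibly refusing) its answer from scratch. A hedged sketch of the request shape follows; the dict mirrors the general shape of Anthropic's Messages API, but the model id and prompt are illustrative placeholders and no request is actually sent.

```python
def build_prefill_request(model: str, user_prompt: str, prefill: str) -> dict:
    """Sketch of a prefilling request: the final message has role
    'assistant', seeding the beginning of the model's reply.
    Shape loosely follows a Messages-style chat API; illustrative only."""
    return {
        "model": model,
        "max_tokens": 512,
        "messages": [
            {"role": "user", "content": user_prompt},
            # Prefilled beginning of the assistant's reply -- the model
            # is asked to continue from here.
            {"role": "assistant", "content": prefill},
        ],
    }

request = build_prefill_request("claude-3-opus",          # placeholder id
                                "<harmful request>",
                                "Sure, here is a step-by-step plan:")
```

The attack works because refusals are typically emitted at the start of a response; seeding a compliant opening such as "Sure, ..." bypasses that decision point.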
1 INTRODUCTION
The remarkable capabilities of Large Language Models (LLMs) carry the inherent risk of misuse, such as producing toxic content, spreading misinformation, or supporting harmful activities. To mitigate these risks, safety alignment or refusal training is commonly employed: a fine-tuning phase where models are guided to generate responses judged safe by humans and to refuse potentially harmful queries (Bai et al., 2022; Touvron et al., 2023). Although safety alignment is effective in general, several works have shown that it can be circumvented using adversarial prompts. These are inputs specifically designed to induce harmful responses from the model, a practice known as jailbreak attacks (Mowshowitz, 2022; Zou et al., 2023; Chao et al., 2023).

Jailbreak attacks vary in their knowledge of the target LLM (ranging from white- to black-box approaches, or API-only access), complexity (involving manual prompting, standard optimization techniques, or auxiliary LLMs), and computational cost. Moreover, the nature of the jailbreaks they produce differs:
... (truncated, 98 KB total)