Longterm Wiki

[2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models

paper

Authors

Andy Zou·Zifan Wang·Nicholas Carlini·Milad Nasr·J. Zico Kolter·Matt Fredrikson

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Research paper presenting automated methods for generating universal adversarial attacks against aligned language models, demonstrating vulnerabilities in current alignment techniques and contributing to understanding of AI safety challenges.

Paper Details

Citations
0
Year
2023

Metadata

arXiv preprint · primary source

Abstract

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.
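
The objective described in the abstract can be written compactly. The notation below is an assumption introduced here for illustration, not quoted from the paper: $x^{(i)}$ are the $N$ training prompts, $y^{(i)}$ the corresponding affirmative target responses, $p_{\theta_j}$ the $M$ models used in the search, and $s$ a suffix of $k$ tokens from vocabulary $V$.

```latex
\[
s^\star \;=\; \arg\min_{s \in V^{k}} \;
\sum_{j=1}^{M} \sum_{i=1}^{N}
-\log p_{\theta_j}\!\left( y^{(i)} \,\middle|\, x^{(i)} \oplus s \right)
\]
```

Here $\oplus$ denotes concatenation. Summing over prompts is what makes a suffix universal across queries; summing over models (Vicuna-7B and 13B in the abstract) is what the authors rely on for transfer to other systems.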

Summary

This paper presents an automated method for generating adversarial suffixes that jailbreak aligned large language models. The method appends a suffix to a harmful query and searches, using a combination of greedy and gradient-based techniques, for the suffix that maximizes the probability of an affirmative response instead of a refusal. Suffixes are optimized over multiple prompts and multiple models (Vicuna-7B and 13B), so a single suffix works across many queries. The suffixes also transfer to models not used in the search, inducing objectionable outputs from the public interfaces to ChatGPT, Bard, and Claude and from open-source models such as LLaMA-2-Chat, Pythia, and Falcon; transfer success is reported to be much higher against GPT-based models. The results expose a gap in current safety alignment: aligned models are not necessarily robust to adversarially optimized inputs.
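
The interplay of "greedy" and "gradient-based" search mentioned in the summary can be illustrated on a toy problem. The sketch below is not the authors' implementation and involves no language model: the loss is a synthetic quadratic over made-up token embeddings, and every name in it (`toy_loss`, `one_hot_gradient`, `greedy_gradient_search`, `top_k`, `batch`) is hypothetical. It only shows the general pattern of using a gradient over one-hot token choices to shortlist candidate swaps, then greedily keeping the swap with the lowest exact loss.

```python
import numpy as np


def toy_loss(suffix, emb, weights, target):
    """Synthetic stand-in for a loss over discrete tokens (assumption):
    squared distance between a weighted sum of embeddings and a target."""
    mixed = (weights[:, None] * emb[suffix]).sum(axis=0)
    return float(np.sum((mixed - target) ** 2))


def one_hot_gradient(suffix, emb, weights, target):
    """Gradient of toy_loss w.r.t. a one-hot encoding of each position.
    Returns an array of shape (suffix_len, vocab_size)."""
    mixed = (weights[:, None] * emb[suffix]).sum(axis=0)
    residual = 2.0 * (mixed - target)               # shape (dim,)
    return weights[:, None] * (emb @ residual)[None, :]


def greedy_gradient_search(emb, weights, target, steps=50, top_k=8,
                           batch=32, seed=0):
    """Shortlist swaps by gradient, then greedily keep the best exact one."""
    rng = np.random.default_rng(seed)
    vocab_size, suffix_len = emb.shape[0], weights.shape[0]
    suffix = rng.integers(0, vocab_size, size=suffix_len)
    history = [toy_loss(suffix, emb, weights, target)]
    for _ in range(steps):
        grad = one_hot_gradient(suffix, emb, weights, target)
        # Gradient part: per position, tokens with the most negative gradient.
        shortlist = np.argsort(grad, axis=1)[:, :top_k]
        best_cand, best_loss = suffix, history[-1]
        # Greedy part: score random single-token swaps with the exact loss.
        for _ in range(batch):
            pos = rng.integers(0, suffix_len)
            cand = suffix.copy()
            cand[pos] = shortlist[pos, rng.integers(0, top_k)]
            cand_loss = toy_loss(cand, emb, weights, target)
            if cand_loss < best_loss:
                best_cand, best_loss = cand, cand_loss
        suffix = best_cand          # unchanged if no swap improved the loss
        history.append(best_loss)
    return suffix, history


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    emb = rng.normal(size=(50, 8))          # 50 toy tokens, 8-dim embeddings
    weights = np.linspace(0.5, 1.5, 6)      # suffix of 6 positions
    target = rng.normal(size=8)
    suffix, history = greedy_gradient_search(emb, weights, target, steps=20)
    print("initial loss:", history[0], "final loss:", history[-1])
```

The gradient is only a linear approximation of how a token swap changes the loss, which is why the sketch re-evaluates each shortlisted candidate exactly before accepting it; the recorded loss therefore never increases from one step to the next.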

Cited by 3 pages

| Page | Type | Quality |
| --- | --- | --- |
| Adversarial Training | Approach | 58.0 |
| AI Alignment | Approach | 91.0 |
| AI Evaluation | Approach | 72.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB

# Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou¹,², Zifan Wang², Nicholas Carlini³, Milad Nasr³, J. Zico Kolter¹,⁴, Matt Fredrikson¹

¹Carnegie Mellon University, ²Center for AI Safety, ³Google DeepMind, ⁴Bosch Center for AI

###### Abstract

Because “out-of-the-box” large language models are capable of generating a great deal of objectionable content, recent work has focused on _aligning_ these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures—so-called “jailbreaks” against LLMs—these attacks have required significant human ingenuity and are brittle in practice. Attempts at _automatic_ adversarial prompt generation have also achieved limited success. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods.

Surprisingly, we find that the adversarial prompts generated by our approach are highly _transferable_, including to black-box, publicly released, production LLMs. Specifically, we train an adversarial attack suffix on _multiple_ prompts (i.e., queries asking for many different types of objectionable content), as well as _multiple_ models (in our case, Vicuna-7B and 13B). When doing so, _the resulting attack suffix induces objectionable content in the public interfaces to ChatGPT, Bard, and Claude_, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. Interestingly, the success rate of this attack transfer is much higher against the GPT-based models, potentially owing to the fact that Vicuna itself is trained on outputs from ChatGPT. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information.
Code is available at [github.com/llm-attacks/llm-attacks](https://github.com/llm-attacks/llm-attacks "").

## 1 Introduction

![Figure 1](https://ar5iv.labs.arxiv.org/html/2307.15043/assets/x1.png)

Figure 1: Aligned LLMs are not _adversarially_ aligned. Our attack constructs a single adversarial prompt that consistently circumvents the alignment of state-of-the-art commercial models including ChatGPT, Claude, Bard, and Llama-2 without having direct access to them. The examples shown here are all actual outputs of these systems. The adversarial prompt can elicit arbitrary harmful beha

... (truncated, 98 KB total)
Resource ID: 302c069146f3f6f2 | Stable ID: OTAwN2FkM2