[2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models

paper

2023·arXiv·arxiv.org/abs/2307.15043

Authors

Andy Zou·Zifan Wang·Nicholas Carlini·Milad Nasr·J. Zico Kolter·Matt Fredrikson

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Research paper presenting automated methods for generating universal adversarial attacks against aligned language models, demonstrating vulnerabilities in current alignment techniques and contributing to understanding of AI safety challenges.

Paper Details

Citations

Year

2025

DOI:10.2139/ssrn.5445960

Metadata

arxiv preprint·primary source

Abstract

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.
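One way to write down the objective the abstract describes is sketched below. The notation is an assumption of this note, not taken from the paper: x_i are the harmful queries, y_i the affirmative response beginnings being targeted, s a suffix of \ell tokens over vocabulary \mathcal{V}, \oplus concatenation, and \mathcal{M} the set of models optimized against (Vicuna-7B and 13B in the abstract). Minimizing the summed negative log-likelihood is equivalent to maximizing the probability of the affirmative responses across all prompts and models at once.

```latex
\min_{s \in \mathcal{V}^{\ell}} \;
  \sum_{m \in \mathcal{M}} \sum_{i=1}^{n}
  -\log p_m\!\left( y_i \mid x_i \oplus s \right)
```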

Summary

This paper presents an automated method for generating adversarial suffixes that jailbreak aligned large language models, causing them to produce objectionable content. Rather than relying on manual engineering, the approach combines greedy and gradient-based search to find a suffix that, when appended to a wide range of harmful queries, maximizes the probability of an affirmative response rather than a refusal. Suffixes trained on multiple prompts and multiple models (Vicuna-7B and 13B) transfer to black-box systems: they induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as in open-source models such as LLaMA-2-Chat, Pythia, and Falcon. Transfer success is much higher against GPT-based models, which the authors suggest may be because Vicuna is itself trained on ChatGPT outputs. The work advances the state of the art in adversarial attacks against aligned LLMs and raises the question of how such systems can be prevented from producing objectionable information.
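The "greedy and gradient-based search" idea can be illustrated on a toy problem. The sketch below is not the authors' implementation and involves no language model: the loss is a made-up table of unary and pairwise token costs, and every name and constant (VOCAB, SUFFIX_LEN, UNARY, PAIR, search, top_k, batch) is hypothetical. It only shows the general loop shape: use the gradient with respect to one-hot token choices to shortlist replacements, evaluate a random batch of single-token swaps exactly, and greedily keep the best.

```python
import random

VOCAB = 20      # toy vocabulary size (hypothetical)
SUFFIX_LEN = 6  # number of suffix positions to optimize (hypothetical)

_rng = random.Random(0)
# Toy stand-in for a model: per-position token costs and pairwise costs.
UNARY = [[_rng.random() for _ in range(VOCAB)] for _ in range(SUFFIX_LEN)]
PAIR = [[_rng.random() for _ in range(VOCAB)] for _ in range(VOCAB)]


def loss(suffix):
    """Toy loss; lower plays the role of 'target response more likely'."""
    total = sum(UNARY[p][t] for p, t in enumerate(suffix))
    total += sum(PAIR[a][b] for a, b in zip(suffix, suffix[1:]))
    return total


def token_gradient(suffix, pos):
    """Gradient of the toy loss w.r.t. the one-hot token vector at `pos`."""
    grad = []
    for v in range(VOCAB):
        g = UNARY[pos][v]
        if pos > 0:
            g += PAIR[suffix[pos - 1]][v]
        if pos < len(suffix) - 1:
            g += PAIR[v][suffix[pos + 1]]
        grad.append(g)
    return grad


def search(steps=30, top_k=4, batch=16, seed=1):
    """Gradient-guided greedy search over discrete suffix tokens."""
    local = random.Random(seed)
    suffix = [local.randrange(VOCAB) for _ in range(SUFFIX_LEN)]
    history = [loss(suffix)]
    for _ in range(steps):
        # Gradient step: shortlist the top_k lowest-gradient tokens per position.
        shortlist = []
        for pos in range(SUFFIX_LEN):
            grad = token_gradient(suffix, pos)
            shortlist.append(sorted(range(VOCAB), key=grad.__getitem__)[:top_k])
        # Greedy step: score a random batch of single-token swaps exactly
        # and keep the best one only if it improves on the current suffix.
        best, best_loss = suffix, history[-1]
        for _ in range(batch):
            pos = local.randrange(SUFFIX_LEN)
            cand = list(suffix)
            cand[pos] = local.choice(shortlist[pos])
            cand_loss = loss(cand)
            if cand_loss < best_loss:
                best, best_loss = cand, cand_loss
        suffix = best
        history.append(best_loss)
    return suffix, history
```

Calling `suffix, history = search()` returns the final token indices and the loss after each step. Because a swap is accepted only when it lowers the exact loss, `history` never increases; the gradient serves only to narrow the candidates that get evaluated.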

Cited by 3 pages

Page	Type	Quality
Adversarial Training	Approach	58.0
AI Alignment	Approach	91.0
AI Evaluation	Approach	72.0

Cached Content Preview

HTTP 200·Fetched Apr 9, 2026·86 KB

 Universal and Transferable Adversarial Attacks 
 on Aligned Language Models 

 
 
 Andy Zou 1,2, Zifan Wang 2, Nicholas Carlini 3, Milad Nasr 3, J. Zico Kolter 1,4, Matt Fredrikson 1
 1 Carnegie Mellon University, 2 Center for AI Safety, 3 Google DeepMind, 4 Bosch Center for AI
 
 
 

 
 Abstract

 Because “out-of-the-box” large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures—so-called “jailbreaks” against LLMs—these attacks have required significant human ingenuity and are brittle in practice. Attempts at automatic adversarial prompt generation have also achieved limited success. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods.

 Surprisingly, we find that the adversarial prompts generated by our approach are highly transferable, including to black-box, publicly released, production LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix induces objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. Interestingly, the success rate of this attack transfer is much higher against the GPT-based models, potentially owing to the fact that Vicuna itself is trained on outputs from ChatGPT. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information.
 Code is available at github.com/llm-attacks/llm-attacks.

 
 
 
 
 1 Introduction

 
 Figure 1 : Aligned LLMs are not adversarially aligned. Our attack constructs a single adversarial prompt that consistently circumvents the alignment of state-of-the-art commercial models including ChatGPT, Claude, Bard, and Llama-2 without having direct access to them. The examples shown here are all actual outputs of these systems. The adversarial prompt can 

... (truncated, 86 KB total)

Resource ID: 302c069146f3f6f2 | Stable ID: sid_hOtrnDTni4