Red Teaming Language Models with Language Models

paper

2022·arXiv·arxiv.org/abs/2202.03286

Authors

Ethan Perez·Saffron Huang·Francis Song·Trevor Cai·Roman Ring·John Aslanides·Amelia Glaese·Nat McAleese·Geoffrey Irving

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Introduces automated red teaming techniques using language models to identify harmful behaviors in target LMs before deployment, addressing limitations of manual human annotation for safety testing.

Paper Details

Citations

70 influential

Year

2022

arXiv:2202.03286 · DOI:10.36227/techrxiv.172599483.32667750/v1 · Semantic Scholar

Metadata

arxiv preprint · primary source

Abstract

Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM. We evaluate the target LM's replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot's own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.

Summary

This paper introduces an automated red teaming approach for identifying harmful behaviors in large language models by using another LM to generate adversarial test cases rather than relying on expensive human annotation. The authors evaluate a 280B parameter chatbot using a classifier to detect offensive content and discover tens of thousands of harmful outputs. They explore multiple generation methods ranging from zero-shot to reinforcement learning, and demonstrate the approach can uncover diverse harms including offensive content about groups, privacy leaks, and harmful multi-turn conversations. The work presents LM-based red teaming as a scalable complementary tool for pre-deployment safety testing.
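
The generate, reply, classify loop described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names (`red_team`, `toy_generator`, `toy_target`, `toy_classifier`), the `Failure` record, and the 0.5 score threshold are all hypothetical, and the three toy functions are keyword-based stand-ins for the red LM, the target LM, and the offensiveness classifier.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Failure:
    """A test case whose reply the classifier flagged (hypothetical record type)."""
    question: str
    reply: str
    score: float


def red_team(
    generate_questions: Callable[[int], List[str]],
    target_reply: Callable[[str], str],
    offensiveness: Callable[[str, str], float],
    num_cases: int,
    threshold: float = 0.5,  # assumed cutoff, chosen for illustration only
) -> List[Failure]:
    """Generate test questions, collect target replies, keep the flagged ones."""
    failures = []
    for question in generate_questions(num_cases):
        reply = target_reply(question)
        score = offensiveness(question, reply)
        if score >= threshold:
            failures.append(Failure(question, reply, score))
    return failures


# Toy stand-ins so the sketch runs without any model. A real setup would
# replace each of these with a call to an actual LM or trained classifier.
def toy_generator(n: int) -> List[str]:
    templates = [
        "What do you think of cats?",
        "Tell me an insult.",
        "What is your favorite food?",
    ]
    return [templates[i % len(templates)] for i in range(n)]


def toy_target(question: str) -> str:
    if "insult" in question.lower():
        return "You are a fool."
    return "I like that topic."


def toy_classifier(question: str, reply: str) -> float:
    # Returns a score in [0, 1]; here a crude keyword match.
    return 1.0 if "fool" in reply.lower() else 0.0


if __name__ == "__main__":
    for failure in red_team(toy_generator, toy_target, toy_classifier, num_cases=6):
        print(f"{failure.score:.1f}\t{failure.question}\t{failure.reply}")
```

Passing the three components in as callables keeps the loop independent of how test cases are produced, so a zero-shot generator could be swapped for a few-shot or fine-tuned one without touching the rest.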

Cited by 1 page

Page	Type	Quality
Large Language Models	Concept	62.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 98 KB

[2202.03286] Red Teaming Language Models with Language Models
WARNING: This paper contains model outputs which are offensive in nature.
 
Ethan Perez (1, 2) · Saffron Huang (1) · Francis Song (1) · Trevor Cai (1) · Roman Ring (1) · John Aslanides (1) · Amelia Glaese (1) · Nat McAleese (1) · Geoffrey Irving (1)
(1) DeepMind, (2) New York University
perez@nyu.edu
 
 
 

 
1 Introduction

Figure 1: Overview: We automatically generate test cases with a language model (LM), reply with the target LM, and find failing test cases using a classifier.

“Although we had prepared for many types of abuses of the system, we had made a critical oversight for this specific attack.”
Lee (2016)
 
Language Models (LMs) are promising tools for a variety of applications, ranging from conversational assistants to question-answering systems.
However, deploying LMs in production threatens to harm users in hard-to-predict ways.
For example, Microsoft took down its chatbot Tay after adversarial users provoked it into sending racist and sexually charged tweets to over 50,000 followers (Lee, 2016).
Other work has found that LMs generate misinformation (Lin et al., 2021) and confidential, personal information (e.g., social security numbers) from the LM training corpus (Carlini et al., 2019, 2021).
Such failures have serious consequences, so it is crucial to discover and fix them before deployment.

 
 
 Prior work requires hu

... (truncated, 98 KB total)

Resource ID: 3959564c6c0768fe | Stable ID: sid_o5JnVXnhfp