"Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs"
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Relevant to AI safety researchers studying alignment robustness and adversarial attacks; demonstrates that current RLHF-style alignment may be structurally fragile and susceptible to targeted extraction and transfer attacks.
Paper Details
Metadata
Abstract
Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find that attacks mounted on the surrogate classifiers can be transferred to the LLM with high success. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70% with half the memory footprint and runtime -- a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is an effective and efficient means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks. The code is available at https://github.com/jcnf0/targeting-alignment.
Summary
This paper presents a method for extracting surrogate classifiers that approximate the internal safety mechanisms of aligned LLMs, using only 20-50% of model parameters while achieving >80% F1 agreement with the original model's refusal decisions. The extracted surrogates enable highly effective transfer attacks, achieving 70% success rates compared to 22% for direct attacks, exposing structural vulnerabilities in current alignment approaches.
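The core idea, building a candidate classifier from an intermediate slice of the model and attaching a small head that predicts refusal vs. compliance, can be illustrated with a minimal sketch. Everything here is a stand-in: synthetic Gaussian features play the role of hidden states taken partway through an aligned LLM, and a hand-rolled logistic-regression probe plays the role of the surrogate's classification head. This is not the paper's actual pipeline, only an illustration of the probe-on-intermediate-representations pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for hidden states captured at an intermediate layer of an aligned
# LLM (the paper's candidates use as little as 20-50% of the model; here the
# "hidden states" are synthetic Gaussian features -- purely illustrative).
d, n = 64, 400
refuse = rng.normal(loc=+0.5, scale=1.0, size=(n // 2, d))   # label 1: refusal
comply = rng.normal(loc=-0.5, scale=1.0, size=(n // 2, d))   # label 0: compliance
X = np.vstack([refuse, comply])
y = np.array([1] * (n // 2) + [0] * (n // 2))

# Train a linear probe (logistic regression via gradient descent) as the
# surrogate classifier's decision head.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted refusal probability
    g = p - y                               # gradient of the logistic loss
    w -= lr * (X.T @ g) / n
    b -= lr * g.mean()

pred = (X @ w + b) > 0
acc = (pred == y).mean()
print(f"probe agreement with labels: {acc:.2f}")
```

In the paper's setting, agreement is measured against the full model's own refusal decisions rather than ground-truth labels, and the surrogate inherits the model's actual layers rather than a fresh probe on synthetic features.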
Key Points
- Surrogate safety classifiers can be extracted from subsets (20-50%) of LLM parameters while maintaining >80% F1 agreement with the full model's safety decisions.
- Transfer attacks using extracted surrogate classifiers achieve ~70% success rate vs. ~22% for direct attacks, demonstrating significant transferability of adversarial examples.
- The method treats LLM alignment as a localized classification problem, enabling systematic analysis of where and how safety decisions are encoded in the model architecture.
- Findings reveal that alignment mechanisms are concentrated in identifiable substructures, making them potentially more vulnerable to targeted adversarial exploitation.
- The approach has dual-use implications: useful for robustness evaluation and red-teaming, but also for informing stronger alignment methods that distribute safety reasoning more broadly.
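The F1 agreement metric cited above compares the surrogate's refusal/compliance predictions against the full model's decisions, treating the full model's output as the reference label. A short self-contained sketch (the toy label vectors below are invented for illustration, not data from the paper):

```python
def f1_agreement(model_labels, surrogate_labels):
    """F1 of surrogate predictions, with the full model's decisions as reference.
    Label convention: 1 = refusal, 0 = compliance."""
    pairs = list(zip(model_labels, surrogate_labels))
    tp = sum(1 for m, s in pairs if m == 1 and s == 1)
    fp = sum(1 for m, s in pairs if m == 0 and s == 1)
    fn = sum(1 for m, s in pairs if m == 1 and s == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Hypothetical refusal decisions on eight prompts.
model     = [1, 1, 0, 0, 1, 0, 1, 1]
surrogate = [1, 0, 0, 0, 1, 0, 1, 1]
print(f"F1 = {f1_agreement(model, surrogate):.2f}")  # prints: F1 = 0.89
```

An F1 above 0.8 on held-out prompts would indicate, in this framing, that the surrogate tracks the full model's safety decisions closely enough to serve as an attack target.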
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Alignment | Approach | 91.0 |
Cached Content Preview
[2501.16534] Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
Jean-Charles Noirot Ferrand
University of Wisconsin-Madison
jcnf@cs.wisc.edu
Yohan Beugin
University of Wisconsin-Madison
ybeugin@cs.wisc.edu
Eric Pauley
University of Wisconsin-Madison
epauley@cs.wisc.edu
Ryan Sheatsley
University of Wisconsin-Madison
sheatsley@cs.wisc.edu
Patrick McDaniel
University of Wisconsin-Madison
mcdaniel@cs.wisc.edu
Abstract
Alignment in large language models (LLMs) is used to enforce guidelines such as safety.
Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs.
In this paper, we present and evaluate a method to assess the robustness of LLM alignment. We observe that alignment embeds a safety classifier in the target model that is responsible for deciding between refusal and compliance. We seek to extract an approximation of this classifier, called a surrogate classifier, from the LLM.
We develop an algorithm for identifying candidate classifiers from subsets of the LLM model. We evaluate the degree to which the candidate classifiers approximate the model's embedded classifier in benign (F1 score) and adversarial (using surrogates in a white-box attack) settings.
Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find attacks mounted on the surrogate models can be transferred with high accuracy. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70%, a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR.
These results show that extracting surrogate classifiers is a viable (and highly effective) means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks.
1 Introduction
Improvements in machine learning, specifically the introduction of the transformer architecture [44], have led to increasingly powerful large language models (LLMs). Whether they are accessed through an API or interface (e.g., GPT [32], Gemini [39], Claude [1]) or published as open-source (e.g., Llama 2 [29], Llama 3 [30], Qwen [43], etc.), these models have become the de facto tool for tasks involving natural language. They serve as foundations for new tools, in which they are augmented with new capabilities [2, 31] or fine-tuned on a downstream task [14].
In adapting these models to tasks, LLMs undergo an alignment process in which they are further trained (refined) to satisfy safety objectives [29, 36]. For example, alignment guidelines often prevent a model from outputting responses that would be offensive, discriminatory, or harmful (i.e., unsafe). It is ackno
... (truncated, 79 KB total)