"Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs"
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Relevant to AI safety researchers studying alignment robustness and adversarial attacks; demonstrates that current RLHF-style alignment may be structurally fragile and susceptible to targeted extraction and transfer attacks.
Paper Details
Metadata
Abstract
Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find that attacks mounted on the surrogate classifiers can be transferred to the LLM with high success. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70% with half the memory footprint and runtime -- a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is an effective and efficient means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks. The code is available at https://github.com/jcnf0/targeting-alignment.
Summary
This paper presents a method for extracting surrogate classifiers that approximate the internal safety mechanisms of aligned LLMs, using only 20-50% of model parameters while achieving >80% F1 agreement with the original model's refusal decisions. The extracted surrogates enable highly effective transfer attacks, achieving 70% success rates compared to 22% for direct attacks, exposing structural vulnerabilities in current alignment approaches.
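The core idea, building a candidate classifier from an intermediate slice of the model and attaching a small head that predicts refusal vs. compliance, can be illustrated with a minimal sketch. Everything here is a stand-in: synthetic Gaussian features play the role of hidden states taken partway through an aligned LLM, and a hand-rolled logistic-regression probe plays the role of the surrogate's classification head. This is not the paper's actual pipeline, only an illustration of the probe-on-intermediate-representations pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for hidden states captured at an intermediate layer of an aligned
# LLM (the paper's candidates use as little as 20-50% of the model; here the
# "hidden states" are synthetic Gaussian features -- purely illustrative).
d, n = 64, 400
refuse = rng.normal(loc=+0.5, scale=1.0, size=(n // 2, d))   # label 1: refusal
comply = rng.normal(loc=-0.5, scale=1.0, size=(n // 2, d))   # label 0: compliance
X = np.vstack([refuse, comply])
y = np.array([1] * (n // 2) + [0] * (n // 2))

# Train a linear probe (logistic regression via gradient descent) as the
# surrogate classifier's decision head.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted refusal probability
    g = p - y                               # gradient of the logistic loss
    w -= lr * (X.T @ g) / n
    b -= lr * g.mean()

pred = (X @ w + b) > 0
acc = (pred == y).mean()
print(f"probe agreement with labels: {acc:.2f}")
```

In the paper's setting, agreement is measured against the full model's own refusal decisions rather than ground-truth labels, and the surrogate inherits the model's actual layers rather than a fresh probe on synthetic features.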
Key Points
- Surrogate safety classifiers can be extracted from subsets (20-50%) of LLM parameters while maintaining >80% F1 agreement with the full model's safety decisions.
- Transfer attacks using extracted surrogate classifiers achieve ~70% success rate vs. ~22% for direct attacks, demonstrating significant transferability of adversarial examples.
- The method treats LLM alignment as a localized classification problem, enabling systematic analysis of where and how safety decisions are encoded in the model architecture.
- Findings reveal that alignment mechanisms are concentrated in identifiable substructures, making them potentially more vulnerable to targeted adversarial exploitation.
- The approach has dual-use implications: useful for robustness evaluation and red-teaming, but also for informing stronger alignment methods that distribute safety reasoning more broadly.
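The F1 agreement metric cited above compares the surrogate's refusal/compliance predictions against the full model's decisions, treating the full model's output as the reference label. A short self-contained sketch (the toy label vectors below are invented for illustration, not data from the paper):

```python
def f1_agreement(model_labels, surrogate_labels):
    """F1 of surrogate predictions, with the full model's decisions as reference.
    Label convention: 1 = refusal, 0 = compliance."""
    pairs = list(zip(model_labels, surrogate_labels))
    tp = sum(1 for m, s in pairs if m == 1 and s == 1)
    fp = sum(1 for m, s in pairs if m == 0 and s == 1)
    fn = sum(1 for m, s in pairs if m == 1 and s == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Hypothetical refusal decisions on eight prompts.
model     = [1, 1, 0, 0, 1, 0, 1, 1]
surrogate = [1, 0, 0, 0, 1, 0, 1, 1]
print(f"F1 = {f1_agreement(model, surrogate):.2f}")  # prints: F1 = 0.89
```

An F1 above 0.8 on held-out prompts would indicate, in this framing, that the surrogate tracks the full model's safety decisions closely enough to serve as an attack target.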
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Alignment | Approach | 91.0 |
Cached Content Preview
[2501.16534] Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
Jean-Charles Noirot Ferrand
University of Wisconsin-Madison
jcnf@cs.wisc.edu
Yohan Beugin
University of Wisconsin-Madison
ybeugin@cs.wisc.edu
Eric Pauley
University of Wisconsin-Madison
epauley@cs.wisc.edu
Ryan Sheatsley
University of Wisconsin-Madison
sheatsley@cs.wisc.edu
Patrick McDaniel
University of Wisconsin-Madison
mcdaniel@cs.wisc.edu
Abstract
Alignment in large language models (LLMs) is used to enforce guidelines such as safety.
Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs.
In this paper, we present and evaluate a method to assess the robustness of LLM alignment. We observe that alignment embeds a safety classifier in the target model that is responsible for deciding between refusal and compliance. We seek to extract an approximation of this classifier, called a surrogate classifier, from the LLM.
We develop an algorithm for identifying candidate classifiers from subsets of the LLM model. We evaluate the degree to which the candidate classifiers approximate the model's embedded classifier in benign (F1 score) and adversarial (using surrogates in a white-box attack) settings.
Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find attacks mounted on the surrogate models can be transferred with high accuracy. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70%, a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR.
These results show that extracting surrogate classifiers is a viable (and highly effective) means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks.
1 Introduction
Improvements in machine learning, specifically the introduction of the transformer architecture [44], have led to increasingly powerful large language models (LLMs). Whether they are accessed through an API or interface (e.g., GPT [32], Gemini [39], Claude [1]) or published as open-source (e.g., Llama 2 [29], Llama 3 [30], Qwen [43], etc.), these models have become the de facto tool for tasks involving natural language. They serve as foundations for new tools, in which they are augmented with new capabilities [2, 31] or fine-tuned on a downstream task [14].
In adapting these models to tasks, LLMs undergo an alignment process in which they are further trained (refined) to satisfy safety objectives [29, 36]. For example, alignment guidelines often prevent a model from outputting responses that would be offensive, discriminatory, or harmful (i.e., unsafe). It is ackno
... (truncated, 79 KB total)