
Language Models Resist Alignment (https://arxiv.org/abs/2406.06144)

paper

Authors

Jiaming Ji·Kaile Wang·Tianyi Qiu·Boyuan Chen·Jiayi Zhou·Changye Li·Hantao Lou·Juntao Dai·Yunhuai Liu·Yaodong Yang

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Highly relevant to AI safety debates about alignment robustness; provides theoretical grounding for the observed fragility of RLHF/SFT alignment, with implications for the difficulty of maintaining aligned behavior in deployed models.

Paper Details

Citations: 26 (4 influential)
Year: 2024

Metadata

Importance: 78/100 · arXiv preprint · primary source

Abstract

Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the $\mathbf{elasticity}$ of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment. The model weights and code are available at pku-lm-resist-alignment.github.io.

Summary

This paper investigates why alignment fine-tuning is fragile, demonstrating empirically and theoretically that LLMs exhibit 'elasticity'—a tendency to revert to pre-training behavior distributions when further fine-tuned. Using compression theory, the authors show that fine-tuning disproportionately undermines alignment compared to pre-training, with elasticity increasing with model size and pre-training data volume.
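One way to operationalize "reverting to the pre-training behavior distribution" is to track how far each fine-tuning checkpoint's next-token distribution sits from the original pre-trained model. The sketch below is our illustration, not the authors' released code; it assumes the Hugging Face transformers library, placeholder model names, and that all checkpoints share the base model's tokenizer:

```python
# Sketch: distance to the pre-training distribution across checkpoints.
# If elasticity holds, an aligned model fine-tuned further (even on benign
# data) should drift back toward the base model, shrinking this distance.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "base-pretrained-model"          # placeholder: pre-trained reference
CHECKPOINT = "aligned-then-finetuned"   # placeholder: checkpoint under test

def mean_next_token_kl(ckpt_name: str, base_name: str, prompts: list[str]) -> float:
    """Mean KL(checkpoint || base) over next-token distributions on probe prompts."""
    tok = AutoTokenizer.from_pretrained(base_name)
    ckpt = AutoModelForCausalLM.from_pretrained(ckpt_name).eval()
    base = AutoModelForCausalLM.from_pretrained(base_name).eval()
    kls = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            log_p = torch.log_softmax(ckpt(ids).logits[0, -1], dim=-1)
            log_q = torch.log_softmax(base(ids).logits[0, -1], dim=-1)
        kls.append(torch.sum(log_p.exp() * (log_p - log_q)).item())  # KL(p || q)
    return sum(kls) / len(kls)

# Usage: evaluate every N fine-tuning steps and plot the distance against
# steps; under elasticity the curve should fall rapidly, then flatten.
```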

Key Points

  • LLMs exhibit 'elasticity': post-alignment models tend to revert to pre-training behavior distributions upon further fine-tuning, even on benign data.
  • A compression-theoretic argument shows that fine-tuning undermines alignment disproportionately relative to pre-training, potentially by orders of magnitude (see the informal sketch after this list).
  • Elasticity positively correlates with model size and pre-training data scale, meaning larger models are harder to robustly align.
  • Model performance declines rapidly before reverting to pre-training distribution, after which the decline rate drops significantly.
  • Findings suggest current alignment methods (SFT, RLHF) may produce only superficial alignment, motivating research into more robust alignment techniques.
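The compression-theoretic claim in the list above admits a rough back-of-the-envelope form. What follows is an informal paraphrase of the intuition, not the paper's exact theorem; the symbols $n_{\mathrm{pre}}$ and $n_{\mathrm{align}}$ are our own notation for dataset sizes:

```latex
% Informal sketch (a paraphrase of the intuition, not the paper's exact statement).
% View the LLM as jointly compressing the pre-training corpus D_pre
% (n_pre tokens) and the alignment data D_align (n_align tokens), where
% n_align << n_pre. If an optimal code allocates model capacity roughly in
% proportion to data mass, a weight perturbation with a fixed
% description-length budget degrades the per-dataset compression rates R
% approximately in inverse proportion to dataset size:
\[
  \frac{\Delta R_{\mathrm{align}}}{\Delta R_{\mathrm{pre}}}
  \;\sim\; \frac{n_{\mathrm{pre}}}{n_{\mathrm{align}}} \;\gg\; 1,
\]
% i.e., the fit to the small alignment set is undermined disproportionately,
% "potentially by orders of magnitude" when the size ratio is large.
```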

Cited by 1 page

Page: Alignment Robustness Trajectory Model · Type: Analysis · Quality: 64.0

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 98 KB
Language Models Resist Alignment

Jiaming Ji, Kaile Wang∗, Tianyi Qiu∗, Boyuan Chen∗, Jiayi Zhou, Changye Li, Hantao Lou, Yaodong Yang†
PKU-Alignment Team, Peking University
∗ Equal contributions. † Corresponding author. Code: https://github.com/PKU-Alignment/llms-resist-alignment
 

 
Abstract

Large language models (LLMs) may exhibit undesirable behaviors. Recent efforts have focused on aligning these models to prevent harmful generation. Despite these efforts, studies have shown that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its effects merely superficial? In this work, we answer this question through both theoretical and empirical means. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Using compression theory, we formally derive that such a fine-tuning process disproportionately undermines alignment compared to pre-training, potentially by orders of magnitude. We conduct experimental validations to confirm the presence of elasticity across models of varying types and sizes. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. We further reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our discovery signifies the importance of taming the inherent elasticity of LLMs, thereby overcoming their resistance to alignment fine-tuning.
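The two-phase pattern described in the abstract (rapid decline, then a much slower drift after reverting to the pre-training distribution) can be located with a simple change-point fit. The toy illustration below runs on synthetic data with an assumed curve shape; it is not the paper's analysis or measurements:

```python
# Toy change-point detection on a synthetic alignment-score trajectory:
# find the step where the rate of decline drops, by fitting two separate
# least-squares lines around every candidate breakpoint.
import numpy as np

steps = np.arange(200)
# assumed shape: fast decline toward the pre-training level, then slow drift
score = np.where(steps < 50, 0.9 - 0.01 * steps, 0.4 - 0.0005 * (steps - 50))
score = score + np.random.default_rng(0).normal(0.0, 0.005, steps.size)

def two_segment_sse(x, y, k):
    """Total squared error of separate linear fits on x[:k] and x[k:]."""
    sse = 0.0
    for xs, ys in ((x[:k], y[:k]), (x[k:], y[k:])):
        coef = np.polyfit(xs, ys, 1)
        sse += float(np.sum((np.polyval(coef, xs) - ys) ** 2))
    return sse

knee = min(range(5, steps.size - 5), key=lambda k: two_segment_sse(steps, score, k))
print(f"estimated reversion point: step {knee}")  # ~50 on this synthetic curve
```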

 
 
 
1 Introduction

 
Large language models (LLMs) have exhibited remarkable capabilities [1, 2]. However, given the inevitable biases and harmful content in the training dataset [3, 4], these models often exhibit behaviors that deviate from the designers' intentions, a phenomenon we refer to as model misalignment. Therefore, aligning LLMs to ensure their behaviors remain consistent with human intentions and values is particularly important [2, 5, 6, 7, 8].

 
 
Figure 1: Forward and Inverse Alignment. LLMs undergo numerous iterations during pre-training, forming a stable parameter distribution. Subsequent alignment procedures fine-tune this distribution to reflect human intentions. Our research question is: during further fine-tuning, is it harder to deviate from the stable parameter distribution formed during pre-training than to maintain it?
 
 
So far, we mainly steer or align models with fine-tuning-based methods, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) [9], and more [8, 10, 11, 12, 13, 14].
However, it remains unclear whether such methods truly penetrate the model representations

... (truncated, 98 KB total)
Resource ID: 0b23a4115fcd80c0 | Stable ID: YzQwMWRlOD