Arditi et al., Refusal in Language Models Is Mediated by a Single Direction (https://arxiv.org/abs/2406.11717)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This paper shows that LLM refusal behavior is controlled by a single interpretable direction in activation space, enabling both a mechanistic understanding of safety mechanisms and the development of jailbreak techniques, which is critical for understanding vulnerabilities in model alignment.
Paper Details
Metadata
Summary
This paper demonstrates that refusal behavior in large language models is mediated by a single direction in the model's activation space, a finding consistent across 13 popular open-source chat models of up to 72B parameters. The authors identify this 'refusal direction' and show that erasing it from the residual stream prevents models from refusing harmful requests, while amplifying it causes refusal of benign instructions. They leverage this finding to develop a white-box jailbreak method that surgically disables refusal with minimal impact on other capabilities, and they mechanistically analyze how adversarial suffixes suppress the refusal direction. The work highlights the brittleness of current safety fine-tuning approaches and demonstrates how mechanistic interpretability can be used to control model behavior.
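For intuition, below is a minimal sketch of the two interventions the summary describes: projecting the refusal direction out of residual-stream activations ("erasing"), and adding it back with a positive coefficient ("amplifying"). The function names, the steering coefficient, and the assumption that a unit-norm direction has already been extracted are illustrative, not taken from the paper's released code.

```python
import torch

def ablate_refusal_direction(activations: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component of residual-stream activations along the refusal direction.

    activations: tensor of shape (..., d_model)
    refusal_dir: tensor of shape (d_model,)
    """
    r = refusal_dir / refusal_dir.norm()           # work with a unit vector
    proj = (activations @ r).unsqueeze(-1) * r     # component along the refusal direction
    return activations - proj                      # 'erase' that component

def add_refusal_direction(activations: torch.Tensor, refusal_dir: torch.Tensor,
                          alpha: float = 8.0) -> torch.Tensor:
    """Push activations along the refusal direction to elicit refusal on benign prompts.

    alpha is an illustrative steering coefficient, not a value from the paper.
    """
    r = refusal_dir / refusal_dir.norm()
    return activations + alpha * r
```

In practice such hooks would be applied to the residual stream at every layer and token position during generation; the sketch only shows the per-tensor arithmetic.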
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
| Representation Engineering | Approach | 72.0 |
Cached Content Preview
# Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi∗ (Independent), Oscar Obeso∗ (ETH Zürich), Aaquib Syed (University of Maryland), Daniel Paleka (ETH Zürich), Nina Rimsky (Anthropic), Wes Gurnee (MIT), Neel Nanda
###### Abstract
Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones.
While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood.
In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
Specifically, for each model, we find a single direction such that erasing this direction from the model’s residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions.
Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.
Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction.
Our findings underscore the brittleness of current safety fine-tuning methods.
More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior. Code available at [https://github.com/andyrdt/refusal_direction](https://github.com/andyrdt/refusal_direction).
Correspondence to [andyrdt@gmail.com](mailto:andyrdt@gmail.com), [obalcells@student.ethz.ch](mailto:obalcells@student.ethz.ch).
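As a rough illustration of how such a direction can be obtained, the sketch below computes a difference of mean activations between harmful and harmless instructions at a single fixed layer and token position. The paper's own procedure additionally sweeps layers and positions and selects the best candidate; the function name and shapes here are hypothetical.

```python
import torch

def candidate_refusal_direction(harmful_acts: torch.Tensor,
                                harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between residual-stream activations
    collected on harmful vs. harmless instructions.

    harmful_acts, harmless_acts: (n_prompts, d_model), taken at one fixed
    layer and token position.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()            # normalize to unit length
```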
## 1 Introduction
Deployed large language models (LLMs) undergo multiple rounds of fine-tuning to become both helpful and harmless: to provide helpful responses to innocuous user requests, but to refuse harmful or inappropriate ones (Bai et al., [2022](https://ar5iv.labs.arxiv.org/html/2406.11717#bib.bib7)).
Naturally, large numbers of users and researchers alike have attempted to circumvent these defenses using a wide array of jailbreak attacks (Wei et al., [2023](https://ar5iv.labs.arxiv.org/html/2406.11717#bib.bib85); Xu et al., [2024](https://ar5iv.labs.arxiv.org/html/2406.11717#bib.bib90); Chu et al., [2024](https://ar5iv.labs.arxiv.org/html/2406.11717#bib.bib17)) to uncensor model outputs, including fine-tuning techniques (Yang et al., [2023](https://ar5iv.labs.arxiv.org/html/2406.11717#bib.bib91); Lermen et al., [2023](https://ar5iv.labs.arxiv.org/html/2406.11717#bib.bib37); Zhan et al., [2023](https://ar5iv.labs.arxiv.org/html/2406.11717#bib.bib94)).
While the consequences of a successful attack on current chat assistants are modest, the scale and severity of harm from misuse could increase dramatically if frontier models are endowed with increased agency and autonomy (Anthropic, [2024](https://ar5iv.labs.arxiv.org/html/2406.11717#bib.bib4))
... (truncated, 98 KB total)