Safety Misalignment Against Large Language Models

web

ndss-symposium.org·ndss-symposium.org/wp-content/uploads/2025-1089-paper.pdf

Published at NDSS 2025, this paper is relevant to researchers studying the robustness of LLM safety alignment and adversarial attacks; useful for understanding how alignment can be undermined and what defenses may help.

Metadata

Importance: 62/100conference paperprimary source

Summary

This paper presents the first comprehensive evaluation framework for safety misalignment attacks on LLMs, investigating system-prompt modification, fine-tuning, and model editing approaches. The authors introduce a novel Self-Supervised Representation Attack (SSRA) that achieves significant safety misalignment without requiring harmful training responses, and a corresponding defense (SSRD) that can re-align compromised models. Findings empirically demonstrate the fragility of current LLM safety alignment mechanisms.

Key Points

•Provides a unified benchmark evaluating robustness of LLM safety alignment against diverse attack categories including prompt modification, fine-tuning, and model editing.
•Supervised Fine-Tuning (SFT) is identified as the most potent misalignment attack but requires access to harmful model responses for training data.
•Novel SSRA (Self-Supervised Representation Attack) achieves strong safety misalignment without needing harmful response data, lowering the attack barrier.
•Proposed SSRD (Self-Supervised Representation Defense) can effectively re-align models after misalignment attacks, outperforming existing defenses.
•Results highlight the fragility of safety alignment across multiple LLMs, raising concerns about the reliability of current alignment strategies.

Cited by 1 page

Page	Type	Quality
AI Alignment	Approach	91.0

Cached Content Preview

HTTP 200Fetched Apr 7, 202698 KB

Safety Misalignment Against Large
Language Models
Yichen Gong1, Delong Ran2, Xinlei He3, Tianshuo Cong4(), Anyu Wang4,5,6(), and Xiaoyun Wang4,5,6,7,8
1Department of Computer Science and Technology, Tsinghua University,
2Institute for Network Sciences and Cyberspace, Tsinghua University,
3Hong Kong University of Science and Technology (Guangzhou), 4Institute for Advanced Study, BNRist, Tsinghua University,
5Zhongguancun Laboratory, 6National Financial Cryptography Research Center, 7Shandong Institute of Blockchain,
8Key Laboratory of Cryptologic Technology and Information Security (Ministry of Education),
School of Cyber Science and Technology, Shandong University
E-mails: {gongyc18, rdl22}@mails.tsinghua.edu.cn, xinleihe@hkust-gz.edu.cn,
{congtianshuo, anyuwang, xiaoyunwang}@tsinghua.edu.cn
Abstract—The safety alignment of Large Language Models
(LLMs) is crucial to prevent unsafe content that violates human
values. To ensure this, it is essential to evaluate the robustness of
their alignment against diverse malicious attacks. However, the
lack of a large-scale, unified measurement framework hinders a
comprehensive understanding of potential vulnerabilities. To fill
this gap, this paper presents the first comprehensive evaluation
of existing and newly proposed safety misalignment methods
for LLMs. Specifically, we investigate four research questions:
(1) evaluating the robustness of LLMs with different alignment
strategies, (2) identifying the most effective misalignment method,
(3) determining key factors that influence misalignment effective-
ness, and (4) exploring various defenses. The safety misalignment
attacks in our paper include system-prompt modification, model
fine-tuning, and model editing. Our findings show that Supervised
Fine-Tuning is the most potent attack but requires harmful model
responses. In contrast, our novel Self-Supervised Representa-
tion Attack (SSRA) achieves significant misalignment without
harmful responses. We also examine defensive mechanisms such
as safety data filter, model detoxification, and our proposed
Self-Supervised Representation Defense (SSRD), demonstrating
that SSRD can effectively re-align the model. In conclusion,
our unified safety alignment evaluation framework empirically
highlights the fragility of the safety alignment of LLMs.
I. INTRODUCTION
With the emergence of powerful Large Language Models
(LLMs) like ChatGPT [1] and Llama 2 [2], LLMs have been
integrated into multifarious aspects of daily life, including
smartphones [3] and chatbots [4], making them effortlessly
accessible to people of all ages with diverse usage intentions.
Notably, in August 2024, the European Union AI Act [5]
officially came into effect, marking the first-ever legal regu-
latory framework for Artificial Intelligence (AI). This AI Act
()Tianshuo Cong and Anyu Wang are the corresponding authors.
Safety Guardrail
LLM
Safety Misalignment
System-Prompt Modification
Model Fine-tuning
Model Editing
LLM
Harmful Queries
Refusal 

... (truncated, 98 KB total)

Resource ID: 3a7a904debb5b65f | Stable ID: sid_p8ADVGxuDm