Skip to content
Longterm Wiki
Back

Safety Misalignment Against Large Language Models

web

Published at NDSS 2025, this paper is relevant to researchers studying the robustness of LLM safety alignment and adversarial attacks; useful for understanding how alignment can be undermined and what defenses may help.

Metadata

Importance: 62/100conference paperprimary source

Summary

This paper presents the first comprehensive evaluation framework for safety misalignment attacks on LLMs, investigating system-prompt modification, fine-tuning, and model editing approaches. The authors introduce a novel Self-Supervised Representation Attack (SSRA) that achieves significant safety misalignment without requiring harmful training responses, and a corresponding defense (SSRD) that can re-align compromised models. Findings empirically demonstrate the fragility of current LLM safety alignment mechanisms.

Key Points

  • Provides a unified benchmark evaluating robustness of LLM safety alignment against diverse attack categories including prompt modification, fine-tuning, and model editing.
  • Supervised Fine-Tuning (SFT) is identified as the most potent misalignment attack but requires access to harmful model responses for training data.
  • Novel SSRA (Self-Supervised Representation Attack) achieves strong safety misalignment without needing harmful response data, lowering the attack barrier.
  • Proposed SSRD (Self-Supervised Representation Defense) can effectively re-align models after misalignment attacks, outperforming existing defenses.
  • Results highlight the fragility of safety alignment across multiple LLMs, raising concerns about the reliability of current alignment strategies.

Cited by 1 page

PageTypeQuality
AI AlignmentApproach91.0

Cached Content Preview

HTTP 200Fetched Mar 20, 202698 KB
# Safety Misalignment Against Large Language Models

Yichen Gong1, Delong Ran2, Xinlei $\\mathrm { H e ^ { 3 } }$ , Tianshuo $\\mathrm { C o n g ^ { 4 ( \\boxtimes ) } }$ , Anyu $\\mathrm { W a n g ^ { 4 , 5 , 6 ( \\boxtimes ) } }$ , and Xiaoyun Wang4,5,6,7,8 1Department of Computer Science and Technology, Tsinghua University, 2Institute for Network Sciences and Cyberspace, Tsinghua University, 3Hong Kong University of Science and Technology (Guangzhou), 4Institute for Advanced Study, BNRist, Tsinghua University, 5Zhongguancun Laboratory, 6National Financial Cryptography Research Center, 7Shandong Institute of Blockchain, 8Key Laboratory of Cryptologic Technology and Information Security (Ministry of Education), School of Cyber Science and Technology, Shandong University E-mails: {gongyc18, rdl22} $@$ mails.tsinghua.edu.cn, xinleihe $@$ hkust-gz.edu.cn, {congtianshuo, anyuwang, xiaoyunwang} $@$ tsinghua.edu.cn

Abstract—The safety alignment of Large Language Models (LLMs) is crucial to prevent unsafe content that violates human values. To ensure this, it is essential to evaluate the robustness of their alignment against diverse malicious attacks. However, the lack of a large-scale, unified measurement framework hinders a comprehensive understanding of potential vulnerabilities. To fill this gap, this paper presents the first comprehensive evaluation of existing and newly proposed safety misalignment methods for LLMs. Specifically, we investigate four research questions: (1) evaluating the robustness of LLMs with different alignment strategies, (2) identifying the most effective misalignment method, (3) determining key factors that influence misalignment effectiveness, and (4) exploring various defenses. The safety misalignment attacks in our paper include system-prompt modification, model fine-tuning, and model editing. Our findings show that Supervised Fine-Tuning is the most potent attack but requires harmful model responses. In contrast, our novel Self-Supervised Representation Attack (SSRA) achieves significant misalignment without harmful responses. We also examine defensive mechanisms such as safety data filter, model detoxification, and our proposed Self-Supervised Representation Defense (SSRD), demonstrating that SSRD can effectively re-align the model. In conclusion, our unified safety alignment evaluation framework empirically highlights the fragility of the safety alignment of LLMs.

# I. INTRODUCTION

With the emergence of powerful Large Language Models (LLMs) like ChatGPT \[1\] and Llama 2 \[2\], LLMs have been integrated into multifarious aspects of daily life, including smartphones \[3\] and chatbots \[4\], making them effortlessly accessible to people of all ages with diverse usage intentions. Notably, in August 2024, the European Union AI Act \[5\] officially came into effect, marking the first-ever legal regulatory framework for Artificial Intelligence (AI). This AI Act aims to regulate the safety of AI technologies. Consequentl

... (truncated, 98 KB total)
Resource ID: 3a7a904debb5b65f | Stable ID: N2VkNTMxMz