
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models


Relevant to researchers working on multimodal AI safety, particularly the challenge of maintaining safety alignment when LLMs are extended with vision capabilities; this concern grows as VLMs are more widely deployed.

Metadata

Importance: 62/100 · conference paper · primary source

Summary

This paper, submitted to ICLR 2025 with a version later appearing in ACL Findings 2025, investigates why vision-language models (VLMs) suffer safety alignment degradation when a visual modality is introduced, identifying the mechanisms by which visual inputs can bypass text-based safety training. The authors propose methods to diagnose this degradation and mitigate it at inference time, strengthening safety alignment in multimodal AI systems.

Key Points

  • Identifies the root causes of safety alignment degradation when language models are extended with visual capabilities
  • Visual modality inputs can circumvent safety guardrails established during text-only alignment training, because multi-modal representations drift away from the text-only distribution the backbone was aligned on (a drift that can be measured directly; see the sketch after this list)
  • Proposes concrete mitigation strategies to preserve safety alignment in vision-language models
  • Relevant to the growing challenge of securing multimodal AI systems against jailbreaks and harmful outputs
  • Submitted to ICLR 2025, with a later version appearing in ACL Findings 2025 (see the version list below), reflecting peer review within the safety research community
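
The representation gap named above is straightforward to probe empirically. Below is a minimal sketch of such a diagnostic, assuming one has already extracted and pooled last-layer hidden states for matched text-only and image+text prompts; the tensor shapes, pooling choice, and function name are illustrative assumptions, not the authors' code.

```python
# Hypothetical diagnostic: quantify how far multi-modal hidden states
# drift from the text-only distribution the LLM backbone was aligned on.
import torch
import torch.nn.functional as F

def representation_gap(text_hidden: torch.Tensor,
                       mm_hidden: torch.Tensor) -> dict:
    """Compare pooled hidden states of text-only vs. image+text prompts.

    text_hidden: (n_prompts, hidden_dim) states for text-only inputs
    mm_hidden:   (n_prompts, hidden_dim) states for the same prompts
                 with an image attached
    """
    # Mean shift of the whole distribution: the statistic a mean-based,
    # training-free correction could estimate once and reuse at inference.
    shift = mm_hidden.mean(dim=0) - text_hidden.mean(dim=0)
    # Per-prompt cosine similarity: 1.0 would mean no drift at all.
    cos = F.cosine_similarity(text_hidden, mm_hidden, dim=-1)
    return {"mean_shift_norm": shift.norm().item(),
            "mean_cosine": cos.mean().item()}

# Toy usage with random stand-ins for real hidden states.
torch.manual_seed(0)
text = torch.randn(64, 4096)
mm = text + 0.5 * torch.randn(1, 4096)  # simulate a systematic visual shift
print(representation_gap(text, mm))
```

A large mean-shift norm together with depressed cosine similarity is the signature of the drift the paper attributes to the vision module; the same shift vector feeds naturally into the mitigation sketch at the end of this page.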

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Alignment Robustness Trajectory Model | Analysis | 64.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 34 KB

4 Versions

- [University of California Publication Management System (December 31, 2024)](https://openreview.net/forum?id=16odXxzEE8)
- [ACL (Findings) 2025 (December 31, 2024)](https://openreview.net/forum?id=H9Dx9BoSjg)
- [Submitted to ICLR 2025 (September 23, 2024)](https://openreview.net/forum?id=EEWpE9cR27)
- [CoRR 2024 (December 31, 2023)](https://openreview.net/forum?id=veRieEb1eA)

## Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

[Download PDF](https://openreview.net/pdf?id=EEWpE9cR27)

### [Qin Liu](https://openreview.net/profile?id=~Qin_Liu5), [Chao Shang](https://openreview.net/profile?id=~Chao_Shang1), [Ling Liu](https://openreview.net/profile?id=~Ling_Liu6), [Nikolaos Pappas](https://openreview.net/profile?id=~Nikolaos_Pappas1), [Jie Ma](https://openreview.net/profile?id=~Jie_Ma3), [Neha Anna John](https://openreview.net/profile?id=~Neha_Anna_John1), [Srikanth Doss](https://openreview.net/profile?id=~Srikanth_Doss1), [Lluis Marquez](https://openreview.net/profile?id=~Lluis_Marquez1), [Miguel Ballesteros](https://openreview.net/profile?id=~Miguel_Ballesteros1), [Yassine Benajiba](https://openreview.net/profile?id=~Yassine_Benajiba1)

23 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · Readers: Everyone · [Revisions](https://openreview.net/revisions?id=EEWpE9cR27) · [BibTeX](https://openreview.net/forum?id=EEWpE9cR27#) · [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

**Keywords:** Safety Alignment, Multi-modality, AI Security

**TL;DR:** Investigate safety alignment degradation of VLMs and mitigate this phenomenon by inference-time representation manipulation, significantly improving safety alignment of VLMs without training.

**Abstract:**

The safety alignment ability of Vision-Language Models (VLMs) is prone to be degraded by the integration of the vision module compared to its LLM backbone. We investigate this phenomenon, dubbed as “safety alignment degradation” in this paper, and show that the challenge arises from the representation gap that emerges when introducing vision modality to VLMs. In particular, we show that the representations of multi-modal inputs shift away from that of text-only inputs which represent the distribution that the LLM backbone is optimized for. At the same time, the safety alignment capabilities, initially developed within the textual embedding space, do not successfully transfer to this new multi-modal representation space. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM), an inference time representation intervention method for recovering the safety alignment ability that is inherent in the LLM backbone of VLMs, while simultaneously preservi…

... (truncated, 34 KB total)
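
The abstract describes CMRM as a training-free, inference-time representation intervention. As a rough illustration only, here is a minimal sketch of how such an intervention could be wired into the LLM backbone of a VLM with a PyTorch forward hook, assuming the correction is a simple mean-shift subtraction at a single decoder layer; the class name, hook placement, and layer choice are assumptions, not the paper's specified procedure.

```python
# Hypothetical sketch of a CMRM-style, training-free intervention:
# subtract a precomputed modality shift from one layer's hidden states.
import torch

class MeanShiftCorrector:
    """Forward hook that nudges multi-modal hidden states back toward
    the text-only distribution, leaving model weights untouched."""

    def __init__(self, shift: torch.Tensor):
        self.shift = shift  # (hidden_dim,), e.g. mean(mm) - mean(text)

    def __call__(self, module, inputs, output):
        # Many decoder blocks return a tuple whose first element is the
        # hidden-state tensor of shape (batch, seq, hidden_dim).
        hidden = output[0] if isinstance(output, tuple) else output
        corrected = hidden - self.shift.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (corrected,) + output[1:]
        return corrected

# Toy demonstration on a plain linear layer standing in for a decoder block.
layer = torch.nn.Linear(8, 8)
handle = layer.register_forward_hook(MeanShiftCorrector(torch.ones(8)))
out = layer(torch.zeros(2, 3, 8))  # hook subtracts the shift from the output
handle.remove()

# Hypothetical attachment to a real VLM (names are illustrative):
# handle = vlm.language_model.layers[k].register_forward_hook(
#     MeanShiftCorrector(estimated_shift))
# ... generate as usual; call handle.remove() to restore default behavior.
```

Because the hook only edits activations on the fly, the approach matches the TL;DR's claim of improving safety alignment without training; design questions such as which layer to intervene on and how to estimate the shift are left to the paper's actual method.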