Longterm Wiki

"Unforgotten Safety: Preserving Safety Alignment of LLMs with Continual Learning"

paper

Authors

Lama Alssum·Hani Itani·Hasan Abed Al Kader Hammoud·Philip Torr·Adel Bibi·Bernard Ghanem

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Relevant to practitioners fine-tuning safety-aligned LLMs on custom datasets; proposes continual learning as a principled mitigation against safety regression introduced by standard fine-tuning workflows.

Paper Details

Citations: 3 (0 influential)
Year: 2025

Metadata

Importance: 62/100 · arXiv preprint · primary source

Abstract

The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.

Summary

This paper reframes post-training safety degradation in LLMs as a catastrophic forgetting problem and systematically evaluates continual learning (CL) methods to preserve safety alignment during fine-tuning. Across three model families and multiple downstream tasks, CL approaches—especially Dark Experience Replay (DER)—consistently lower attack success rates versus standard fine-tuning while maintaining task utility. The findings hold even under adversarial conditions where training data contains poisoned harmful samples.

Key Points

  • Safety degradation during fine-tuning is attributed to catastrophic forgetting of safety-aligned behaviors, motivating a continual learning framing.
  • Multiple CL strategies (regularization-based, memory-based, model merging) are evaluated; DER (Dark Experience Replay) performs most consistently.
  • Experiments span three LLM families and multiple downstream tasks, providing broad empirical coverage of the proposed approach.
  • CL methods reduce attack success rates (ASR) without substantially sacrificing downstream task performance—a key practical tradeoff.
  • The approach is robust to data poisoning scenarios where harmful samples are injected into user fine-tuning datasets.
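The best-performing method in the paper, Dark Experience Replay (DER), admits a minimal sketch. This is not the authors' implementation: the toy classifier, the `DERBuffer` class, and the mixing weight `alpha` here are illustrative assumptions. The core idea is that a small buffer stores safety examples together with the logits the aligned model produced before fine-tuning, and the fine-tuning loss adds an MSE distillation term pulling the current model's logits back toward those stored ones.

```python
import torch
import torch.nn.functional as F

class DERBuffer:
    """Hypothetical replay buffer: holds past inputs alongside the
    logits the aligned model originally produced for them."""
    def __init__(self):
        self.inputs, self.logits = [], []

    def add(self, x, logits):
        # detach so stored logits stay fixed as distillation targets
        self.inputs.append(x.detach())
        self.logits.append(logits.detach())

    def sample(self, k):
        idx = torch.randperm(len(self.inputs))[:k].tolist()
        xs = torch.stack([self.inputs[i] for i in idx])
        zs = torch.stack([self.logits[i] for i in idx])
        return xs, zs

def der_loss(model, batch_x, batch_y, buffer, alpha=0.5, k=4):
    # standard task loss on the user's fine-tuning data
    task_loss = F.cross_entropy(model(batch_x), batch_y)
    # DER term: match the logits recorded before fine-tuning
    xs, zs = buffer.sample(k)
    replay_loss = F.mse_loss(model(xs), zs)
    return task_loss + alpha * replay_loss

# toy demo: a linear classifier stands in for the LLM
model = torch.nn.Linear(8, 3)
buf = DERBuffer()
for _ in range(6):  # record pre-fine-tuning logits on "safety" inputs
    x = torch.randn(8)
    buf.add(x, model(x))
bx, by = torch.randn(5, 8), torch.randint(0, 3, (5,))
loss = der_loss(model, bx, by, buf)
loss.backward()  # an optimizer step would follow in a real loop
```

In an actual fine-tuning-as-a-service setup the buffer would hold safety-alignment prompts and the pre-fine-tuning model's output distributions, so the distillation term penalizes drift away from aligned behavior while the task loss fits the user's data.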

Cited by 1 page

Page | Type | Quality
AI Alignment | Approach | 91.0

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 43 KB
Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

Lama Alssum †,1,* · Hani Itani 1,* · Hasan Abed Al Kader Hammoud 1 · Philip Torr 2 · Adel Bibi 2 · Bernard Ghanem 1
 Abstract

 The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user’s selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.

 
1 King Abdullah University of Science and Technology
2 University of Oxford
* These authors contributed equally
† Corresponding author: lama.alssum.1@kaust.edu.sa
 
 
 1 Introduction

 
Large language models (LLMs) have achieved remarkable capabilities across different domains. However, capability alone without rigorous safety alignment is insufficient for responsible deployment. To address LLM safety, models are aligned prior to deployment using supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) [christiano2017deep], or direct preference optimization (DPO) [rafailov2023direct]. However, subsequent fine-tuning of an aligned model on additional data can degrade its safety, causing the model to forget or override safe behavior [qi2023fine; lermen2023lora]. As an emerging paradigm, LLM fine-tuning services enable end users to upload their own data for fine-tuning. Fine-tuning can erode the already learned safe behavior, even when the data is entirely benign. Moreover, a few harmful samples mixed into benign data can drastically compromise alignment. Consequently, there is an urgent need for an effective mitigation strategy since the providers of these services are responsible for the output of the fine-tuned models.

 
 
 In this context, we observe that safety degradat

... (truncated, 43 KB total)