The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This paper identifies a critical trade-off in LLM alignment where efforts to reduce hallucinations inadvertently weaken safety mechanisms, proposing sparse autoencoders as a solution to disentangle these competing objectives.
Paper Details
Metadata
Summary
This paper identifies and addresses a critical trade-off in LLM alignment: efforts to mitigate hallucinations and improve factual accuracy often inadvertently weaken safety alignment and refusal behavior. The authors demonstrate that hallucination and refusal information are encoded in overlapping model components, causing alignment methods to suppress factual knowledge unintentionally. They propose a solution using sparse autoencoders to disentangle refusal-related features from hallucination features, combined with subspace orthogonalization during fine-tuning to preserve safety alignment while maintaining truthfulness. Evaluation on commonsense reasoning and harmful benchmarks shows their method successfully mitigates the truthfulness-safety trade-off.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Reducing Hallucinations in AI-Generated Wiki Content | Approach | 68.0 |
Cached Content Preview
[License: CC BY-SA 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)
arXiv:2510.07775v2 \[cs.CL\] 30 Jan 2026
![Uncaptioned image](https://arxiv.org/html/2510.07775v2/x1.png)

# The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs

This paper contains text that might be offensive.
Omar Mahmoud∗, Ali Khalil∗, Buddhika Laknath Semage†
Thommen George Karimpanal‡, Santu Rana∗
∗Applied Artificial Intelligence Initiative, Deakin University, Australia
‡School of Information Technology, Deakin University, Australia
†Independent
o.mahmoud@deakin.edu.au
###### Abstract
Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally.
We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment. We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.1 [https://github.com/OmarMohammed88/Hall_Refusal](https://github.com/OmarMohammed88/Hall_Refusal)
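The core mechanism the abstract describes, projecting fine-tuning updates away from a refusal-feature subspace so that training cannot erode refusal behavior, can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the function name `orthogonalize_update` and the toy dimensions are assumptions, and the refusal directions would in practice come from the paper's sparse-autoencoder analysis rather than random vectors.

```python
import numpy as np

def orthogonalize_update(update, refusal_dirs):
    """Project a parameter update onto the orthogonal complement of a
    refusal-feature subspace, so the update has no component along any
    refusal-encoding direction. Illustrative sketch only."""
    # Orthonormalize the refusal directions (given as rows) via QR on the transpose.
    Q, _ = np.linalg.qr(refusal_dirs.T)  # Q: (d, k), orthonormal columns spanning the subspace
    # Subtract the component of the update that lies inside the refusal subspace.
    return update - (update @ Q) @ Q.T

rng = np.random.default_rng(0)
d = 8
# Hypothetical refusal directions; in the paper these would be SAE-derived features.
refusal_dirs = rng.normal(size=(2, d))
update = rng.normal(size=(d,))
projected = orthogonalize_update(update, refusal_dirs)
# The projected update is (numerically) orthogonal to every refusal direction.
print(np.allclose(refusal_dirs @ projected, 0.0, atol=1e-10))
```

Applying such a projection to each gradient or weight update during fine-tuning is what lets truthfulness-oriented training proceed without moving the model along the directions that encode refusal.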
Figure 1: The truthfulness–safety trade-off. Interventions that improve truthfulness, such as head steering, probing, or representation mapping, can unintentionally ...
... (truncated, 98 KB total)