
The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs

paper

Authors

Omar Mahmoud·Ali Khalil·Thommen George Karimpanal·Buddhika Laknath Semage·Santu Rana

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper identifies a critical trade-off in LLM alignment where efforts to reduce hallucinations inadvertently weaken safety mechanisms, proposing sparse autoencoders as a solution to disentangle these competing objectives.

Paper Details

Citations
0
0 influential
Year
2026
Methodology
peer-reviewed
Categories
Findings of the Association for Computational Linguistics

Metadata

arXiv preprint · primary source

Summary

This paper identifies and addresses a critical trade-off in LLM alignment: efforts to mitigate hallucinations and improve factual accuracy often inadvertently weaken safety alignment and refusal behavior. The authors demonstrate that hallucination and refusal information are encoded in overlapping model components, causing alignment methods to suppress factual knowledge unintentionally. They propose a solution using sparse autoencoders to disentangle refusal-related features from hallucination features, combined with subspace orthogonalization during fine-tuning to preserve safety alignment while maintaining truthfulness. Evaluation on commonsense reasoning and harmful benchmarks shows their method successfully mitigates the truthfulness-safety trade-off.
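To make the disentanglement idea concrete, here is a minimal numpy sketch of a sparse autoencoder (SAE) applied to residual-stream activations. Everything here is illustrative: the weights are random (an L1-penalized training loop is omitted), the dimensions are toy-sized, and the "refusal vs. benign" activations are synthetic stand-ins; the paper's actual SAEs, training data, and feature-selection criteria may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 16, 64  # toy sizes; real SAEs use thousands of latents

# Randomly initialised tied-weight SAE (illustrative; a trained SAE would
# minimise reconstruction error plus an L1 sparsity penalty on z).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = W_enc.T  # tied weights, a common SAE design choice
b_dec = np.zeros(d_model)

def sae_encode(x):
    # ReLU keeps latents non-negative; with L1 training pressure,
    # most latents would be exactly zero for any given input.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(z):
    return z @ W_dec + b_dec

# Hypothetical residual-stream activations for refusal-eliciting
# vs. benign prompts (synthetic here).
acts_refusal = rng.normal(size=(32, d_model))
acts_benign = rng.normal(size=(32, d_model))

# Latents whose mean activation differs most between the two prompt sets
# are candidate "refusal features" to keep separate from hallucination
# features during later interventions.
diff = sae_encode(acts_refusal).mean(0) - sae_encode(acts_benign).mean(0)
refusal_latents = np.argsort(-np.abs(diff))[:5]
```

The key point is that the SAE's overcomplete latent basis (64 latents for 16 dimensions here) gives individual features room to specialize, which is what makes separating refusal-related directions from hallucination-related ones plausible in the first place.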

Cited by 1 page

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
[License: CC BY-SA 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)

arXiv:2510.07775v2 \[cs.CL\] 30 Jan 2026

# The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs

This paper contains text that might be offensive.


Omar Mahmoud∗, Ali Khalil∗, Buddhika Laknath Semage†

Thommen George Karimpanal‡, Santu Rana∗

∗Applied Artificial Intelligence Initiative, Deakin University, Australia

‡School of Information Technology, Deakin University, Australia

†Independent

o.mahmoud@deakin.edu.au


###### Abstract


Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally.
We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment. We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety. Code: [https://github.com/OmarMohammed88/Hall_Refusal](https://github.com/OmarMohammed88/Hall_Refusal)
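One plausible reading of "subspace orthogonalization during fine-tuning" is projecting each parameter update onto the orthogonal complement of a refusal-related subspace, so fine-tuning cannot move the model along those directions. The sketch below shows that projection on a toy vector; the basis here is random and hypothetical, and the authors' exact procedure (which layers, which directions, how the subspace is estimated) may differ.

```python
import numpy as np

def orthogonalize_update(grad, refusal_basis):
    """Remove from `grad` its component inside the refusal subspace.

    `refusal_basis` must have orthonormal columns spanning the subspace,
    e.g. obtained via a QR decomposition of refusal feature directions.
    """
    Q = refusal_basis
    return grad - Q @ (Q.T @ grad)

rng = np.random.default_rng(0)
d = 8

# Hypothetical refusal directions; orthonormalise their span via QR.
dirs = rng.normal(size=(d, 2))
Q, _ = np.linalg.qr(dirs)

grad = rng.normal(size=d)
safe_grad = orthogonalize_update(grad, Q)
```

After projection, `Q.T @ safe_grad` is zero up to floating-point error: the fine-tuning step no longer has any component along the refusal directions, which is the mechanism by which refusal behavior would be preserved.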


![Refer to caption](https://arxiv.org/html/2510.07775v2/x3.png)Figure 1: The truthfulness–safety trade-off. Interventions that improve truthfulness—such as head steering, probing, or representation mapping—can unintenti

... (truncated, 98 KB total)