HalluLens, ACL 2025 (https://aclanthology.org/2025.acl-long.1176/)
Published at ACL 2025, HalluLens is a structured hallucination benchmark relevant to AI safety researchers concerned with LLM reliability, trustworthiness, and the practical challenges of measuring and mitigating hallucination in deployed systems.
Metadata
Importance: 62/100 · conference paper · dataset
Summary
HalluLens introduces a comprehensive benchmark for evaluating LLM hallucinations, built on a clear taxonomy that distinguishes extrinsic hallucinations (content deviating from the model's training data) from intrinsic hallucinations (content deviating from the user's input). It contributes three new extrinsic hallucination tasks with dynamic test set generation to prevent data leakage and improve robustness, and it addresses the fragmented state of hallucination research by providing a unified framework and a publicly released codebase.
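The preview does not spell out how dynamic test set generation works. As a rough sketch under stated assumptions (the entity pool, templates, and `generate_test_set` function below are hypothetical illustrations, not the HalluLens codebase), per-run regeneration might look like:

```python
import random

# Hypothetical sketch of dynamic test set generation: instead of a fixed,
# published question list (which can leak into future training data), each
# evaluation run samples fresh prompts from a large pool with a run-specific
# seed, so no single static test set can be memorized.
ENTITY_POOL = ["Marie Curie", "Mount Kinabalu", "Voyager 2"]  # stand-in pool
TEMPLATES = [
    "When was {e} first described or launched?",
    "State one verifiable fact about {e}.",
]

def generate_test_set(n_items: int, run_seed: int) -> list[str]:
    """Sample a fresh prompt set; different seeds yield different sets."""
    rng = random.Random(run_seed)
    return [
        rng.choice(TEMPLATES).format(e=rng.choice(ENTITY_POOL))
        for _ in range(n_items)
    ]

if __name__ == "__main__":
    print(generate_test_set(3, run_seed=2025))
```

Because the concrete items vary with the seed and the pool, a model evaluated on one run cannot have seen the exact test set in a prior release.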
Key Points
- Proposes a taxonomy separating 'hallucination' from 'factuality', and distinguishing extrinsic vs. intrinsic hallucinations for research consistency (a toy illustration of the distinction follows this list).
- Introduces three new extrinsic hallucination evaluation tasks with dynamic test set generation to mitigate data contamination risks.
- Emphasizes extrinsic hallucinations, where outputs deviate from training data, as increasingly critical as LLMs become more capable.
- Fills a gap in the benchmark landscape: no prior benchmark was solely dedicated to extrinsic hallucinations.
- Releases an open codebase enabling reproducible and extensible hallucination benchmarking by the research community.
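To make the taxonomy concrete, here is a minimal, hypothetical sketch of the intrinsic/extrinsic decision rule. The `classify` helper and its boolean inputs are assumptions for illustration; the benchmark's actual per-task judgments are more involved.

```python
from enum import Enum
from typing import Optional

class HallucinationType(Enum):
    NONE = "consistent"
    INTRINSIC = "intrinsic"   # output contradicts the user-provided input
    EXTRINSIC = "extrinsic"   # output is unsupported by the training data

def classify(supported_by_input: Optional[bool],
             supported_by_training_data: bool) -> HallucinationType:
    """Toy decision rule mirroring the taxonomy: a claim that conflicts
    with the given input is intrinsic; a claim that merely lacks grounding
    in the training data is extrinsic. `supported_by_input` is None when
    the prompt supplies no source text to check against."""
    if supported_by_input is False:
        return HallucinationType.INTRINSIC
    if not supported_by_training_data:
        return HallucinationType.EXTRINSIC
    return HallucinationType.NONE

# A summary sentence contradicting its source document: intrinsic.
assert classify(False, True) is HallucinationType.INTRINSIC
# A free-form answer inventing an unsupported fact: extrinsic.
assert classify(None, False) is HallucinationType.EXTRINSIC
```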
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Capability | 60.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 4 KB
## [HalluLens: LLM Hallucination Benchmark](https://aclanthology.org/2025.acl-long.1176.pdf)
[Yejin Bang](https://aclanthology.org/people/yejin-bang/unverified/),
[Ziwei Ji](https://aclanthology.org/people/ziwei-ji/),
[Alan Schelten](https://aclanthology.org/people/alan-schelten/unverified/),
[Anthony Hartshorn](https://aclanthology.org/people/anthony-hartshorn/unverified/),
[Tara Fowler](https://aclanthology.org/people/tara-fowler/unverified/),
[Cheng Zhang](https://aclanthology.org/people/cheng-zhang/),
[Nicola Cancedda](https://aclanthology.org/people/nicola-cancedda/unverified/),
[Pascale Fung](https://aclanthology.org/people/pascale-fung/unverified/)
* * *
##### Abstract
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as “hallucination.” These hallucinations undermine user trust and hinder the adoption of generative AI systems. Addressing hallucinations is important for the advancement of LLMs. This paper introduces a comprehensive hallucination benchmark, HalluLens, incorporating both extrinsic and intrinsic evaluation tasks, built upon a clear taxonomy of hallucination. A major challenge in benchmarking hallucinations is the lack of a unified framework due to inconsistent definitions and categorizations. We disentangle LLM hallucination from “factuality” and propose a taxonomy distinguishing extrinsic and intrinsic hallucinations to promote consistency and facilitate research. We emphasize extrinsic hallucinations – where generated content deviates from training data – as they become increasingly relevant with LLM advancements. However, no benchmark is solely dedicated to extrinsic hallucinations. To address this gap, HalluLens introduces three new extrinsic tasks with dynamic test set generation to mitigate data leakage and ensure robustness. We release the codebase for the extrinsic hallucination benchmark.
Anthology ID: 2025.acl-long.1176
Volume: [Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)](https://aclanthology.org/volumes/2025.acl-long/)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: [Wanxiang Che](https://aclanthology.org/people/wanxiang-che/unverified/), [Joyce Nabende](https://aclanthology.org/people/joyce-nabende/unverified/), [Ekaterina Shutova](https://aclanthology.org/people/ekaterina-shutova/unverified/), [Mohammad Taher Pilehvar](https://aclanthology.org/people/mohammad-taher-pilehvar/unverified/)
Venue: [ACL](https://aclanthology.org/venues/acl/ "Annual Meeting of the Association for Computational Linguistics")
Publisher: Association for Computational Linguistics
Pages: 24128–24156
URL: [https://aclanthology.org/2025.acl-long.1176/](https://aclanthology.org/2025.acl-long.1176/)
DOI: [10.18653/v1/2025.acl-long.1176](https://doi.org/10.18653/v1/2025.acl-long.1176 "To the current version of the paper by DOI")
Bibkey: bang-etal-2025-hallulens
Cite (ACL): Yejin Bang, Ziwei Ji, Alan Schelten, A
... (truncated, 4 KB total)
Resource ID: 351b4e7354e2dc5b | Stable ID: ZTZhN2E1Yj