HalluLens, ACL 2025 (https://aclanthology.org/2025.acl-long.1176/)
Published at ACL 2025, HalluLens is a structured hallucination benchmark relevant to AI safety researchers concerned with LLM reliability, trustworthiness, and the practical challenges of measuring and mitigating hallucination in deployed systems.
Metadata
Importance: 62/100 · conference paper · dataset
Summary
HalluLens introduces a comprehensive benchmark for evaluating LLM hallucinations, built on a clear taxonomy that distinguishes extrinsic hallucinations (content deviating from the model's training data) from intrinsic hallucinations (content deviating from the user's input). It contributes three new extrinsic hallucination tasks with dynamic test set generation to prevent data leakage and improve robustness, and it addresses the fragmented state of hallucination research by providing a unified framework and a publicly released codebase.
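The preview does not spell out how dynamic test set generation works. As a rough sketch under stated assumptions (the entity pool, templates, and `generate_test_set` function below are hypothetical illustrations, not the HalluLens codebase), per-run regeneration might look like:

```python
import random

# Hypothetical sketch of dynamic test set generation: instead of a fixed,
# published question list (which can leak into future training data), each
# evaluation run samples fresh prompts from a large pool with a run-specific
# seed, so no single static test set can be memorized.
ENTITY_POOL = ["Marie Curie", "Mount Kinabalu", "Voyager 2"]  # stand-in pool
TEMPLATES = [
    "When was {e} first described or launched?",
    "State one verifiable fact about {e}.",
]

def generate_test_set(n_items: int, run_seed: int) -> list[str]:
    """Sample a fresh prompt set; different seeds yield different sets."""
    rng = random.Random(run_seed)
    return [
        rng.choice(TEMPLATES).format(e=rng.choice(ENTITY_POOL))
        for _ in range(n_items)
    ]

if __name__ == "__main__":
    print(generate_test_set(3, run_seed=2025))
```

Because the concrete items vary with the seed and the pool, a model evaluated on one run cannot have seen the exact test set in a prior release.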
Key Points
- Proposes a taxonomy separating 'hallucination' from 'factuality', and distinguishing extrinsic vs. intrinsic hallucinations for research consistency (a toy illustration of the distinction follows this list).
- Introduces three new extrinsic hallucination evaluation tasks with dynamic test set generation to mitigate data contamination risks.
- Emphasizes extrinsic hallucinations, where outputs deviate from training data, as increasingly critical as LLMs become more capable.
- Fills a gap in the benchmark landscape: no prior benchmark was solely dedicated to extrinsic hallucinations.
- Releases an open codebase enabling reproducible and extensible hallucination benchmarking by the research community.
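To make the taxonomy concrete, here is a minimal, hypothetical sketch of the intrinsic/extrinsic decision rule. The `classify` helper and its boolean inputs are assumptions for illustration; the benchmark's actual per-task judgments are more involved.

```python
from enum import Enum
from typing import Optional

class HallucinationType(Enum):
    NONE = "consistent"
    INTRINSIC = "intrinsic"   # output contradicts the user-provided input
    EXTRINSIC = "extrinsic"   # output is unsupported by the training data

def classify(supported_by_input: Optional[bool],
             supported_by_training_data: bool) -> HallucinationType:
    """Toy decision rule mirroring the taxonomy: a claim that conflicts
    with the given input is intrinsic; a claim that merely lacks grounding
    in the training data is extrinsic. `supported_by_input` is None when
    the prompt supplies no source text to check against."""
    if supported_by_input is False:
        return HallucinationType.INTRINSIC
    if not supported_by_training_data:
        return HallucinationType.EXTRINSIC
    return HallucinationType.NONE

# A summary sentence contradicting its source document: intrinsic.
assert classify(False, True) is HallucinationType.INTRINSIC
# A free-form answer inventing an unsupported fact: extrinsic.
assert classify(None, False) is HallucinationType.EXTRINSIC
```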
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Capability | 60.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 4 KB
## [HalluLens: LLM Hallucination Benchmark](https://aclanthology.org/2025.acl-long.1176.pdf)
[Yejin Bang](https://aclanthology.org/people/yejin-bang/unverified/),
[Ziwei Ji](https://aclanthology.org/people/ziwei-ji/),
[Alan Schelten](https://aclanthology.org/people/alan-schelten/unverified/),
[Anthony Hartshorn](https://aclanthology.org/people/anthony-hartshorn/unverified/),
[Tara Fowler](https://aclanthology.org/people/tara-fowler/unverified/),
[Cheng Zhang](https://aclanthology.org/people/cheng-zhang/),
[Nicola Cancedda](https://aclanthology.org/people/nicola-cancedda/unverified/),
[Pascale Fung](https://aclanthology.org/people/pascale-fung/unverified/)
* * *
##### Abstract
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as “hallucination.” These hallucinations undermine user trust and hinder the adoption of generative AI systems. Addressing hallucinations is important for the advancement of LLMs. This paper introduces a comprehensive hallucination benchmark, HalluLens, incorporating both extrinsic and intrinsic evaluation tasks, built upon a clear taxonomy of hallucination. A major challenge in benchmarking hallucinations is the lack of a unified framework due to inconsistent definitions and categorizations. We disentangle LLM hallucination from “factuality” and propose a taxonomy distinguishing extrinsic and intrinsic hallucinations to promote consistency and facilitate research. We emphasize extrinsic hallucinations – where generated content deviates from training data – as they become increasingly relevant with LLM advancements. However, no benchmark is solely dedicated to extrinsic hallucinations. To address this gap, HalluLens introduces three new extrinsic tasks with dynamic test set generation to mitigate data leakage and ensure robustness. We release the codebase for the extrinsic hallucination benchmark.
Anthology ID: 2025.acl-long.1176
Volume: [Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)](https://aclanthology.org/volumes/2025.acl-long/)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: [Wanxiang Che](https://aclanthology.org/people/wanxiang-che/unverified/), [Joyce Nabende](https://aclanthology.org/people/joyce-nabende/unverified/), [Ekaterina Shutova](https://aclanthology.org/people/ekaterina-shutova/unverified/), [Mohammad Taher Pilehvar](https://aclanthology.org/people/mohammad-taher-pilehvar/unverified/)
Venue: [ACL](https://aclanthology.org/venues/acl/ "Annual Meeting of the Association for Computational Linguistics")
Publisher: Association for Computational Linguistics
Pages: 24128–24156
URL: [https://aclanthology.org/2025.acl-long.1176/](https://aclanthology.org/2025.acl-long.1176/)
DOI: [10.18653/v1/2025.acl-long.1176](https://doi.org/10.18653/v1/2025.acl-long.1176 "To the current version of the paper by DOI")
Bibkey: bang-etal-2025-hallulens
Cite (ACL): Yejin Bang, Ziwei Ji, Alan Schelten, A
... (truncated, 4 KB total)
Resource ID: 351b4e7354e2dc5b | Stable ID: ZTZhN2E1Yj