FACTS Grounding: A Benchmark for Evaluating Factuality and Grounding in Large Language Models
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Relevant to AI safety researchers concerned with LLM reliability and hallucination; provides a standardized benchmark for measuring factual grounding, a key property for trustworthy AI systems in retrieval-augmented and document-based settings.
Paper Details
Metadata
Abstract
We introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates language models' ability to generate text that is factually accurate with respect to given context in the user prompt. In our benchmark, each prompt includes a user request and a full document, with a maximum length of 32k tokens, requiring long-form responses. The long-form responses are required to be fully grounded in the provided context document while fulfilling the user request. Models are evaluated using automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) they are judged as accurate if the response is fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test-set to pick the best prompt template, and the final factuality score is an aggregate of multiple judge models to mitigate evaluation bias. The FACTS Grounding leaderboard will be actively maintained over time, and contains both public and private splits to allow for external participation while guarding the integrity of the leaderboard. It can be found at https://www.kaggle.com/facts-leaderboard.
Summary
Google DeepMind introduces FACTS Grounding, an online leaderboard and benchmark evaluating LLMs' ability to generate responses fully grounded in provided context documents up to 32k tokens. The two-phase automated evaluation first checks whether responses fulfill user requests, then assesses factual grounding, using aggregated judge models to reduce evaluation bias. The benchmark includes both public and private leaderboard splits to enable external participation while preserving integrity.
Key Points
- Benchmarks LLMs on grounded factuality: whether responses are fully supported by a provided context document, distinct from world-knowledge factuality.
- Two-phase evaluation: (1) disqualify responses failing to fulfill the user request, (2) judge factual grounding against the source document.
- Aggregates multiple LLM judge models to mitigate individual model bias in automated evaluation.
- Supports long-form responses with context documents up to 32k tokens, reflecting realistic RAG and summarization scenarios.
- Actively maintained online leaderboard hosted on Kaggle, with public and private splits to guard against overfitting and gaming.
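The two-phase scoring described in the key points above can be sketched as follows. This is a minimal illustrative sketch, not the benchmark's actual implementation: the real judges are LLMs driven by a selected prompt template, whereas here each judge is a stub callable returning `(eligible, grounded)` booleans, and the 0/1 averaging is an assumed simplification of the paper's aggregation.

```python
from statistics import mean

def score_response(response, document, request, judges):
    """Two-phase FACTS-style scoring sketch.

    Phase 1: responses that fail to fulfill the user request are
    disqualified (scored 0). Phase 2: eligible responses score 1 only
    if fully grounded in the document. The final score averages
    verdicts across judges to mitigate single-judge bias.
    """
    verdicts = []
    for judge in judges:
        eligible, grounded = judge(response, document, request)
        verdicts.append(1.0 if (eligible and grounded) else 0.0)
    return mean(verdicts)

# Stub judges standing in for the LLM judge models.
judges = [
    lambda r, d, q: (True, True),    # fulfills request, fully grounded
    lambda r, d, q: (True, False),   # fulfills request, not grounded
    lambda r, d, q: (False, True),   # disqualified in phase 1
]
score = score_response("resp", "doc", "req", judges)  # one of three passes
```

Averaging across judges is what dampens any single judge's systematic leniency or strictness, which is the bias-mitigation point the benchmark makes.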
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Capability | 60.0 |
Cached Content Preview
[2501.03200] The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input
Corresponding author: facts-leaderboard@google.com
Alon Jacovi*, Andrew Wang*, Chris Alberti*, Connie Tao*, Jon Lipovetz*, Kate Olszewska*, Lukas Haas*, Michelle Liu*, Nate Keating*, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, Dipanjan Das* (* equal contribution)
1 Introduction
Factuality is one of the most challenging aspects of Large Language Models (LLMs), referring to a model's ability to generate factually accurate responses in information-seeking scenarios. Commonly, this area of research can be divided into two distinct scenarios: (1) factuality with respect to given context, such as a user request and grounding documents, such that the model response is fully grounded in the input (by this, we imply that a model response has the highest degree of faithfulness to given context as defined by Rashkin et al., 2023), and (2) factuality with respect to external sources and general world knowledge (Tang et
... (truncated, 53 KB total)