FACTS Grounding: A Benchmark for Evaluating Factuality and Grounding in Large Language Models
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Relevant to AI safety researchers concerned with LLM reliability and hallucination; provides a standardized benchmark for measuring factual grounding, a key property for trustworthy AI systems in retrieval-augmented and document-based settings.
Paper Details
Metadata
Abstract
We introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates language models' ability to generate text that is factually accurate with respect to given context in the user prompt. In our benchmark, each prompt includes a user request and a full document, with a maximum length of 32k tokens, requiring long-form responses. The long-form responses are required to be fully grounded in the provided context document while fulfilling the user request. Models are evaluated using automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) they are judged as accurate if the response is fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test-set to pick the best prompt template, and the final factuality score is an aggregate of multiple judge models to mitigate evaluation bias. The FACTS Grounding leaderboard will be actively maintained over time, and contains both public and private splits to allow for external participation while guarding the integrity of the leaderboard. It can be found at https://www.kaggle.com/facts-leaderboard.
Summary
Google DeepMind introduces FACTS Grounding, an online leaderboard and benchmark evaluating LLMs' ability to generate responses fully grounded in provided context documents up to 32k tokens. The two-phase automated evaluation first checks whether responses fulfill user requests, then assesses factual grounding, using aggregated judge models to reduce evaluation bias. The benchmark includes both public and private leaderboard splits to enable external participation while preserving integrity.
Key Points
- Benchmarks LLMs on grounded factuality: whether responses are fully supported by a provided context document, distinct from world-knowledge factuality.
- Two-phase evaluation: (1) disqualify responses failing to fulfill the user request, (2) judge factual grounding against the source document.
- Aggregates multiple LLM judge models to mitigate individual model bias in automated evaluation.
- Supports long-form responses with context documents up to 32k tokens, reflecting realistic RAG and summarization scenarios.
- Actively maintained online leaderboard hosted on Kaggle, with public and private splits to guard against overfitting and gaming.
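The two-phase scoring described in the key points above can be sketched as follows. This is a minimal illustrative sketch, not the benchmark's actual implementation: the real judges are LLMs driven by a selected prompt template, whereas here each judge is a stub callable returning `(eligible, grounded)` booleans, and the 0/1 averaging is an assumed simplification of the paper's aggregation.

```python
from statistics import mean

def score_response(response, document, request, judges):
    """Two-phase FACTS-style scoring sketch.

    Phase 1: responses that fail to fulfill the user request are
    disqualified (scored 0). Phase 2: eligible responses score 1 only
    if fully grounded in the document. The final score averages
    verdicts across judges to mitigate single-judge bias.
    """
    verdicts = []
    for judge in judges:
        eligible, grounded = judge(response, document, request)
        verdicts.append(1.0 if (eligible and grounded) else 0.0)
    return mean(verdicts)

# Stub judges standing in for the LLM judge models.
judges = [
    lambda r, d, q: (True, True),    # fulfills request, fully grounded
    lambda r, d, q: (True, False),   # fulfills request, not grounded
    lambda r, d, q: (False, True),   # disqualified in phase 1
]
score = score_response("resp", "doc", "req", judges)  # one of three passes
```

Averaging across judges is what dampens any single judge's systematic leniency or strictness, which is the bias-mitigation point the benchmark makes.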
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Capability | 60.0 |
Cached Content Preview
[2501.03200] The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input
Corresponding author: facts-leaderboard@google.com
Alon Jacovi*, Andrew Wang*, Chris Alberti*, Connie Tao*, Jon Lipovetz*, Kate Olszewska*, Lukas Haas*, Michelle Liu*, Nate Keating*, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, Dipanjan Das* (* equal contribution)
1 Introduction
Factuality is one of the most challenging aspects of Large Language Models (LLMs), referring to a model's ability to generate factually accurate responses in information-seeking scenarios. Commonly, this area of research can be divided into two distinct scenarios: (1) factuality with respect to given context, such as a user request and grounding documents, such that the model response is fully grounded in the input (by this, we imply that a model response has the highest degree of faithfulness to given context as defined by Rashkin et al., 2023), and (2) factuality with respect to external sources and general world knowledge (Tang et
... (truncated, 53 KB total)