Fact

Center for AI Safety — publication: Measuring Massive Multitask Language Understanding (MMLU) — widely-used benchmark for evaluating LLM capabilities across 57 academic subjects

partial · 85% confidence

1 evidence check

Last checked: 3/31/2026

The source (arxiv.org/abs/2009.03300) confirms the core facts: (1) MMLU measures multitask language understanding, (2) it covers 57 tasks/subjects across diverse domains, and (3) it was created by Hendrycks et al. However, the claim attributes the publication to the Center for AI Safety (CAIS), while the source lists the authors' affiliations as UC Berkeley, Columbia University, UChicago, and UIUC, not CAIS, so the attribution appears to be incorrect. The arXiv identifier 2009.03300 encodes a September 2020 submission, not 2021-01. The claim that MMLU became 'one of the most-cited AI benchmarks' is not addressed in the provided source excerpt.
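
The date reasoning above rests on the arXiv identifier format: new-style identifiers begin with a two-digit year and two-digit month (YYMM), followed by a sequence number. A minimal sketch of that decoding follows. The function name `arxiv_id_month` is hypothetical, and the code assumes a new-style identifier from the 2000s (it does not handle pre-2007 identifiers such as `hep-th/9901001`).

```python
import re
from datetime import date


def arxiv_id_month(arxiv_id: str) -> date:
    """Return the first day of the month encoded in a new-style arXiv ID.

    Hypothetical helper for illustration: assumes the YYMM.number scheme
    and a year in the range 2000-2099.
    """
    # YYMM, a dot, a 4- or 5-digit sequence number, optional version suffix.
    match = re.fullmatch(r"(\d{2})(\d{2})\.\d{4,5}(v\d+)?", arxiv_id)
    if match is None:
        raise ValueError(f"not a new-style arXiv identifier: {arxiv_id!r}")
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in identifier: {arxiv_id!r}")
    return date(year, month, 1)


if __name__ == "__main__":
    # The identifier cited in this record.
    print(arxiv_id_month("2009.03300").isoformat())
```

For the identifier in this record, the prefix `2009` decodes to year 2020, month 09, which is the basis for the September 2020 reading rather than 2021-01.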

Evidence — 1 source, 1 check

arxiv.org/abs/2009.03300 (1 check)

partial · 85% · primary · Haiku 4.5 · 3/31/2026

Found: The source confirms MMLU covers 57 tasks/subjects and was created by Hendrycks et al. However, the source does not identify CAIS as the publisher/creator, and does not confirm the 2021-01 date or that…
