[2406.04127] Are We Done with MMLU?
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Important for AI safety researchers relying on MMLU to assess LLM capabilities; demonstrates that benchmark errors can systematically mislead capability evaluations and model comparisons, which has downstream implications for deployment and alignment decisions.
Paper Details
Metadata
Abstract
Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error annotation protocol. Then, we create MMLU-Redux, which is a subset of 5,700 manually re-annotated questions across all 57 MMLU subjects. We estimate that 6.49% of MMLU questions contain errors. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0.
Summary
This paper systematically identifies and categorizes errors in the MMLU benchmark, finding that a substantial fraction of questions (e.g., 57% of Virology questions) contain ground truth errors. The authors introduce MMLU-Redux, a manually re-annotated subset of 3,000 questions, and show that these errors meaningfully distort LLM performance metrics and model rankings.
Key Points
- MMLU contains widespread ground truth errors, with some subsets (e.g., Virology) having 57% of questions affected, undermining benchmark reliability.
- A novel error taxonomy is introduced to categorize dataset errors ranging from parsing mistakes to deeper contextual and quality issues.
- MMLU-Redux provides 3,000 manually re-annotated questions across 30 subjects by 14 human experts as a corrected evaluation subset.
- Re-evaluation on MMLU-Redux reveals significant discrepancies in reported LLM performance metrics, including changes in model rankings.
- The work highlights the need for automated error-detection approaches, as reviewing all remaining MMLU questions manually is infeasible at scale.
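The re-evaluation idea behind the key points above can be illustrated with a minimal sketch. Note that the record schema, the error-category names, and the field names (`error_type`, `corrected_answer`) below are hypothetical stand-ins loosely inspired by the paper's taxonomy, not the actual MMLU-Redux format:

```python
# Hypothetical error categories, loosely inspired by the paper's taxonomy.
OK = "ok"                               # question and label are fine
WRONG_GROUNDTRUTH = "wrong_groundtruth" # label is wrong; a corrected one exists
BAD_QUESTION = "bad_question_clarity"   # question is ambiguous/unanswerable

def accuracy(questions, predictions):
    """Plain accuracy against the original (possibly erroneous) labels."""
    correct = sum(1 for q, p in zip(questions, predictions) if p == q["answer"])
    return correct / len(questions)

def corrected_accuracy(questions, predictions):
    """Accuracy after dropping unanswerable questions and scoring against
    corrected labels where the original ground truth was wrong."""
    kept, correct = 0, 0
    for q, p in zip(questions, predictions):
        if q["error_type"] == BAD_QUESTION:
            continue  # exclude from scoring entirely
        gold = (q["corrected_answer"]
                if q["error_type"] == WRONG_GROUNDTRUTH else q["answer"])
        kept += 1
        correct += (p == gold)
    return correct / kept if kept else 0.0

# Toy example: one clean question, one mislabeled, one unanswerable.
questions = [
    {"answer": "A", "error_type": OK},
    {"answer": "B", "error_type": WRONG_GROUNDTRUTH, "corrected_answer": "C"},
    {"answer": "D", "error_type": BAD_QUESTION},
]
predictions = ["A", "C", "D"]

naive = accuracy(questions, predictions)            # 2/3: "C" counted wrong
fixed = corrected_accuracy(questions, predictions)  # 1.0: re-annotation flips it
```

The toy example shows how error correction alone can move a model's score (here from 0.67 to 1.0), which is the mechanism behind the ranking shifts the paper reports.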
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Dan Hendrycks | Person | 19.0 |
Cached Content Preview
Are We Done with MMLU?
Aryo Pradipta Gema 1
Joshua Ong Jun Leang 1
Giwon Hong 1
Alessio Devoto 2
Alberto Carlo Maria Mancino 2,3
Rohit Saxena 1
Xuanli He 4
Yu Zhao 1
Xiaotang Du 1
Mohammad Reza Ghasemi Madani 5
Claire Barale 1
Robert McHardy 6
Joshua Harris 7
Jean Kaddour 4
Emile van Krieken 1
Pasquale Minervini 1
1 University of Edinburgh
2 Sapienza University of Rome
3 Polytechnic University of Bari
4 University College London
5 University of Trento
6 AssemblyAI
7 UK Health Security Agency
{first.last, jong2, p.minervini}@ed.ac.uk
alessio.devoto@uniroma1.it alberto.mancino@poliba.it
mr.ghasemimadani@unitn.it joshua.harris@ukhsa.gov.uk
{xuanli.he, jean.kaddour.20, robert.mchardy.20}@ucl.ac.uk
Abstract
Maybe not.
We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark.
Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs.
For example, we find that 57% of the analysed questions in the Virology subset contain errors.
To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy.
Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.
Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported.
Our results strongly advocate for revising MMLU’s error-ridden questions to enhance its future utility and reliability as a benchmark.
Therefore, we open up MMLU-Redux for additional annotation: https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux .
1 Introduction
The advent of transformer-based Large Language Models (LLMs) [1, 2, 3, 4, 5, 6, 7, 8] marked a significant advancement in generative models, enabling interaction with computing devices through natural language.
This advancement rendered many earlier benchmarks and leaderboards obsolete [9, 10], leading to the compilation of more challenging and comprehensive tests.
Among these benchmarks, Massive Multitask Language Understanding (MMLU) [11] has gained significant popularity: It assesses both the breadth and depth of language understanding capabilities of current LLMs across a diverse range of subjects, including mathematics, history, computer science, logic, law, etc.
However, the reliability of benchmarking results is only as robust as the quality of the dataset used.
We find that, despite its popularity, MMLU suffers from numerous errors that can mislead evaluation and model comparison [12, 13].
These errors, which range from simple parsing and scraping mistakes to more complex issues related to context, interpretation, and dataset quality, compromise the reliability of MMLU as a benchmark.
... (truncated, 72 KB total)