[2009.03300] Measuring Massive Multitask Language Understanding
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
MMLU is a foundational capability benchmark widely used to track LLM progress; relevant to AI safety for understanding capability levels and evaluating whether models meet thresholds for safe deployment.
Paper Details
Metadata
Abstract
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
Summary
Introduces the MMLU benchmark, a comprehensive evaluation suite covering 57 subjects across STEM, humanities, social sciences, and more, designed to measure breadth and depth of language model knowledge. The benchmark tests models from elementary to professional level and reveals significant gaps between human expert performance and state-of-the-art models at the time of publication. It became a standard benchmark for tracking LLM capability progress.
Key Points
- Introduces a 57-subject multiple-choice benchmark spanning STEM, humanities, law, medicine, and more to assess multitask language understanding (a few-shot evaluation sketch follows this list)
- Tests knowledge at difficulty levels from elementary school through professional/expert level (e.g., bar exam and medical licensing questions)
- At publication, the best models achieved roughly 40-50% accuracy versus an estimated ~90% for human experts, highlighting large capability gaps
- The benchmark design emphasizes world knowledge and problem solving, not just linguistic pattern matching
- Has since become one of the most widely cited benchmarks for tracking frontier LLM capability improvements
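To make the evaluation protocol concrete, below is a minimal Python sketch of few-shot multiple-choice scoring in the style the paper describes: a handful of solved examples from a subject's dev split are prepended to each test question, the model's answer letter is compared against the key, and accuracy is computed per subject and averaged. The `query_model` stub and the tiny in-memory data are hypothetical placeholders, not the authors' released evaluation code.

```python
# Illustrative sketch of MMLU-style few-shot multiple-choice evaluation.
# `query_model` and the in-memory example data are hypothetical stand-ins.

CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_question(question, choices, answer=None):
    """Render one question in letter-choice style; include the answer for dev (few-shot) examples."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICE_LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(subject, dev_examples, test_question, test_choices):
    """Prepend solved dev examples (the paper uses up to five) before the unanswered test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_question(q, c, a) for q, c, a in dev_examples)
    return header + shots + "\n\n" + format_question(test_question, test_choices)

def query_model(prompt):
    """Hypothetical stand-in: replace with a call to a real API or local model.
    It should return the model's chosen answer letter ("A"-"D") for the prompt."""
    return "A"  # placeholder

def evaluate(tasks):
    """tasks: {subject: (dev_examples, test_examples)}; each example is (question, choices, answer).
    Returns per-subject accuracy and the unweighted average across subjects."""
    per_subject = {}
    for subject, (dev, test) in tasks.items():
        correct = 0
        for question, choices, answer in test:
            prompt = build_few_shot_prompt(subject, dev, question, choices)
            prediction = query_model(prompt).strip()[:1].upper()
            correct += prediction == answer
        per_subject[subject] = correct / len(test)
    average = sum(per_subject.values()) / len(per_subject)
    return per_subject, average

if __name__ == "__main__":
    # Toy data for illustration only; the real benchmark has 57 subjects.
    tasks = {
        "elementary_mathematics": (
            [("What is 2 + 2?", ["3", "4", "5", "6"], "B")],   # dev (few-shot) examples
            [("What is 3 * 3?", ["6", "9", "12", "8"], "B")],  # test examples
        )
    }
    print(evaluate(tasks))
```

In the paper's setup, each prompt carries a few solved dev examples per subject, and overall accuracy is reported per task as well as averaged across the 57 tasks.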
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
| FAR AI | Organization | 76.0 |
| Dan Hendrycks | Person | 19.0 |
1 FactBase fact citing this source
| Entity | Property | Value | As Of |
|---|---|---|---|
| Center for AI Safety | publication | Measuring Massive Multitask Language Understanding (MMLU) — widely-used benchmark for evaluating LLM capabilities across 57 academic subjects | Jan 2021 |
Cached Content Preview
# Measuring Massive Multitask Language Understanding
Dan Hendrycks (UC Berkeley), Collin Burns (Columbia University), Steven Basart (UChicago), Andy Zou (UC Berkeley), Mantas Mazeika (UIUC), Dawn Song (UC Berkeley), Jacob Steinhardt (UC Berkeley)
###### Abstract
We propose a new test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model’s academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
## 1 Introduction
Natural Language Processing (NLP) models have achieved superhuman performance on a number of recently proposed benchmarks.
However, these models are still well below human level performance for language understanding as a whole, suggesting a disconnect between our benchmarks and the actual capabilities of these models.
The General Language Understanding Evaluation benchmark (GLUE) (Wang et al., [2018](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib29 "")) was introduced in 2018 to evaluate performance on a wide range of NLP tasks, and top models achieved superhuman performance within a year. To address the shortcomings of GLUE, researchers designed the SuperGLUE benchmark with more difficult tasks (Wang et al., [2019](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib30 "")). About a year since the release of SuperGLUE, performance is again essentially human-level (Raffel et al., [2019](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib25 "")). While these benchmarks evaluate linguistic skills more than overall language understanding, an array of commonsense benchmarks have been proposed to measure basic reasoning and everyday knowledge (Zellers et al., [2019](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib31 ""); Huang et al., [2019](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib13 ""); Bisk et al., [2019](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib2 "")).
However, these recent benchmarks have similarly seen rapid progress (Khashabi et al., [2020](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib15 "")). Overall, the near human-level performance on these benchmarks suggests that they are not capturing important facets of language understanding.
Transformer mod
... (truncated, 84 KB total)
Stable ID: YjVjOGRhMT