[2009.03300] Measuring Massive Multitask Language Understanding
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
MMLU is a foundational capability benchmark widely used to track LLM progress; relevant to AI safety for understanding capability levels and evaluating whether models meet thresholds for safe deployment.
Paper Details
Metadata
Abstract
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
Summary
Introduces the MMLU benchmark, a comprehensive evaluation suite covering 57 subjects across STEM, humanities, social sciences, and more, designed to measure breadth and depth of language model knowledge. The benchmark tests models from elementary to professional level and reveals significant gaps between human expert performance and state-of-the-art models at the time of publication. It became a standard benchmark for tracking LLM capability progress.
Key Points
- Introduces a 57-subject multiple-choice benchmark spanning STEM, humanities, law, medicine, and more to assess multitask language understanding (a few-shot evaluation sketch follows this list)
- Tests knowledge at difficulty levels from elementary school through professional/expert level (e.g., bar exam and medical licensing questions)
- At publication, the best models achieved roughly 40-50% accuracy versus an estimated ~90% for human experts, highlighting large capability gaps
- The benchmark design emphasizes world knowledge and problem solving, not just linguistic pattern matching
- Has since become one of the most widely cited benchmarks for tracking frontier LLM capability improvements
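To make the evaluation protocol concrete, below is a minimal Python sketch of few-shot multiple-choice scoring in the style the paper describes: a handful of solved examples from a subject's dev split are prepended to each test question, the model's answer letter is compared against the key, and accuracy is computed per subject and averaged. The `query_model` stub and the tiny in-memory data are hypothetical placeholders, not the authors' released evaluation code.

```python
# Illustrative sketch of MMLU-style few-shot multiple-choice evaluation.
# `query_model` and the in-memory example data are hypothetical stand-ins.

CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_question(question, choices, answer=None):
    """Render one question in letter-choice style; include the answer for dev (few-shot) examples."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICE_LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(subject, dev_examples, test_question, test_choices):
    """Prepend solved dev examples (the paper uses up to five) before the unanswered test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_question(q, c, a) for q, c, a in dev_examples)
    return header + shots + "\n\n" + format_question(test_question, test_choices)

def query_model(prompt):
    """Hypothetical stand-in: replace with a call to a real API or local model.
    It should return the model's chosen answer letter ("A"-"D") for the prompt."""
    return "A"  # placeholder

def evaluate(tasks):
    """tasks: {subject: (dev_examples, test_examples)}; each example is (question, choices, answer).
    Returns per-subject accuracy and the unweighted average across subjects."""
    per_subject = {}
    for subject, (dev, test) in tasks.items():
        correct = 0
        for question, choices, answer in test:
            prompt = build_few_shot_prompt(subject, dev, question, choices)
            prediction = query_model(prompt).strip()[:1].upper()
            correct += prediction == answer
        per_subject[subject] = correct / len(test)
    average = sum(per_subject.values()) / len(per_subject)
    return per_subject, average

if __name__ == "__main__":
    # Toy data for illustration only; the real benchmark has 57 subjects.
    tasks = {
        "elementary_mathematics": (
            [("What is 2 + 2?", ["3", "4", "5", "6"], "B")],   # dev (few-shot) examples
            [("What is 3 * 3?", ["6", "9", "12", "8"], "B")],  # test examples
        )
    }
    print(evaluate(tasks))
```

In the paper's setup, each prompt carries a few solved dev examples per subject, and overall accuracy is reported per task as well as averaged across the 57 tasks.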
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
| FAR AI | Organization | 76.0 |
| Dan Hendrycks | Person | 19.0 |
1 FactBase fact citing this source
| Entity | Property | Value | As Of |
|---|---|---|---|
| Center for AI Safety | publication | Measuring Massive Multitask Language Understanding (MMLU) — widely-used benchmark for evaluating LLM capabilities across 57 academic subjects | Jan 2021 |
Cached Content Preview
# Measuring Massive Multitask Language Understanding
Dan Hendrycks (UC Berkeley), Collin Burns (Columbia University), Steven Basart (UChicago), Andy Zou (UC Berkeley), Mantas Mazeika (UIUC), Dawn Song (UC Berkeley), Jacob Steinhardt (UC Berkeley)
###### Abstract
We propose a new test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model’s academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
## 1 Introduction
Natural Language Processing (NLP) models have achieved superhuman performance on a number of recently proposed benchmarks.
However, these models are still well below human level performance for language understanding as a whole, suggesting a disconnect between our benchmarks and the actual capabilities of these models.
The General Language Understanding Evaluation benchmark (GLUE) (Wang et al., [2018](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib29 "")) was introduced in 2018 to evaluate performance on a wide range of NLP tasks, and top models achieved superhuman performance within a year. To address the shortcomings of GLUE, researchers designed the SuperGLUE benchmark with more difficult tasks (Wang et al., [2019](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib30 "")). About a year since the release of SuperGLUE, performance is again essentially human-level (Raffel et al., [2019](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib25 "")). While these benchmarks evaluate linguistic skills more than overall language understanding, an array of commonsense benchmarks have been proposed to measure basic reasoning and everyday knowledge (Zellers et al., [2019](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib31 ""); Huang et al., [2019](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib13 ""); Bisk et al., [2019](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib2 "")).
However, these recent benchmarks have similarly seen rapid progress (Khashabi et al., [2020](https://ar5iv.labs.arxiv.org/html/2009.03300#bib.bib15 "")). Overall, the near human-level performance on these benchmarks suggests that they are not capturing important facets of language understanding.
Transformer mod
... (truncated, 84 KB total)
Stable ID: YjVjOGRhMT