Longterm Wiki

"Are Emergent Abilities of Large Language Models a Mirage?"

paper

Authors

Rylan Schaeffer·Brando Miranda·Sanmi Koyejo

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Highly influential NeurIPS 2023 paper that directly challenges the 'emergent abilities' narrative central to many AI risk and forecasting arguments, suggesting unpredictable capability jumps may be a measurement artifact rather than a real scaling phenomenon.

Paper Details

Citations
2
22 influential
Year
2023
Methodology
peer-reviewed
Categories
Advances in Neural Information Processing Systems

Metadata

Importance: 82/100 · arXiv preprint · primary source

Abstract

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance. We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities; (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.

Summary

This paper argues that apparent emergent abilities in large language models are artifacts of metric choice rather than genuine phase transitions in model behavior. Using mathematical modeling and empirical analysis across GPT-3, BIG-Bench, and vision models, the authors show that nonlinear metrics create illusory sharp transitions while linear metrics reveal smooth, predictable scaling. The findings suggest emergent abilities may not be a fundamental property of AI scaling.

Key Points

  • Apparent emergent abilities arise from nonlinear or discontinuous evaluation metrics, not from fundamental changes in model behavior at scale.
  • Switching to linear or continuous metrics reveals smooth, predictable performance improvements across model sizes, eliminating apparent phase transitions.
  • Authors demonstrate the effect empirically on InstructGPT/GPT-3 family, BIG-Bench tasks, and vision models across diverse architectures.
  • The paper shows researchers can artificially induce 'never-before-seen' emergent abilities in vision tasks simply by choosing appropriate metrics.
  • Challenges a widely cited property of LLM scaling with significant implications for AI forecasting, risk modeling, and capability evaluation.
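The mechanism described in the key points can be illustrated with a toy simulation in the spirit of the paper's mathematical model (all curve shapes and numbers below are illustrative assumptions, not the paper's data): if per-token accuracy improves smoothly with scale, an exact-match metric over a long target sequence (a nonlinear metric, roughly p^L) looks like a sharp "emergent" jump, while mean per-token accuracy (a linear metric) stays smooth and predictable.

```python
import numpy as np

# Hypothetical model family: nine models with log-spaced "scales"
# (e.g., parameter counts from 1e7 to 1e11; purely illustrative).
scales = np.logspace(7, 11, 9)

# Assume per-token accuracy improves smoothly and predictably with scale
# (a power-law-like curve chosen for illustration only).
per_token_acc = 1 - 0.9 * (scales / scales[0]) ** -0.37

L = 30  # length of the target sequence, in tokens

# Nonlinear metric: exact match on all L tokens ~ p^L.
# Near-zero for most scales, then rises sharply -- an apparent "emergence".
exact_match = per_token_acc ** L

# Linear metric: mean per-token accuracy is just p -- smooth throughout.
for s, p, em in zip(scales, per_token_acc, exact_match):
    print(f"scale={s:.0e}  token_acc={p:.3f}  exact_match={em:.3g}")
```

The underlying quantity (per-token accuracy) changes smoothly the whole time; only the exact-match view of the same outputs shows a discontinuous-looking transition, which is the paper's core point about metric choice.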

Cited by 4 pages

| Page | Type | Quality |
| --- | --- | --- |
| AI Accident Risk Cruxes | Crux | 67.0 |
| AI Scaling Laws | Concept | 92.0 |
| Emergent Capabilities | Risk | 61.0 |
| Sharp Left Turn | Risk | 69.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 62 KB
# Are Emergent Abilities of Large Language Models a Mirage?

Rylan Schaeffer
Brando Miranda
Sanmi Koyejo

###### Abstract

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models.
What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales.
Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher’s choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance.
We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities, (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep networks.
Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.

## 1 Introduction

Emergent properties of complex systems have long been studied across disciplines, from physics to biology to mathematics.
The idea of emergence was popularized by Nobel Prize-winning physicist P.W. Anderson’s “More Is Different” ( [anderson1972more,](https://ar5iv.labs.arxiv.org/html/2304.15004#bib.bib1 "")), which argues that as the complexity of a system increases, new properties may materialize that cannot be predicted even from a precise quantitative understanding of the system’s microscopic details. Recently, the idea of emergence gained significant attention in machine learning due to observations that large language models (LLMs) such as GPT ( [brown2020language,](https://ar5iv.labs.arxiv.org/html/2304.15004#bib.bib3 "")), PaLM ( [chowdhery2022palm,](https://ar5iv.labs.arxiv.org/html/2304.15004#bib.bib6 "")) and LaMDA [thoppilan2022lamda](https://ar5iv.labs.arxiv.org/html/2304.15004#bib.bib30 "") exhibit so-called “emergent abilities” ( [wei2022emergent,](https://ar5iv.labs.arxiv.org/html/2304.15004#bib.bib33 ""); [ganguli2022predictability,](https://ar5iv.labs.arxiv.org/html/2304.15004#bib.bib8 ""); [srivastava2022beyond,](https://ar5iv.labs.arxiv.org/html/2304.15004#bib.bib28 ""); [brown2020language,](https://ar5iv.labs.arxiv.org/html/2304.1

... (truncated, 62 KB total)
Resource ID: 22db72cf2a806d3b | Stable ID: NDkxNjIwY2