
Villalobos et al.

paper

Authors

Pablo Villalobos·Anson Ho·Jaime Sevilla·Tamay Besiroglu·Lennart Heim·Marius Hobbhahn

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Empirical research forecasting data scarcity constraints on LLM scaling, predicting saturation of public text training data by 2026-2032—critical for understanding fundamental limitations on continued model scaling and implications for future AI development.

Paper Details

Citations
7 influential
Year
2022

Metadata

arxiv preprint · analysis

Abstract

We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress.

Summary

This paper investigates the constraints on large language model (LLM) scaling imposed by the finite availability of public human-generated text data. The authors forecast training data demand based on current trends and estimate the total stock of publicly available human text, finding that models will exhaust this supply between 2026 and 2032 under current development trajectories. The paper examines potential pathways for continued progress beyond this data bottleneck, including synthetic data generation, transfer learning from data-rich domains, and improvements in data efficiency.
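To make the headline forecast concrete, here is a minimal sketch of the underlying exercise: project exponentially growing demand for training tokens against a slowly growing stock of public human text and find the crossing year. All constants below are illustrative stand-ins, not the authors' fitted estimates.

```python
# Sketch of the paper's core exercise: extrapolate training-data demand
# exponentially and find the year it overtakes the stock of public human text.
# All constants are illustrative stand-ins, not the authors' fitted estimates.

demand_tokens = 1.5e13   # assumed tokens in a recent frontier training run
demand_growth = 2.8      # assumed yearly multiplier for dataset size
stock_tokens = 3e14      # assumed effective stock of public human text
stock_growth = 1.07      # assumed ~7%/yr growth of the stock

def crossing_year(start=2024, horizon=2050):
    """Return the first year in which projected demand meets or exceeds the stock."""
    demand, stock = demand_tokens, stock_tokens
    for year in range(start, horizon):
        if demand >= stock:
            return year
        demand *= demand_growth
        stock *= stock_growth
    return None

print(crossing_year())  # -> 2028 with these stand-in numbers
```

With these stand-in numbers the crossing lands in the late 2020s, inside the paper's 2026–2032 window; the authors' actual model fits demand growth to historical trends and projected compute budgets.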

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Epoch AI | Organization | 51.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 48 KB
# Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning

Pablo Villalobos (1), Jaime Sevilla (1,2), Lennart Heim (1,4), Tamay Besiroglu (1,3), Marius Hobbhahn (1,5), Anson Ho (1)

###### Abstract

We analyze the growth of dataset sizes used in machine learning for natural language processing and computer vision, and extrapolate these using two methods: using the historical growth rate and estimating the compute-optimal dataset size for future predicted compute budgets. We investigate the growth in data usage by estimating the total stock of unlabeled data available on the internet over the coming decades. Our analysis indicates that the stock of high-quality language data will be exhausted soon, likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later: between 2030 and 2050 (for low-quality language) and between 2030 and 2060 (for images). Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down if data efficiency is not drastically improved or new sources of data become available.
Affiliations: (1) Epoch, (2) University of Aberdeen, (3) MIT Computer Science & Artificial Intelligence Laboratory, (4) Centre for the Governance of AI, (5) University of Tübingen
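The abstract's second extrapolation method sizes datasets to projected compute budgets. Below is a minimal sketch of that idea using the widely cited Chinchilla heuristics C ≈ 6·N·D and D ≈ 20·N; the paper fits its own scaling laws, so these constants, and the FLOP budgets in the example, are assumptions for illustration only.

```python
import math

# Sketch of the "compute-optimal dataset size" extrapolation method, using the
# common Chinchilla heuristics C ≈ 6·N·D (training FLOP) and D ≈ 20·N (tokens
# per parameter). The paper fits its own scaling laws; treat these as stand-ins.

def compute_optimal_tokens(flop_budget: float) -> float:
    """Compute-optimal token count D for a training budget C.

    From C = 6*N*D and D = 20*N it follows that N = sqrt(C/120),
    hence D = 20 * sqrt(C/120).
    """
    return 20 * math.sqrt(flop_budget / 120)

# Assumed FLOP budgets for hypothetical future frontier runs:
for year, flop in [(2024, 5e25), (2026, 1e27), (2028, 2e28)]:
    print(year, f"{compute_optimal_tokens(flop):.2e} tokens")
```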

## Key Takeaways

- We project the growth of training datasets for vision and language models using both the historical growth rate and the compute-optimal dataset size given current scaling laws and existing compute availability estimates (Section [III-A](https://ar5iv.labs.arxiv.org/html/2211.04325#S3.SS1 "III-A Projecting growth in training dataset sizes ‣ III Methods ‣ Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning")).

- We also project the growth in the total stock of unlabeled data, including high-quality language data (Section [III-B](https://ar5iv.labs.arxiv.org/html/2211.04325#S3.SS2 "III-B Estimating data accumulation rates ‣ III Methods ‣ Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning")).

- Language datasets have grown exponentially, by more than 50% per year, and contain up to 2e12 words as of October 2022 (Section [IV-A](https://ar5iv.labs.arxiv.org/html/2211.04325#S4.SS1 "IV-A Trends in dataset size ‣ IV Analysis ‣ Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning")).

- The stock of language data currently grows by ~7% yearly, but our model predicts a slowdown to ~1% by 2100. This stock is currently between 7e13 and 7e16 words, which is 1.5 to 4.5 orders of magnitude larger than the largest datasets used today (Section [IV-B1](https://ar5iv.labs.arxiv.org/html/2211.04325#S4.SS2.SSS1 "IV-B1 Low-quality data ‣ IV-B Language ‣ IV Analysis ‣ Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning")). A toy model of this slowing growth is sketched after the preview below.

- Based on these trends, we will likely run out of language data between 2030 and 2050 (Sec

... (truncated, 48 KB total)
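The ~7% to ~1% growth slowdown in the takeaways can be illustrated with a toy accumulation model. The paper derives the rate from internet-population and economic growth models; in the sketch below the decline is just a geometric interpolation between assumed endpoints, and the 7e14-word starting stock is an assumed midpoint of the paper's 7e13–7e16 range.

```python
# Toy illustration of the stock's growth rate declining from ~7%/yr toward
# ~1%/yr by 2100. The paper models this via internet-user and economic growth;
# here the decline is a simple geometric interpolation between assumed rates.

def stock_over_time(stock_2022=7e14, r0=0.07, r_final=0.01, start=2022, end=2100):
    """Yield (year, stock) with a yearly growth rate easing from r0 to r_final."""
    years = end - start
    stock = stock_2022
    for i in range(years + 1):
        yield start + i, stock
        # geometric interpolation of the growth rate between r0 and r_final
        rate = r0 * (r_final / r0) ** (i / years)
        stock *= 1 + rate

for year, stock in stock_over_time():
    if year % 20 == 2:  # sample 2022, 2042, 2062, 2082
        print(year, f"{stock:.2e} words")
```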