
Villalobos et al.

paper

Authors

Pablo Villalobos·Anson Ho·Jaime Sevilla·Tamay Besiroglu·Lennart Heim·Marius Hobbhahn

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Empirical research forecasting data scarcity constraints on LLM scaling, predicting saturation of public text training data by 2026-2032—critical for understanding fundamental limitations on continued model scaling and implications for future AI development.

Paper Details

Citations
7 influential
Year
2022

Metadata

arxiv preprint · analysis

Abstract

We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress.

Summary

This paper investigates the constraints on large language model (LLM) scaling imposed by the finite availability of public human-generated text data. The authors forecast training data demand based on current trends and estimate the total stock of publicly available human text, finding that models will exhaust this supply between 2026 and 2032 under current development trajectories. The paper examines potential pathways for continued progress beyond this data bottleneck, including synthetic data generation, transfer learning from data-rich domains, and improvements in data efficiency.
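To make the headline forecast concrete, here is a minimal sketch of the underlying exercise: project exponentially growing demand for training tokens against a slowly growing stock of public human text and find the crossing year. All constants below are illustrative stand-ins, not the authors' fitted estimates.

```python
# Sketch of the paper's core exercise: extrapolate training-data demand
# exponentially and find the year it overtakes the stock of public human text.
# All constants are illustrative stand-ins, not the authors' fitted estimates.

demand_tokens = 1.5e13   # assumed tokens in a recent frontier training run
demand_growth = 2.8      # assumed yearly multiplier for dataset size
stock_tokens = 3e14      # assumed effective stock of public human text
stock_growth = 1.07      # assumed ~7%/yr growth of the stock

def crossing_year(start=2024, horizon=2050):
    """Return the first year in which projected demand meets or exceeds the stock."""
    demand, stock = demand_tokens, stock_tokens
    for year in range(start, horizon):
        if demand >= stock:
            return year
        demand *= demand_growth
        stock *= stock_growth
    return None

print(crossing_year())  # -> 2028 with these stand-in numbers
```

With these stand-in numbers the crossing lands in the late 2020s, inside the paper's 2026–2032 window; the authors' actual model fits demand growth to historical trends and projected compute budgets.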

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Epoch AI | Organization | 51.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 48 KB
# Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning

Pablo Villalobos (1), Jaime Sevilla (1,2), Lennart Heim (1,4), Tamay Besiroglu (1,3), Marius Hobbhahn (1,5), Anson Ho (1)

###### Abstract

We analyze the growth of dataset sizes used in machine learning for natural language processing and computer vision, and extrapolate these using two methods: using the historical growth rate and estimating the compute-optimal dataset size for future predicted compute budgets. We investigate the growth in data usage by estimating the total stock of unlabeled data available on the internet over the coming decades. Our analysis indicates that the stock of high-quality language data will be exhausted soon, likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later: between 2030 and 2050 (for low-quality language) and between 2030 and 2060 (for images). Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down if data efficiency is not drastically improved or new sources of data become available.
Affiliations: (1) Epoch, (2) University of Aberdeen, (3) MIT Computer Science & Artificial Intelligence Laboratory, (4) Centre for the Governance of AI, (5) University of Tübingen
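The abstract's second extrapolation method sizes datasets to projected compute budgets. Below is a minimal sketch of that idea using the widely cited Chinchilla heuristics C ≈ 6·N·D and D ≈ 20·N; the paper fits its own scaling laws, so these constants, and the FLOP budgets in the example, are assumptions for illustration only.

```python
import math

# Sketch of the "compute-optimal dataset size" extrapolation method, using the
# common Chinchilla heuristics C ≈ 6·N·D (training FLOP) and D ≈ 20·N (tokens
# per parameter). The paper fits its own scaling laws; treat these as stand-ins.

def compute_optimal_tokens(flop_budget: float) -> float:
    """Compute-optimal token count D for a training budget C.

    From C = 6*N*D and D = 20*N it follows that N = sqrt(C/120),
    hence D = 20 * sqrt(C/120).
    """
    return 20 * math.sqrt(flop_budget / 120)

# Assumed FLOP budgets for hypothetical future frontier runs:
for year, flop in [(2024, 5e25), (2026, 1e27), (2028, 2e28)]:
    print(year, f"{compute_optimal_tokens(flop):.2e} tokens")
```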

## Key Takeaways

- We project the growth of training datasets for vision and language models using both the historical growth rate and the compute-optimal dataset size given current scaling laws and existing compute availability estimates (Section [III-A](https://ar5iv.labs.arxiv.org/html/2211.04325#S3.SS1 "III-A Projecting growth in training dataset sizes ‣ III Methods ‣ Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning")).

- We also project the growth in the total stock of unlabeled data, including high-quality language data (Section [III-B](https://ar5iv.labs.arxiv.org/html/2211.04325#S3.SS2 "III-B Estimating data accumulation rates ‣ III Methods ‣ Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning")).

- Language datasets have grown exponentially, by more than 50% per year, and contain up to 2e12 words as of October 2022 (Section [IV-A](https://ar5iv.labs.arxiv.org/html/2211.04325#S4.SS1 "IV-A Trends in dataset size ‣ IV Analysis ‣ Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning")).

- The stock of language data currently grows by ~7% yearly, but our model predicts a slowdown to ~1% by 2100. This stock is currently between 7e13 and 7e16 words, which is 1.5 to 4.5 orders of magnitude larger than the largest datasets used today (Section [IV-B1](https://ar5iv.labs.arxiv.org/html/2211.04325#S4.SS2.SSS1 "IV-B1 Low-quality data ‣ IV-B Language ‣ IV Analysis ‣ Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning")). A toy model of this slowing growth is sketched after the preview below.

- Based on these trends, we will likely run out of language data between 2030 and 2050 (Sec

... (truncated, 48 KB total)
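The ~7% to ~1% growth slowdown in the takeaways can be illustrated with a toy accumulation model. The paper derives the rate from internet-population and economic growth models; in the sketch below the decline is just a geometric interpolation between assumed endpoints, and the 7e14-word starting stock is an assumed midpoint of the paper's 7e13–7e16 range.

```python
# Toy illustration of the stock's growth rate declining from ~7%/yr toward
# ~1%/yr by 2100. The paper models this via internet-user and economic growth;
# here the decline is a simple geometric interpolation between assumed rates.

def stock_over_time(stock_2022=7e14, r0=0.07, r_final=0.01, start=2022, end=2100):
    """Yield (year, stock) with a yearly growth rate easing from r0 to r_final."""
    years = end - start
    stock = stock_2022
    for i in range(years + 1):
        yield start + i, stock
        # geometric interpolation of the growth rate between r0 and r_final
        rate = r0 * (r_final / r0) ** (i / years)
        stock *= 1 + rate

for year, stock in stock_over_time():
    if year % 20 == 2:  # sample 2022, 2042, 2062, 2082
        print(year, f"{stock:.2e} words")
```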