Will we run out of data? Limits of LLM scaling based on human-generated data
Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Marius Hobbhahn
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Investigates data scarcity constraints on LLM scaling, forecasting that the stock of public human-generated text data may be fully utilized between 2026 and 2032 (median estimate: 2028), with critical implications for the long-term sustainability of AI development and for safety considerations around model training limitations.
Paper Details
Metadata
Abstract
We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress.
Summary
This paper investigates whether the availability of public human-generated text data will constrain the scaling of large language models (LLMs). The authors forecast training data demand based on current scaling trends and estimate the total stock of publicly available human text, finding that if current development trajectories continue, models will exhaust the available stock of public human text data between 2026 and 2032 (median estimate: 2028). The paper explores potential solutions to overcome this data constraint, including synthetic data generation, transfer learning from data-rich domains, and improved data efficiency.
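The forecast logic in the summary can be sketched as a simple extrapolation: given an estimated stock of text and an assumed growth rate for training datasets, find the year where the projected dataset size meets the stock. The parameter values below are illustrative assumptions chosen to roughly match the figures reported in the paper (a ~4e14-token effective stock, median exhaustion around 2028), not the paper's fitted estimates.

```python
import math

# Illustrative parameters (assumptions, not the paper's fitted values):
stock_tokens = 4e14   # effective stock of public human text (from Figure 1)
d0_tokens = 1e13      # assumed size of the largest training dataset in the base year
base_year = 2024      # assumed base year for the extrapolation
growth_per_year = 2.8 # assumed multiplicative growth in dataset size per year

# Solve d0 * growth^t = stock for t, then convert to a calendar year.
years_until_exhaustion = math.log(stock_tokens / d0_tokens) / math.log(growth_per_year)
exhaustion_year = base_year + years_until_exhaustion
print(round(exhaustion_year, 1))  # ≈ 2027.6, inside the paper's 2026-2032 window
```

The paper's actual model is probabilistic (hence the 2026-2032 range rather than a point estimate), but the deterministic version above captures the core mechanism: exponential dataset growth against a roughly fixed stock.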
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Timelines | Concept | 95.0 |
Cached Content Preview
Will we run out of data? Limits of LLM scaling based on human-generated data

Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Marius Hobbhahn

Figure 1: Projections of the effective stock of human-generated public text and dataset sizes used to train notable LLMs. The intersection of the stock and dataset-size projection lines indicates the median year (2028) in which the stock is expected to be fully utilized if current LLM development trends continue. At this point, models will be trained on dataset sizes approaching the total effective stock of text in the indexed web: around 4e14 tokens, corresponding to training compute of ~5e28 FLOP for non-overtrained models. Individual dots represent dataset sizes of specific notable models. The model is explained in Section 2.

1 Introduction

Recent progress in language modeling has relied heavily on unsupervised training on vast amounts of human-generated text, primarily sourced from the web or curated corpora (Zhao et al., 2023).
The largest datasets of human-generated public text data, such as RefinedWeb, C4, and RedPajama, contain tens of trillions of words collected from billions of web pages (Penedo et al., 2023; Together.ai, 2023). The demand for public human text data is likely to continue growing. In order to scale the size of models and training runs efficiently, large language models (LLMs) are typically trained according to neural scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). These relationships imply that increasing the size of training datasets is crucial for efficiently improving the performance of LLMs. In this paper, we argue that human-generated public text data cannot sustain scaling beyond this decade. To support this conclusion, we develop a model of the growing demand for training data and the production of public human text data. We use this model to predict when the trajectory of LLM development will fully exhaust the available stock of public human text data. We then explore a range of potential strategies to circumvent this constraint, such as synthetic data generation, trans
... (truncated, 268 KB total)
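The figure caption's compute estimate can be sanity-checked with the common Chinchilla-style rule of thumb C ≈ 6·N·D and a compute-optimal ratio of roughly 20 training tokens per parameter. Both the 6·N·D approximation and the ratio of 20 are assumptions for this sketch; the paper's own accounting may differ in detail.

```python
# Rough check of Figure 1's compute estimate: training compute C ≈ 6 * N * D,
# with a compute-optimal token-to-parameter ratio of ~20 (assumed here).
D = 4e14           # effective stock of public human text, in tokens (Figure 1)
N = D / 20         # implied compute-optimal parameter count: 2e13 parameters
C = 6 * N * D      # estimated training FLOP for a non-overtrained model
print(f"{C:.1e}")  # → 4.8e+28, consistent with the ~5e28 FLOP quoted in Figure 1
```

This agreement is expected rather than surprising, since the caption's "non-overtrained" qualifier indicates the paper is applying a compute-optimal scaling relation of this general form.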