"Will We Run Out of Data?"
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Epoch AI
This Epoch AI analysis is a key empirical reference for understanding whether data scarcity—rather than compute—could become the binding constraint on LLM scaling in the near term, directly relevant to forecasting AI progress trajectories.
Metadata
Summary
Epoch AI estimates the total effective stock of high-quality human-generated public text at approximately 300 trillion tokens (90% CI: 100T–1000T) and projects this data will be fully utilized between 2026 and 2032. The timeline compresses significantly depending on the overtraining strategy: 100x overtraining could exhaust the available data as early as 2025. The analysis highlights data availability as a potential near-term bottleneck to AI scaling alongside compute.
Key Points
- Total effective stock of high-quality public text data estimated at ~300T tokens (90% CI: 100T–1000T), accounting for quality filtering and multi-epoch training.
- Data stock projected to be fully utilized between 2026 and 2032 (80% CI), depending on how models are scaled and overtrained.
- Overtraining by 100x (fewer parameters, more data) could exhaust the available data by 2025; modest 5x overtraining pushes the deadline to 2027 (see the sketch after this list).
- Llama 3-70B was overtrained by ~10x, illustrating how industry practice already accelerates data consumption.
- A profit-maximizing model of AI developer incentives suggests overtraining factors of up to 100x may be economically rational, accelerating the data bottleneck.
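A minimal sketch of the overtraining arithmetic behind these points, assuming the common approximation C ≈ 6·N·D and a compute-optimal ratio of roughly 20 tokens per parameter; those constants, and the fixed 5e28 FLOP budget, are simplifying assumptions for illustration rather than Epoch AI's exact model.

```python
# Sketch: how the overtraining factor changes data consumed at a fixed
# compute budget. Assumes C ~ 6*N*D and a compute-optimal ratio of ~20
# tokens per parameter; both are simplifications, not Epoch AI's model.

EFFECTIVE_STOCK = 300e12       # ~300T tokens (median estimate above)
TOKENS_PER_PARAM_OPTIMAL = 20  # assumed compute-optimal tokens-per-parameter ratio

def tokens_used(compute_flop: float, overtraining: float) -> float:
    """Training tokens consumed at a given compute budget, for a given
    overtraining factor (1 = compute-optimal, 100 = 100x overtrained)."""
    # Compute-optimal split: C = 6*N*D with D = 20*N  =>  N = sqrt(C / 120)
    n_optimal = (compute_flop / (6 * TOKENS_PER_PARAM_OPTIMAL)) ** 0.5
    d_optimal = TOKENS_PER_PARAM_OPTIMAL * n_optimal
    # Overtraining by k at fixed compute: parameters shrink by sqrt(k),
    # data grows by sqrt(k), keeping C = 6*N*D constant.
    return d_optimal * overtraining ** 0.5

for k in (1, 5, 10, 100):
    d = tokens_used(5e28, k)
    print(f"overtraining {k:>3}x: ~{d / 1e12:,.0f}T tokens "
          f"({d / EFFECTIVE_STOCK:.1f}x the 300T stock)")
```

Under these assumptions, overtraining by a factor k at fixed compute increases the tokens consumed by roughly √k, which is why a 100x overtraining factor burns through the stock an order of magnitude faster than compute-optimal training.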
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Epoch AI | Organization | 51.0 |
Cached Content Preview
Will we run out of data to train large language models? | Epoch AI
Introduction
Scaling has been a key factor driving progress in AI. Models are growing in parameters and being trained on increasingly enormous datasets, leading to exponential growth in training compute and dramatic increases in performance. For example, five years and four orders of magnitude of compute separate the barely coherent GPT-2 from the powerful GPT-4.
So far, AI developers have not faced major limits to scaling beyond simply procuring AI chips, which are scarce but rapidly growing in supply. If chips are the only bottleneck, then AI systems are likely to continue growing exponentially in compute and expanding the frontier of capabilities. As such, a key question in forecasting AI progress is whether inputs other than raw compute could become binding constraints.
In particular, scaling requires growing training datasets. The most powerful AI systems to date are language models that are primarily trained on trillions of words of human-generated text from the internet. However, there is only a finite amount of human-generated data out there, which raises the question of whether training data could become the main bottleneck to scaling.
In our new paper, we attempt to shed light on this question by estimating the stock of human-generated public text data, updating our 2022 analysis of this topic.
Results
We find that the total effective stock of human-generated public text data is on the order of 300 trillion tokens, with a 90% confidence interval of 100T to 1000T. This estimate includes only data that is sufficiently high-quality to be used for training, and accounts for the possibility of training models for multiple epochs. 1
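A back-of-the-envelope sketch of what "effective stock" means here: a raw text figure reduced by quality filtering and then expanded by training for multiple epochs. The raw size, quality fraction, and epoch multiplier below are illustrative placeholders chosen to land near the 300T median, not the paper's fitted values.

```python
# Illustrative decomposition of an "effective" data stock. The specific
# numbers below are placeholders for exposition, not Epoch AI's estimates;
# only the ~300T-token result is anchored to the text above.

raw_web_tokens = 500e12   # assumed raw public web text, in tokens
quality_fraction = 0.2    # assumed share surviving quality filtering
epoch_multiplier = 3      # assumed effective value of repeating data across epochs

effective_stock = raw_web_tokens * quality_fraction * epoch_multiplier
print(f"effective stock ≈ {effective_stock / 1e12:.0f}T tokens")  # ≈ 300T
```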
We report estimates of the stocks of several types of data in Figure 1.
Figure 1: Estimates of different stocks of data, in tokens. 2 These estimates do not take data quality or multi-epoch training into account.
Given our estimate of the data stock, we then forecast when this data would be fully utilized. We develop two models of dataset growth. One simply extrapolates the historical growth rate in dataset sizes, and the other accounts for our projection of training compute growth and derives the corresponding dataset size (more below). Our overall projection, shown in Figure 2, comes from combining these two models. Our 80% confidence interval is that the data stock will be fully utilized at some point between 2026 and 2032.
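A minimal sketch of the first of the two growth models, assuming a constant exponential growth rate in frontier dataset sizes; the 2024 starting size and the growth rate are illustrative assumptions, not the paper's fitted parameters.

```python
import math

# Sketch of the historical-extrapolation model: grow the largest training
# dataset at a constant exponential rate until it reaches the effective stock.
# The starting size and growth rate below are assumptions, not Epoch AI's fit.

EFFECTIVE_STOCK = 300e12   # ~300T tokens (median estimate above)
dataset_2024 = 15e12       # assumed frontier dataset size in 2024 (Llama 3 scale)
annual_growth = 2.8        # assumed multiplicative growth in dataset size per year

years_left = math.log(EFFECTIVE_STOCK / dataset_2024) / math.log(annual_growth)
print(f"data stock fully utilized around {2024 + years_left:.0f}")  # ~2027 under these assumptions
```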
However, the exact point in time at which this data would be fully utilized depends on how models are scaled. If models are trained compute-optimally, 3 there is enough data to train a model with 5e28 floating-point operations (FLOP), a level we expect to be reached in 2028 (see Figure 2). But recent models, like Llama 3, are often “overtrained” with fewer parameters and more data so that they are more compute-efficient during
... (truncated, 12 KB total)