Epoch AI, "Frontier LLM training runs can't get much longer" (https://epoch.ai/data-insights/longest-training-run)
Credibility Rating: 4/5
High (4): High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Epoch AI
Relevant to forecasting AI progress and understanding scaling limits; informs debates about whether frontier capabilities can continue growing via training compute scaling alone.
Metadata
Importance: 55/100 · Tags: organizational report, analysis
Summary
Epoch AI analyzes how long frontier AI training runs can usefully be extended, arguing that runs lasting longer than about nine months are likely to be outclassed by runs that start later and benefit from improved hardware and algorithms, with hardware reliability, data constraints, and optimization dynamics adding further ceilings. The analysis suggests that simply scaling training time is not a viable path for continued capability gains at the frontier.
Key Points
- Training runs that last too long are outclassed by runs that start later and benefit from additional hardware and algorithmic improvements; Epoch estimates runs longer than roughly 9 months become inefficient, a threshold frontier runs would reach around 2027.
- Frontier LLM training runs are approaching practical duration limits due to hardware failure rates and cluster reliability at scale (see the back-of-envelope sketch after this list).
- Data availability constrains training length, as high-quality internet text is increasingly exhausted for pretraining.
- Optimization dynamics mean returns from longer training diminish, limiting the gains from simply extending run duration.
- This finding has implications for AI scaling trajectories and suggests compute scaling must occur via other dimensions (model size, data quality, new architectures).
- The analysis uses empirical data on training run lengths and failure rates to bound realistic maximum training durations.
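The reliability constraint in the second point can be made concrete with rough arithmetic. A minimal back-of-envelope sketch in Python, where the cluster size and per-accelerator MTBF are purely illustrative assumptions (the preview does not include Epoch's underlying figures):

```python
# Back-of-envelope sketch of why cluster reliability bounds run length.
# All numbers here are hypothetical, chosen only to illustrate the scaling.

N_ACCELERATORS = 100_000     # assumed frontier-cluster size
MTBF_HOURS = 50_000          # assumed mean time between failures per accelerator
RUN_MONTHS = 9               # the duration threshold quoted in the summary

run_hours = RUN_MONTHS * 730                              # ~730 hours per month
hours_between_interrupts = MTBF_HOURS / N_ACCELERATORS    # cluster-level MTBF
expected_interruptions = run_hours / hours_between_interrupts

print(f"cluster-level interruption every ~{60 * hours_between_interrupts:.0f} min")
print(f"~{expected_interruptions:,.0f} interruptions over a {RUN_MONTHS}-month run")
```

With these assumed numbers the cluster is interrupted roughly every 30 minutes, so a nine-month run must survive on the order of ten thousand failures, and longer runs raise this burden proportionally.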
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Capability-Alignment Race Model | Analysis | 62.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 23, 2026 · 9 KB
Frontier LLM training runs can’t get much longer | Epoch AI
Epoch AI’s work is free to use, distribute, and reproduce provided the source and authors are credited under the Creative Commons Attribution license.
Cite this work as
Luke Emberson and Yafah Edelman (2025), "Frontier training runs will likely stop getting longer by around 2027". Published online at epoch.ai. Retrieved from: 'https://epoch.ai/data-insights/longest-training-run' [online resource]
BibTeX citation
@misc{epoch2025longesttrainingrun,
title={Frontier training runs will likely stop getting longer by around 2027},
author={Luke Emberson and Yafah Edelman},
year={2025},
url={https://epoch.ai/data-insights/longest-training-run},
note={Accessed: }
}
Data Insight
Frontier training runs will likely stop getting longer by around 2027
In “The Longest Training Run”, we argue that training runs that last too long are outclassed by training runs that start later and benefit from additional hardware and algorithmic improvements. Based on our latest numbers, this suggests that training runs lasting more than 9 months may be inefficient. At the current pace, training runs will reach this length around 2027 (90% CI: Aug 2025 to Sept 2029).
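The opportunity-cost argument above can be illustrated with a toy model. A minimal sketch, assuming a run must finish by a fixed deadline and that hardware price-performance and algorithmic efficiency together double every six months; both the functional form and the doubling time are illustrative assumptions, not figures taken from this page:

```python
import numpy as np

# Toy model (an assumption, not Epoch's published model): a run that starts at
# time s and trains until a fixed deadline T uses hardware and algorithms that
# have improved by a factor of 2**(s / TAU) relative to a run starting now.

TAU = 0.5   # assumed combined doubling time of hardware + algorithms, in years
T = 2.0     # assumed fixed finish date, in years from now

def effective_compute(start):
    """Deadline-constrained effective compute for a run starting at `start`."""
    duration = T - start                  # train from `start` until the deadline
    return duration * 2 ** (start / TAU)  # later starts benefit from progress

starts = np.linspace(0.0, T - 1e-3, 100_000)
best = starts[np.argmax(effective_compute(starts))]
print(f"optimal run length: {T - best:.2f} years (~{12 * (T - best):.1f} months)")
```

Maximizing analytically gives an optimal run length of TAU / ln 2 ≈ 1.44 × TAU, so an assumed six-month combined doubling time yields roughly 8.7 months, in the same ballpark as the nine-month threshold quoted above.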
[Interactive visualization not captured in this cached preview.]
Longer training runs are a significant driver of the rapid growth seen in training compute. If training time stops increasing, training compute growth will slow – unless developers ramp up hardware scaling even faster. This could be achieved by speeding up the build-out of larger clusters, or by spreading training across multiple clusters
... (truncated, 9 KB total)
Resource ID: 9d535d8e91127085 | Stable ID: MzRiNDYyNT