Trends in Machine Learning Hardware
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Epoch AI
A foundational empirical reference from Epoch AI quantifying hardware scaling trends relevant to understanding compute trajectories, training costs, and the feasibility of future large-scale AI systems.
Summary
Epoch AI analyzes performance trends across 47 ML accelerators (GPUs and AI chips) from 2010-2023, finding that computational performance doubles every 2.3 years, price-performance every 2.1 years, and energy efficiency every 3 years, while memory capacity lags behind (doubling every 4 years). The study also highlights how lower-precision formats (FP16, INT8) and tensor cores provide order-of-magnitude speedups over traditional FP32, and examines memory bandwidth and interconnect constraints.
Key Points
- Computational performance (FP32) doubles every 2.3 years for both ML and general GPUs, with price-performance doubling every 2.1 years.
- Lower-precision formats like tensor-FP16 and INT8 provide roughly 10x speedup over FP32, enabled by specialized tensor core hardware.
- Memory capacity and bandwidth lag significantly behind compute (doubling every ~4 years vs. 2.3 years for compute), a persistent "memory wall".
- Proprietary interconnects like NVLink offer 7x the bandwidth of PCIe 5.0, critical for scaling large multi-chip training clusters.
- Energy efficiency doubles every 3 years for ML GPUs, a key factor for the sustainability and economics of frontier AI training.
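The doubling times above can be restated as annual growth factors and orders of magnitude (OOMs) per year. A minimal sketch, using only the doubling times quoted in the key points; the helper names are illustrative, not from the report:

```python
import math

def annual_growth(doubling_years: float) -> float:
    """Multiplicative growth per year implied by a doubling time."""
    return 2 ** (1 / doubling_years)

def ooms_per_year(doubling_years: float) -> float:
    """Orders of magnitude gained per year: log10(2) / doubling time."""
    return math.log10(2) / doubling_years

# Doubling times (years) from the summary above.
for metric, t2 in [("compute (FP32)", 2.3),
                   ("price-performance", 2.1),
                   ("energy efficiency", 3.0),
                   ("memory capacity", 4.0)]:
    print(f"{metric}: x{annual_growth(t2):.2f}/yr, {ooms_per_year(t2):.3f} OOM/yr")
```

For example, a 2.3-year doubling time corresponds to roughly a 1.35x improvement per year, or about 0.13 OOMs per year.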
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Capability-Alignment Race Model | Analysis | 62.0 |
| AI-Driven Concentration of Power | Risk | 65.0 |
Cached Content Preview
## Executive summary
[Figure: peak computational performance (OP/s) of ML accelerators by number format (FP64, FP32, FP16, tensor-FP32/TF32, tensor-FP16, tensor-INT8, tensor-INT4), 2010-2024, with NVIDIA A100, Google TPU v4, and NVIDIA H100 SXM highlighted. Interactive chart omitted; CC-BY, epoch.ai.]
**Figure 1:** Peak computational performance of common ML accelerators at a given precision. New number formats have emerged since 2016. Trendlines are shown for number formats with eight or more accelerators: FP32, FP16 (FP = floating-point, tensor-\* = processed by a tensor core, TF = Nvidia tensor floating-point, INT = integer)
We study GPU performance across different number representations, memory capacity and bandwidth, and interconnect bandwidth, using a dataset of 47 ML accelerators (GPUs and other AI chips) commonly used in ML experiments from 2010-2023, plus 1,948 additional GPUs from 2006-2021. Our main findings are:
1. Lower-precision number formats like 16-bit floating point (FP16) and 8-bit integers (INT8), combined with specialized tensor core units, can provide order-of-magnitude performance improvements for machine learning workloads compared to the traditionally used 32-bit floating point (FP32). For example, we estimate, based on a limited amount of data, that tensor-FP16 provides roughly a 10x speedup over FP32.
2. Given that the overall performance of large hardware clusters for state-of-the-art ML model training and inference depends on factors beyond just computational performance, we investigate memory capacity, memory bandwidth and interconnects, and find that:
1. Memory capacity is doubling every ~4 years and memory bandwidth every ~4.1 years, slower than computational performance, which doubles every ~2.3 years. This gap is a common finding, often described as the _memory wall_.
2. The latest ML hardware often comes with proprietary chip-to-chip interconnects (Nvidia's NVLink or Google's TPU ICI) that offer higher communication bandwidth between chips than PCI Express (PCIe). For example, NVLink in the H100 supports 7x the bandwidth of PCIe 5.0.
3. Key hardware performance metrics and their improvement rates found in the analysis include: computational performance \[FLOP/s\] doubling every 2.3 years for both ML and general GPUs; computational price-performance \[FLOP per $\] doubling every 2.1 years for ML GPUs and 2.5 years for general GPUs; and energy efficiency \[FLOP/s per Watt\] doubling every 3.0 years for ML GPUs and 2.7 years for general GPUs.
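Finding 2 implies a widening gap between compute and memory. A quick sketch of the "memory wall" arithmetic, using only the doubling times stated above (the horizon and function name are illustrative):

```python
def growth_over(years: float, doubling_years: float) -> float:
    """Total multiplicative growth over a horizon given a doubling time."""
    return 2 ** (years / doubling_years)

horizon = 10  # years, chosen for illustration
compute = growth_over(horizon, 2.3)  # FLOP/s, doubling every ~2.3 years
memory = growth_over(horizon, 4.0)   # memory capacity, doubling every ~4 years
print(f"over {horizon} years: compute x{compute:.1f}, "
      f"memory x{memory:.1f}, ratio x{compute / memory:.1f}")
```

Over a decade, compute grows roughly 20x while memory capacity grows under 6x, so the compute-to-memory ratio worsens by about 3.6x, which is why memory bandwidth and capacity increasingly constrain large-model training and inference.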
| | Specification and unit | Growth rate (doubling time / 10x time / OOMs per year) | Datapoint of highest performance | N |
| --- | --- | --- | --- | --- |
| Computational Performance | FLOP/s (FP32) | 2x eve
... (truncated, 60 KB total)