Trends in Machine Learning Hardware
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Epoch AI
A foundational empirical reference from Epoch AI quantifying hardware scaling trends relevant to understanding compute trajectories, training costs, and the feasibility of future large-scale AI systems.
Summary
Epoch AI analyzes performance trends across 47 ML accelerators (GPUs and AI chips) from 2010-2023, finding that computational performance doubles every 2.3 years, price-performance every 2.1 years, and energy efficiency every 3 years, while memory capacity lags behind (doubling every 4 years). The study also highlights how lower-precision formats (FP16, INT8) and tensor cores provide order-of-magnitude speedups over traditional FP32, and examines memory bandwidth and interconnect constraints.
Key Points
- Computational performance (FP32) doubles every 2.3 years for both ML and general GPUs, with price-performance doubling every 2.1 years.
- Lower-precision formats like tensor-FP16 and INT8 provide roughly 10x speedup over FP32, enabled by specialized tensor core hardware.
- Memory capacity and bandwidth lag significantly behind compute (doubling every ~4 years vs. 2.3 years for compute), a persistent "memory wall".
- Proprietary interconnects like NVLink offer 7x the bandwidth of PCIe 5.0, critical for scaling large multi-chip training clusters.
- Energy efficiency doubles every 3 years for ML GPUs, a key factor for the sustainability and economics of frontier AI training.
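The doubling times above can be restated as annual growth factors and orders of magnitude (OOMs) per year. A minimal sketch, using only the doubling times quoted in the key points; the helper names are illustrative, not from the report:

```python
import math

def annual_growth(doubling_years: float) -> float:
    """Multiplicative growth per year implied by a doubling time."""
    return 2 ** (1 / doubling_years)

def ooms_per_year(doubling_years: float) -> float:
    """Orders of magnitude gained per year: log10(2) / doubling time."""
    return math.log10(2) / doubling_years

# Doubling times (years) from the summary above.
for metric, t2 in [("compute (FP32)", 2.3),
                   ("price-performance", 2.1),
                   ("energy efficiency", 3.0),
                   ("memory capacity", 4.0)]:
    print(f"{metric}: x{annual_growth(t2):.2f}/yr, {ooms_per_year(t2):.3f} OOM/yr")
```

For example, a 2.3-year doubling time corresponds to roughly a 1.35x improvement per year, or about 0.13 OOMs per year.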
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Capability-Alignment Race Model | Analysis | 62.0 |
| AI-Driven Concentration of Power | Risk | 65.0 |
Cached Content Preview
## Executive summary
[Figure: peak computational performance (OP/s) of ML accelerators by number format (FP64, FP32, FP16, tensor-FP32/TF32, tensor-FP16, tensor-INT8, tensor-INT4), 2010-2024, with NVIDIA A100, Google TPU v4, and NVIDIA H100 SXM highlighted. Interactive chart omitted; CC-BY, epoch.ai.]
**Figure 1:** Peak computational performance of common ML accelerators at a given precision. New number formats have emerged since 2016. Trendlines are shown for number formats with eight or more accelerators: FP32, FP16 (FP = floating-point, tensor-\* = processed by a tensor core, TF = Nvidia tensor floating-point, INT = integer)
We study GPU performance across different number representations, memory capacity and bandwidth, and interconnect bandwidth, using a dataset of 47 ML accelerators (GPUs and other AI chips) commonly used in ML experiments from 2010-2023, plus 1,948 additional GPUs from 2006-2021. Our main findings are:
1. Lower-precision number formats like 16-bit floating point (FP16) and 8-bit integers (INT8), combined with specialized tensor core units, can provide order-of-magnitude performance improvements for machine learning workloads compared to the traditionally used 32-bit floating point (FP32). For example, we estimate, based on a limited amount of data, that tensor-FP16 provides roughly a 10x speedup over FP32.
2. Given that the overall performance of large hardware clusters for state-of-the-art ML model training and inference depends on factors beyond just computational performance, we investigate memory capacity, memory bandwidth and interconnects, and find that:
1. Memory capacity is doubling every ~4 years and memory bandwidth every ~4.1 years, slower than computational performance, which doubles every ~2.3 years. This gap is a common finding, often described as the _memory wall_.
2. The latest ML hardware often comes with proprietary chip-to-chip interconnects (Nvidia's NVLink or Google's TPU ICI) that offer higher communication bandwidth between chips than PCI Express (PCIe). For example, NVLink in the H100 supports 7x the bandwidth of PCIe 5.0.
3. Key hardware performance metrics and their improvement rates found in the analysis include: computational performance \[FLOP/s\] doubling every 2.3 years for both ML and general GPUs; computational price-performance \[FLOP per $\] doubling every 2.1 years for ML GPUs and 2.5 years for general GPUs; and energy efficiency \[FLOP/s per Watt\] doubling every 3.0 years for ML GPUs and 2.7 years for general GPUs.
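Finding 2 implies a widening gap between compute and memory. A quick sketch of the "memory wall" arithmetic, using only the doubling times stated above (the horizon and function name are illustrative):

```python
def growth_over(years: float, doubling_years: float) -> float:
    """Total multiplicative growth over a horizon given a doubling time."""
    return 2 ** (years / doubling_years)

horizon = 10  # years, chosen for illustration
compute = growth_over(horizon, 2.3)  # FLOP/s, doubling every ~2.3 years
memory = growth_over(horizon, 4.0)   # memory capacity, doubling every ~4 years
print(f"over {horizon} years: compute x{compute:.1f}, "
      f"memory x{memory:.1f}, ratio x{compute / memory:.1f}")
```

Over a decade, compute grows roughly 20x while memory capacity grows under 6x, so the compute-to-memory ratio worsens by about 3.6x, which is why memory bandwidth and capacity increasingly constrain large-model training and inference.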
| | Specification and unit | Growth rate (doubling time / 10x time / OOMs per year) | Datapoint of highest performance | N |
| --- | --- | --- | --- | --- |
| Computational Performance | FLOP/s (FP32) | 2x eve
... (truncated, 60 KB total)