Longterm Wiki

Visibility into AI Chips

paper

Author

Yonadav Shavit (Harvard University)

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A foundational technical policy paper exploring how compute governance could be operationalized; highly relevant to ongoing debates about AI chip export controls and international AI oversight regimes.

Paper Details

Citations: 29 (2 influential)
Year: 2023

Metadata

Importance: 72/100 | arXiv preprint | primary source

Abstract

As advanced machine learning systems' capabilities begin to play a significant role in geopolitics and societal order, it may become imperative that (1) governments be able to enforce rules on the development of advanced ML systems within their borders, and (2) countries be able to verify each other's compliance with potential future international agreements on advanced ML development. This work analyzes one mechanism to achieve this, by monitoring the computing hardware used for large-scale NN training. The framework's primary goal is to provide governments high confidence that no actor uses large quantities of specialized ML chips to execute a training run in violation of agreed rules. At the same time, the system does not curtail the use of consumer computing devices, and maintains the privacy and confidentiality of ML practitioners' models, data, and hyperparameters. The system consists of interventions at three stages: (1) using on-chip firmware to occasionally save snapshots of the neural network weights stored in device memory, in a form that an inspector could later retrieve; (2) saving sufficient information about each training run to prove to inspectors the details of the training run that had resulted in the snapshotted weights; and (3) monitoring the chip supply chain to ensure that no actor can avoid discovery by amassing a large quantity of un-tracked chips. The proposed design decomposes the ML training rule verification problem into a series of narrow technical challenges, including a new variant of the Proof-of-Learning problem [Jia et al. '21].

Summary

This paper proposes a technical framework for governments to monitor and verify compliance with international agreements on large-scale AI training by tracking specialized ML chips. The system uses on-chip firmware for weight snapshots, training run documentation, and supply chain monitoring to provide high confidence of compliance while preserving model privacy. It decomposes the verification problem into narrow technical challenges including a new variant of Proof-of-Learning.
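
To make the first stage concrete, here is a minimal sketch of the weight-snapshot idea: firmware commits to the weights held in device memory (here via a hash), and an inspector later checks the weights the operator reveals against that commitment. The function names, hashing scheme, and metadata fields are illustrative assumptions, not the paper's specification.

```python
import hashlib
import json
import time

def snapshot_weights(weight_buffer: bytes, chip_id: str) -> dict:
    """Hypothetical firmware hook: commit to the weights currently held in
    device memory so an inspector can later check them against the
    operator's reported training run."""
    return {
        "chip_id": chip_id,
        "timestamp": time.time(),
        "weight_hash": hashlib.sha256(weight_buffer).hexdigest(),
    }

def verify_snapshot(snapshot: dict, revealed_weights: bytes) -> bool:
    """Inspector side: recompute the hash of the weights the operator
    reveals and compare it to the stored commitment."""
    return hashlib.sha256(revealed_weights).hexdigest() == snapshot["weight_hash"]

# A chip records a snapshot during training; the inspector later asks the
# operator to reveal the corresponding weights and checks the commitment.
weights = b"\x00" * 1024  # stand-in for serialized model weights
record = snapshot_weights(weights, chip_id="chip-0042")
print(json.dumps(record, indent=2))
assert verify_snapshot(record, weights)
```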

Key Points

  • Proposes three-stage compute monitoring system: on-chip firmware snapshots, training run documentation, and supply chain tracking for ML accelerators.
  • Distinguishes consumer GPUs from specialized ML chips (TPUs, A100s) to avoid restricting general computing while enabling targeted oversight.
  • Aims to preserve privacy of model weights, training data, and hyperparameters while still enabling inspector verification of compliance.
  • Introduces a new variant of the Proof-of-Learning problem as a core technical primitive for verifying training run authenticity (see the sketch after this list).
  • Directly addresses governance challenges of international AI agreements by providing a concrete technical verification mechanism.
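
Below is a minimal sketch of the spot-check intuition behind Proof-of-Learning-style verification referenced above: the operator logs checkpoints and the batches used between them, and an inspector re-executes a sampled segment to confirm it reproduces the next logged checkpoint. The toy SGD update, log layout, and exact-hash comparison are simplifying assumptions; a real protocol must tolerate hardware nondeterminism and vastly larger models.

```python
import hashlib
from typing import List

def checkpoint_hash(weights: List[float]) -> str:
    """Commitment to a logged checkpoint (toy: hash of the weights' repr)."""
    return hashlib.sha256(repr(weights).encode()).hexdigest()

def train_step(weights: List[float], batch: List[float], lr: float) -> List[float]:
    """Stand-in for one logged training update (a toy SGD step)."""
    return [w - lr * g for w, g in zip(weights, batch)]

def spot_check(log: List[dict], step: int, lr: float) -> bool:
    """Inspector re-executes the segment from `step` to `step + 1` using the
    logged weights and batch, then compares the result against the next
    logged checkpoint commitment."""
    entry, nxt = log[step], log[step + 1]
    recomputed = train_step(entry["weights"], entry["batch"], lr)
    return checkpoint_hash(recomputed) == nxt["hash"]

# The operator builds an honest training log; the inspector samples a step.
lr, weights = 0.1, [0.5, -0.2]
log = []
for batch in ([0.1, 0.3], [0.0, -0.4], [0.2, 0.2]):
    log.append({"weights": weights, "batch": batch, "hash": checkpoint_hash(weights)})
    weights = train_step(weights, batch, lr)
log.append({"weights": weights, "batch": None, "hash": checkpoint_hash(weights)})

assert spot_check(log, step=1, lr=lr)  # an honest log passes the spot check
```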

Cited by 2 pages

1 FactBase fact citing this source

Cached Content Preview

HTTP 200 | Fetched Mar 20, 2026 | 98 KB

# What does it take to catch a Chinchilla?   Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring

Yonadav Shavit

Harvard University

yonadav@g.harvard.edu

###### Abstract

As advanced machine learning systems’ capabilities begin to play a significant role in geopolitics and societal order, it may become imperative that (1) governments be able to enforce rules on the development of advanced ML systems within their borders, and (2) countries be able to verify each other’s compliance with potential future international agreements on advanced ML development.
This work analyzes one mechanism to achieve this, by monitoring the computing hardware used for large-scale NN training.
The framework’s primary goal is to provide governments high confidence that no actor uses large quantities of specialized ML chips to execute a training run in violation of agreed rules.
At the same time, the system does not curtail the use of consumer computing devices, and maintains the privacy and confidentiality of ML practitioners’ models, data, and hyperparameters.
The system consists of interventions at three stages:
(1) using on-chip firmware to occasionally save snapshots of the neural network weights stored in device memory, in a form that an inspector could later retrieve; (2) saving sufficient information about each training run to prove to inspectors the details of the training run that had resulted in the snapshotted weights; and (3) monitoring the chip supply chain to ensure that no actor can avoid discovery by amassing a large quantity of un-tracked chips.
The proposed design decomposes the ML training rule verification problem into a series of narrow technical challenges, including a new variant of the Proof-of-Learning problem \[Jia et al. ’21\].

## 1 Introduction

Many of the remarkable advances of the past 5 years in deep learning have been driven by a continuous increase in the quantity of _training compute_ used to develop cutting-edge models \[ [25](https://ar5iv.labs.arxiv.org/html/2303.11341#bib.bibx25 ""), [21](https://ar5iv.labs.arxiv.org/html/2303.11341#bib.bibx21 ""), [54](https://ar5iv.labs.arxiv.org/html/2303.11341#bib.bibx54 "")\].
Such large-scale training has been made possible through the concurrent use of hundreds or thousands of specialized accelerators with high inter-chip communication bandwidth (such as Google TPUs, NVIDIA A100 and H100 GPUs, or AMD MI250 GPUs), employed for a span of weeks or months to compute thousands or millions of gradient updates.
We refer to these specialized accelerators as _ML chips_, which we distinguish from consumer-oriented GPUs with lower interconnect bandwidth (e.g., the NVIDIA RTX 4090, used in gaming computers).

This compute scaling trend has yielded models with ever more useful capabilities.
However, these advanced capabil

... (truncated, 98 KB total)
Resource ID: 2e8fad2698fb965b | Stable ID: NjcyMzUyYW