Underspecification in Machine Learning
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A highly influential paper from Google Brain demonstrating that ML reliability problems often stem from underspecification rather than overfitting, directly relevant to AI safety concerns about model predictability and deployment robustness.
Paper Details
Metadata
Abstract
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
Summary
This paper introduces 'underspecification' as a fundamental problem in ML pipelines: many different models with equivalent training performance can behave very differently in deployment. The authors demonstrate that standard ML training procedures return a set of equally good models, with no mechanism to prefer more robust or reliable ones, leading to unpredictable real-world failures.
Key Points
- ML pipelines are underspecified when training performance cannot distinguish between models that behave differently under deployment conditions.
- Underspecification is widespread across domains including medical imaging, NLP, and genomics, causing silent failures in real-world settings.
- Models that perform equivalently on held-out test sets can differ dramatically on stress tests, distribution shifts, and fairness metrics.
- Standard practices like cross-validation and hyperparameter tuning do not resolve underspecification and may obscure it.
- Addressing underspecification requires domain-specific inductive biases, stress testing, and explicit robustness criteria during training.
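The core phenomenon in the key points above can be illustrated with a minimal sketch (a hypothetical toy setup, not one of the paper's actual experiments): two linear predictors that are near-equivalent on the training domain, because one relies on the true feature and the other on a spuriously correlated feature, diverge sharply once the spurious correlation breaks at deployment.

```python
# Toy illustration of underspecification (hypothetical setup): predictors
# with equivalent held-out accuracy in the training domain can behave very
# differently under distribution shift.
import numpy as np

rng = np.random.default_rng(0)

# Training domain: the label depends on x0; x1 is spuriously correlated with x0.
n = 2000
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(scale=0.1, size=n)  # nearly redundant with x0
y = (x0 > 0).astype(int)
X = np.column_stack([x0, x1])

# Two linear predictors the training data cannot distinguish:
# predictor A relies on the causal feature x0, predictor B on the spurious x1.
w_a = np.array([1.0, 0.0])
w_b = np.array([0.0, 1.0])

def acc(w, X, y):
    """Accuracy of the linear predictor sign(X @ w) against labels y."""
    return float(np.mean((X @ w > 0).astype(int) == y))

train_acc_a, train_acc_b = acc(w_a, X, y), acc(w_b, X, y)

# Deployment domain: the spurious correlation breaks (x1 becomes pure noise).
x1_shift = rng.normal(size=n)
X_shift = np.column_stack([x0, x1_shift])
shift_acc_a, shift_acc_b = acc(w_a, X_shift, y), acc(w_b, X_shift, y)

print(f"train:  A={train_acc_a:.3f}  B={train_acc_b:.3f}")  # both near 1.0
print(f"shift:  A={shift_acc_a:.3f}  B={shift_acc_b:.3f}")  # B collapses to ~0.5
```

Training-domain accuracy alone gives no reason to prefer A over B; only a stress test that perturbs the spurious feature reveals the difference, which is the paper's argument for explicit robustness criteria in the pipeline.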
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Long-Timelines Technical Worldview | Concept | 91.0 |
Cached Content Preview
# Underspecification Presents Challenges for Credibility in Modern Machine Learning
Alexander D'Amour (alexdamour@google.com)
Katherine Heller† (kheller@google.com)
Dan Moldovan† (mdan@google.com)
Ben Adlam (adlam@google.com)
Babak Alipanahi (babaka@google.com)
Alex Beutel (alexbeutel@google.com)
Christina Chen (christinium@google.com)
Jonathan Deaton (jdeaton@google.com)
Jacob Eisenstein (jeisenstein@google.com)
Matthew D. Hoffman (mhoffman@google.com)
Farhad Hormozdiari (fhormoz@google.com)
Neil Houlsby (neilhoulsby@google.com)
Shaobo Hou (shaobohou@google.com)
Ghassen Jerfel (ghassen@google.com)
Alan Karthikesalingam (alankarthi@google.com)
Mario Lucic (lucic@google.com)
Yian Ma (yianma@ucsd.edu)
Cory McLean (cym@google.com)
Diana Mincu (dmincu@google.com)
Akinori Mitani (amitani@google.com)
Andrea Montanari (montanari@stanford.edu)
Zachary Nado (znado@google.com)
Vivek Natarajan (natviv@google.com)
Christopher Nielson (christopher.nielson@va.gov)
Thomas F. Osborne‡ (thomas.osborne@va.gov)
Rajiv Raman (drrrn@snmail.org)
Kim Ramasamy (kim@aravind.org)
Rory Sayres (sayres@google.com)
Jessica Schrouff (schrouff@google.com)
Martin Seneviratne (martsen@google.com)
Shannon Sequeira (shnnn@google.com)
Harini Suresh (hsuresh@mit.edu)
Victor Veitch (victorveitch@google.com)
Max Vladymyrov (mxv@google.com)
Xuezhi Wang (xuezhiw@google.com)
Kellie Webster (websterk@google.com)
Steve Yadlowsky (yadlowsky@google.com)
Taedong Yun (tedyun@google.com)
Xiaohua Zhai (xzhai@google.com)
D. Sculley (dsculley@google.com)

† These authors contributed equally to this work.
‡ This paper represents the views of the authors, and not of the VA.
... (truncated, 98 KB total)