Longterm Wiki

realistic OOD benchmarks (2024)

paper

Authors

Pietro Recalcati · Fabio Garcea · Luca Piano · Fabrizio Lamberti · Lia Morra

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Relevant to AI safety for understanding how reliably neural networks can flag unfamiliar inputs; better OOD benchmarks improve confidence in deployed model robustness evaluations.

Paper Details

Citations: 1 (0 influential)
Year: 2023

Metadata

Importance: 52/100 · arXiv preprint · primary source

Abstract

Deep neural networks are increasingly used in a wide range of technologies and services, but remain highly susceptible to out-of-distribution (OOD) samples, that is, drawn from a different distribution than the original training set. A common approach to address this issue is to endow deep neural networks with the ability to detect OOD samples. Several benchmarks have been proposed to design and validate OOD detection techniques. However, many of them are based on far-OOD samples drawn from very different distributions, and thus lack the complexity needed to capture the nuances of real-world scenarios. In this work, we introduce a comprehensive benchmark for OOD detection, based on ImageNet and Places365, that assigns individual classes as in-distribution or out-of-distribution depending on the semantic similarity with the training set. Several techniques can be used to determine which classes should be considered in-distribution, yielding benchmarks with varying properties. Experimental results on different OOD detection techniques show how their measured efficacy depends on the selected benchmark and how confidence-based techniques may outperform classifier-based ones on near-OOD samples.

Summary

This paper critiques existing OOD detection benchmarks for relying on far-OOD samples that are too easily distinguishable, and introduces a more realistic benchmark using ImageNet and Places365 where in-distribution vs. OOD classes are assigned based on semantic similarity. Experimental results reveal that measured performance of OOD detection methods varies significantly by benchmark choice, and that confidence-based methods can outperform classifier-based approaches in near-OOD settings.
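The benchmark-construction idea described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the class names, 3-dimensional "semantic" embedding vectors, centroid, and threshold are all made up for the example, and real embeddings (e.g. from WordNet or a text encoder) would be much higher-dimensional.

```python
import numpy as np

def partition_classes(class_embeddings, train_centroid, id_threshold):
    """Assign each class as in-distribution (ID) or out-of-distribution (OOD)
    by cosine similarity to a training-set centroid, in the spirit of the
    paper's semantic-similarity-based benchmark construction."""
    id_classes, ood_classes = [], []
    c = train_centroid / np.linalg.norm(train_centroid)
    for name, emb in class_embeddings.items():
        sim = float(np.dot(emb / np.linalg.norm(emb), c))
        (id_classes if sim >= id_threshold else ood_classes).append((name, sim))
    return id_classes, ood_classes

# Toy example: two semantically close classes and one distant class.
classes = {
    "tabby_cat":   np.array([0.90, 0.10, 0.00]),
    "siamese_cat": np.array([0.85, 0.20, 0.10]),
    "volcano":     np.array([0.00, 0.10, 0.95]),
}
centroid = np.array([1.0, 0.15, 0.05])   # made-up centroid of training classes
id_classes, ood_classes = partition_classes(classes, centroid, id_threshold=0.8)
# The cat classes land in-distribution; "volcano" becomes a (far-)OOD class.
```

Varying `id_threshold`, or the embedding space itself, yields benchmarks with different ID/OOD splits, which is how the paper obtains near-OOD classes rather than only drastically different ones.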

Key Points

  • Existing OOD benchmarks often use far-OOD samples from drastically different distributions, failing to capture real-world complexity and near-OOD nuances.
  • The proposed benchmark assigns ImageNet and Places365 classes as in- or out-of-distribution based on semantic similarity to the training set, enabling near-OOD evaluation.
  • OOD detection method rankings shift substantially depending on which benchmark is used, highlighting risks of over-relying on a single benchmark for model selection.
  • Confidence-based methods can outperform classifier-based approaches on near-OOD samples, contrary to common assumptions in the field.
  • More realistic benchmarks are critical for ensuring deployed models can reliably detect distribution shift in safety-relevant applications.
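The "confidence-based" family referred to in the points above includes scores such as the maximum softmax probability (MSP) baseline; a minimal sketch follows. The logit arrays and the 0.5 threshold are made up for illustration, and a deployed detector would calibrate the threshold on held-out validation data.

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability: a simple confidence-based OOD score.
    A low MSP suggests the input may be out-of-distribution."""
    z = logits - logits.max()            # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs.max())

def is_ood(logits, threshold=0.5):
    """Flag an input as OOD when the model's top-class confidence is low."""
    return msp_score(logits) < threshold

# Made-up logits: one confident prediction, one near-uniform (uncertain) one.
confident = np.array([8.0, 1.0, 0.5, 0.2])
uncertain = np.array([1.1, 1.0, 0.9, 1.0])
```

Such scores need no OOD training data, which is one reason they remain competitive on near-OOD samples where classifier-based detectors struggle.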

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| AI Distributional Shift | Risk | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 65 KB
# Toward a Realistic Benchmark for Out-of-Distribution Detection

Pietro Recalcati, Fabio Garcea, Luca Piano, Fabrizio Lamberti, Lia Morra
Department of Control and Computer Engineering

Politecnico di Torino

Torino, Italy

{name.surname}@polito.it

###### Abstract

Deep neural networks are increasingly used in a wide range of technologies and services, but remain highly susceptible to out-of-distribution (OOD) samples, that is, drawn from a different distribution than the original training set. A common approach to address this issue is to endow deep neural networks with the ability to detect OOD samples. Several benchmarks have been proposed to design and validate OOD detection techniques. However, many of them are based on far-OOD samples drawn from very different distributions, and thus lack the complexity needed to capture the nuances of real-world scenarios. In this work, we introduce a comprehensive benchmark for OOD detection, based on ImageNet and Places365, that assigns individual classes as in-distribution or out-of-distribution depending on the semantic similarity with the training set. Several techniques can be used to determine which classes should be considered in-distribution, yielding benchmarks with varying properties. Experimental results on different OOD detection techniques show how their measured efficacy depends on the selected benchmark and how confidence-based techniques may outperform classifier-based ones on near-OOD samples.

###### Index Terms:

Out-of-Distribution Detection, Deep Learning, Convolutional Neural Networks, Open-World recognition

## I Introduction

Deep convolutional networks (CNNs) are powerful classifiers when tested on in-distribution (ID) images sampled from the same distribution the network was trained on. However, being trained under a closed-world assumption, they may fail by producing overconfident and wrong results when faced with out-of-distribution (OOD) samples, such as images belonging to classes previously unseen by the model. There is a strong interest in making CNN classifiers more robust by endowing them with the capability to separate samples drawn from a given distribution (also known as inliers, in-distribution or ID samples) from the others (also denoted as outliers, out-of-distribution, OOD, anomalies, novelties, or out-of-domain samples) \[ [1](https://ar5iv.labs.arxiv.org/html/2404.10474#bib.bibx1 ""), [2](https://ar5iv.labs.arxiv.org/html/2404.10474#bib.bibx2 ""), [3](https://ar5iv.labs.arxiv.org/html/2404.10474#bib.bibx3 ""), [4](https://ar5iv.labs.arxiv.org/html/2404.10474#bib.bibx4 "")\].

As a motivating example, let us consider the automatic tagging of images from social media platforms such as Facebook or Instagram, with applications in social sciences \[ [5](https://ar5iv.labs.arxiv.org/html/2404.10474#bib.bibx5 "")\], digital humanities \[ [6](https://ar5iv.labs.arxiv.org/html/2404.10474#bib.bibx6 ""), [7](https://ar5iv.labs.arxiv.org/html/2404.10474#bib.bibx7 "")\], ma

... (truncated, 65 KB total)