Emergent capability detection
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Introduces DataComp, a large-scale benchmark for dataset design built on a candidate pool of 12.8B image-text pairs, addressing how dataset curation shapes model capabilities and safety. Relevant for understanding emergent abilities and the role of data in AI system behavior.
Paper Details
Metadata
Abstract
Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
Summary
DataComp is a new benchmark testbed for dataset design and curation in multimodal machine learning, addressing the lack of research attention on datasets compared to model architectures. The benchmark provides a 12.8 billion image-text pair candidate pool from Common Crawl and enables researchers to design filtering techniques or curate data sources, then evaluate results using standardized CLIP training across 38 downstream tasks. Spanning four orders of magnitude in compute scales, DataComp makes dataset research accessible to researchers with varying resources. The authors demonstrate that their best baseline (DataComp-1B) achieves 79.2% zero-shot ImageNet accuracy with CLIP ViT-L/14, outperforming OpenAI's CLIP by 3.7 percentage points using identical training procedures.
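The workflow the summary describes can be pictured concretely. Below is a minimal sketch, assuming the open_clip library, of a CLIP-score filter of the kind DataComp uses as a filtering baseline: a candidate image-text pair is kept only if the cosine similarity between its CLIP image and text embeddings clears a threshold. The checkpoint and threshold here are illustrative assumptions, not values from the paper.

```python
import torch
import open_clip
from PIL import Image

# Illustrative scoring model; the paper's baselines score pairs with CLIP
# models, but this particular checkpoint is an assumption.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0))
        txt = model.encode_text(tokenizer([caption]))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).item()

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    # The threshold is a made-up example; in the benchmark, a filter is
    # judged only by the downstream accuracy of the model trained on the
    # data it retains, not by the heuristic itself.
    return clip_score(image, caption) >= threshold
```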
Cited by 1 page
| Page | Type | Quality score |
|---|---|---|
| AI Evaluation | Approach | 72.0 |
Cached Content Preview
\*Equal contribution, randomly ordered. Correspondence to contact@datacomp.ai.
1University of Washington
2Columbia University
3Tel Aviv University
4Apple
5UT Austin
6LAION
7AI2
8Juelich Supercomputing Center, Research Center Juelich
9University of Illinois Urbana-Champaign
10Graz University of Technology
11Hebrew University
12Google Research
13Snorkel AI
# DataComp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre\*2, Gabriel Ilharco\*1, Alex Fang\*1, Jonathan Hayase1,
Georgios Smyrnis5, Thao Nguyen1, Ryan Marten7,9, Mitchell Wortsman1,
Dhruba Ghosh1, Jieyu Zhang1, Eyal Orgad3, Rahim Entezari10, Giannis Daras5,
Sarah Pratt1, Vivek Ramanujan1, Yonatan Bitton11, Kalyani Marathe1,
Stephen Mussmann1, Richard Vencu6, Mehdi Cherti6,8, Ranjay Krishna1,
Pang Wei Koh1,12, Olga Saukh10, Alexander Ratner1,13, Shuran Song2,
Hannaneh Hajishirzi1,7, Ali Farhadi1, Romain Beaumont6,
Sewoong Oh1, Alex Dimakis5, Jenia Jitsev6,8,
Yair Carmon3, Vaishaal Shankar4, Ludwig Schmidt1,6,7
###### Abstract
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms.
To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl.
Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets.
Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources.
Our baseline experiments show that the DataComp workflow leads to better training sets.
Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI’s CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute.
We release DataComp and all accompanying code at [www.datacomp.ai](https://www.datacomp.ai).
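As context for the ImageNet number above, the following is a minimal sketch of zero-shot classification, the standard CLIP evaluation protocol: class names are embedded as text prompts, and each image is assigned the class whose prompt embedding is nearest. The prompt template and open_clip usage are assumptions for illustration; the released DataComp code is the authoritative implementation.

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

def build_zero_shot_classifier(class_names: list[str]) -> torch.Tensor:
    """Embed one prompt per class; returns a (num_classes, dim) matrix."""
    prompts = [f"a photo of a {name}" for name in class_names]  # assumed template
    with torch.no_grad():
        txt = model.encode_text(tokenizer(prompts))
        return txt / txt.norm(dim=-1, keepdim=True)

def predict(image, classifier: torch.Tensor) -> int:
    """Index of the class whose prompt best matches the image."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0))
        img = img / img.norm(dim=-1, keepdim=True)
        return int((img @ classifier.T).argmax(dim=-1))
```

Zero-shot here means the classifier is built purely from text, with no ImageNet training images, which is why the 79.2% figure measures the quality of the pretraining data rather than task-specific fine-tuning.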
### 1 Introduction
Recent advances in multimodal learning such as CLIP [111], DALL-E [115, 116], Stable Diffusion [123], Flamingo [8], and GPT-4 [103] offer unprecedented generalization capabilities
... (truncated, 98 KB total)