Emergent capability detection
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Introduces DataComp, a large-scale benchmark for dataset design built on a candidate pool of 12.8B image-text pairs, addressing how dataset curation shapes model capabilities and safety. Relevant for understanding emergent abilities and the role of data in AI system behavior.
Paper Details
Metadata
Abstract
Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
Summary
DataComp is a new benchmark testbed for dataset design and curation in multimodal machine learning, addressing the lack of research attention on datasets compared to model architectures. The benchmark provides a 12.8 billion image-text pair candidate pool from Common Crawl and enables researchers to design filtering techniques or curate data sources, then evaluate results using standardized CLIP training across 38 downstream tasks. Spanning four orders of magnitude in compute scales, DataComp makes dataset research accessible to researchers with varying resources. The authors demonstrate that their best baseline (DataComp-1B) achieves 79.2% zero-shot ImageNet accuracy with CLIP ViT-L/14, outperforming OpenAI's CLIP by 3.7 percentage points using identical training procedures.
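The workflow the summary describes can be pictured concretely. Below is a minimal sketch, assuming the open_clip library, of a CLIP-score filter of the kind DataComp uses as a filtering baseline: a candidate image-text pair is kept only if the cosine similarity between its CLIP image and text embeddings clears a threshold. The checkpoint and threshold here are illustrative assumptions, not values from the paper.

```python
import torch
import open_clip
from PIL import Image

# Illustrative scoring model; the paper's baselines score pairs with CLIP
# models, but this particular checkpoint is an assumption.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0))
        txt = model.encode_text(tokenizer([caption]))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).item()

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    # The threshold is a made-up example; in the benchmark, a filter is
    # judged only by the downstream accuracy of the model trained on the
    # data it retains, not by the heuristic itself.
    return clip_score(image, caption) >= threshold
```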
Cited by 1 page
| Page | Type | Quality score |
|---|---|---|
| AI Evaluation | Approach | 72.0 |
Cached Content Preview
\*Equal contribution, randomly ordered. Correspondence to contact@datacomp.ai.
1University of Washington
2Columbia University
3Tel Aviv University
4Apple
5UT Austin
6LAION
7AI2
8Juelich Supercomputing Center, Research Center Juelich
9University of Illinois Urbana-Champaign
10Graz University of Technology
11Hebrew University
12Google Research
13Snorkel AI
# DataComp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre\*2, Gabriel Ilharco\*1, Alex Fang\*1, Jonathan Hayase1,
Georgios Smyrnis5, Thao Nguyen1, Ryan Marten7,9, Mitchell Wortsman1,
Dhruba Ghosh1, Jieyu Zhang1, Eyal Orgad3, Rahim Entezari10, Giannis Daras5,
Sarah Pratt1, Vivek Ramanujan1, Yonatan Bitton11, Kalyani Marathe1,
Stephen Mussmann1, Richard Vencu6, Mehdi Cherti6,8, Ranjay Krishna1,
Pang Wei Koh1,12, Olga Saukh10, Alexander Ratner1,13, Shuran Song2,
Hannaneh Hajishirzi1,7, Ali Farhadi1, Romain Beaumont6,
Sewoong Oh1, Alex Dimakis5, Jenia Jitsev6,8,
Yair Carmon3, Vaishaal Shankar4, Ludwig Schmidt1,6,7
###### Abstract
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms.
To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl.
Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets.
Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources.
Our baseline experiments show that the DataComp workflow leads to better training sets.
Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI’s CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute.
We release DataComp and all accompanying code at [www.datacomp.ai](https://www.datacomp.ai).
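As context for the ImageNet number above, the following is a minimal sketch of zero-shot classification, the standard CLIP evaluation protocol: class names are embedded as text prompts, and each image is assigned the class whose prompt embedding is nearest. The prompt template and open_clip usage are assumptions for illustration; the released DataComp code is the authoritative implementation.

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

def build_zero_shot_classifier(class_names: list[str]) -> torch.Tensor:
    """Embed one prompt per class; returns a (num_classes, dim) matrix."""
    prompts = [f"a photo of a {name}" for name in class_names]  # assumed template
    with torch.no_grad():
        txt = model.encode_text(tokenizer(prompts))
        return txt / txt.norm(dim=-1, keepdim=True)

def predict(image, classifier: torch.Tensor) -> int:
    """Index of the class whose prompt best matches the image."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0))
        img = img / img.norm(dim=-1, keepdim=True)
        return int((img @ classifier.T).argmax(dim=-1))
```

Zero-shot here means the classifier is built purely from text, with no ImageNet training images, which is why the 79.2% figure measures the quality of the pretraining data rather than task-specific fine-tuning.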
### 1 Introduction
Recent advances in multimodal learning such as CLIP [111], DALL-E [115, 116], Stable Diffusion [123], Flamingo [8], and GPT-4 [103] offer unprecedented generalization capabilities
... (truncated, 98 KB total)