
Dynabench: Rethinking Benchmarking in NLP

paper

Authors

Douwe Kiela·Max Bartolo·Yixin Nie·Divyansh Kaushik·Atticus Geiger·Zhengxuan Wu·Bertie Vidgen·Grusha Prasad·Amanpreet Singh·Pratik Ringshia·Zhiyi Ma·Tristan Thrush·Sebastian Riedel·Zeerak Waseem·Pontus Stenetorp·Robin Jia·Mohit Bansal·Christopher Potts·Adina Williams

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Relevant to AI safety discussions around evaluating whether models are truly capable or exploiting benchmark artifacts; dynamic adversarial benchmarking is a method for stress-testing model robustness and identifying capability gaps.

Paper Details

Citations: 0 (34 influential)
Year: 2021

Metadata

Importance: 62/100 · arXiv preprint · primary source

Abstract

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.

Summary

Kiela et al. (2021) introduce Dynabench, an open-source platform for dynamic, adversarial benchmark creation using human-and-model-in-the-loop annotation, in which annotators craft examples that fool a target model but that other humans still classify correctly. The platform addresses benchmark saturation, where models achieve superhuman performance on static benchmarks yet fail on simple adversarial examples and real-world tasks, by creating a continuous feedback loop between dataset creation, model development, and evaluation.
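To make the loop concrete, here is a minimal sketch of one collection round in the spirit of the platform: a human proposes an example, the target model predicts a label, and the example is kept only if the model gets it wrong while other humans agree on the intended label. All names here (`Example`, `collect_round`, `predict`, `get_annotator_example`, `human_validators_agree`) are illustrative placeholders, not the platform's actual API.

```python
# Sketch of one human-and-model-in-the-loop collection round (hypothetical names,
# not Dynabench's real interface).

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    text: str
    gold_label: str          # label the annotator intends
    model_label: str = ""    # what the target model predicted
    fooled_model: bool = False


def collect_round(
    predict: Callable[[str], str],                       # target model in the loop
    get_annotator_example: Callable[[], Example],        # human writes a candidate
    human_validators_agree: Callable[[Example], bool],   # other humans confirm the label
    budget: int = 100,
) -> List[Example]:
    """Keep only examples that fool the model but that humans label consistently."""
    kept: List[Example] = []
    for _ in range(budget):
        ex = get_annotator_example()
        ex.model_label = predict(ex.text)
        ex.fooled_model = ex.model_label != ex.gold_label
        # Verified adversarial example: model is wrong, humans agree on the answer.
        if ex.fooled_model and human_validators_agree(ex):
            kept.append(ex)
    return kept
```

In successive rounds, the kept examples can be folded into training data for a stronger target model, so each new round is collected against a harder adversary, which is the feedback loop the summary describes.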

Key Points

  • Introduces Dynabench, a web-based platform enabling adversarial dataset creation where humans try to fool models, producing harder and more informative benchmarks.
  • Addresses benchmark saturation: NLP models routinely surpass human-level performance on static benchmarks but fail on simple challenge examples and real-world scenarios.
  • Static benchmarks often contain statistical and social biases that make them artificially easy and misaligned with true capability goals.
  • Demonstrates dynamic benchmarking on four NLP tasks, showing the approach yields more robust evaluation and continuous improvement loops.
  • Proposes dynamic benchmarking as a new community standard, directly coupling dataset creation, model training, and model assessment.

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| AI-Human Hybrid Systems | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 83 KB
# Dynabench: Rethinking Benchmarking in NLP

Douwe Kiela†, Max Bartolo‡, Yixin Nie⋆, Divyansh Kaushik§, Atticus Geiger¶, Zhengxuan Wu¶, Bertie Vidgen∥, Grusha Prasad⋆⋆, Amanpreet Singh†, Pratik Ringshia†, Zhiyi Ma†, Tristan Thrush†, Sebastian Riedel†‡, Zeerak Waseem††, Pontus Stenetorp‡, Robin Jia†, Mohit Bansal⋆, Christopher Potts¶ and Adina Williams†

† Facebook AI Research; ‡ UCL; ⋆ UNC Chapel Hill; § CMU; ¶ Stanford University

∥ Alan Turing Institute; ⋆⋆ JHU; †† Simon Fraser University

dynabench@fb.com

###### Abstract

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios.
With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.

## 1 Introduction

While it used to take decades for machine learning models to surpass estimates of human performance on benchmark tasks, that milestone is now routinely reached within just a few years for newer datasets (see Figure [1](https://ar5iv.labs.arxiv.org/html/2104.14337#S1.F1 "Figure 1 ‣ 1 Introduction ‣ Dynabench: Rethinking Benchmarking in NLP")). As with the rest of AI, NLP has advanced rapidly thanks to improvements in computational power, as well as algorithmic breakthroughs, ranging from attention mechanisms Bahdanau et al. ([2014](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib3 "")); Luong et al. ([2015](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib43 "")), to Transformers Vaswani et al. ([2017](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib86 "")), to pre-trained language models Howard and Ruder ([2018](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib29 "")); Devlin et al. ([2019](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib14 "")); Liu et al. ([2019b](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib42 "")); Radford et al. ([2019](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib62 "")); Brown et al. ([2020](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib8 "")). Equally important has been the rise of benchmarks that support the development of ambitious new data-driven models and that encourage apples-to-apples model comparisons. Benchmarks provide a north star goal for researchers, and are part of the reason we can confidently say we have made great st

... (truncated, 83 KB total)