
Dynabench: Rethinking Benchmarking in NLP

paper

Authors

Douwe Kiela·Max Bartolo·Yixin Nie·Divyansh Kaushik·Atticus Geiger·Zhengxuan Wu·Bertie Vidgen·Grusha Prasad·Amanpreet Singh·Pratik Ringshia·Zhiyi Ma·Tristan Thrush·Sebastian Riedel·Zeerak Waseem·Pontus Stenetorp·Robin Jia·Mohit Bansal·Christopher Potts·Adina Williams

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Relevant to AI safety discussions around evaluating whether models are truly capable or exploiting benchmark artifacts; dynamic adversarial benchmarking is a method for stress-testing model robustness and identifying capability gaps.

Paper Details

Citations: 0 (34 influential)
Year: 2021

Metadata

Importance: 62/100 · arXiv preprint · primary source

Abstract

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.

Summary

Kiela et al. (2021) introduce Dynabench, an open-source platform for dynamic, adversarial benchmark creation using human-and-model-in-the-loop annotation, in which annotators craft examples that fool a target model but that other humans still classify correctly. The platform addresses benchmark saturation, where models achieve superhuman performance on static benchmarks yet fail on simple adversarial examples and real-world tasks, by creating a continuous feedback loop between dataset creation, model development, and evaluation.
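To make the loop concrete, here is a minimal sketch of one collection round in the spirit of the platform: a human proposes an example, the target model predicts a label, and the example is kept only if the model gets it wrong while other humans agree on the intended label. All names here (`Example`, `collect_round`, `predict`, `get_annotator_example`, `human_validators_agree`) are illustrative placeholders, not the platform's actual API.

```python
# Sketch of one human-and-model-in-the-loop collection round (hypothetical names,
# not Dynabench's real interface).

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    text: str
    gold_label: str          # label the annotator intends
    model_label: str = ""    # what the target model predicted
    fooled_model: bool = False


def collect_round(
    predict: Callable[[str], str],                       # target model in the loop
    get_annotator_example: Callable[[], Example],        # human writes a candidate
    human_validators_agree: Callable[[Example], bool],   # other humans confirm the label
    budget: int = 100,
) -> List[Example]:
    """Keep only examples that fool the model but that humans label consistently."""
    kept: List[Example] = []
    for _ in range(budget):
        ex = get_annotator_example()
        ex.model_label = predict(ex.text)
        ex.fooled_model = ex.model_label != ex.gold_label
        # Verified adversarial example: model is wrong, humans agree on the answer.
        if ex.fooled_model and human_validators_agree(ex):
            kept.append(ex)
    return kept
```

In successive rounds, the kept examples can be folded into training data for a stronger target model, so each new round is collected against a harder adversary, which is the feedback loop the summary describes.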

Key Points

  • Introduces Dynabench, a web-based platform enabling adversarial dataset creation where humans try to fool models, producing harder and more informative benchmarks.
  • Addresses benchmark saturation: NLP models routinely surpass human-level performance on static benchmarks but fail on simple challenge examples and real-world scenarios.
  • Static benchmarks often contain statistical and social biases that make them artificially easy and misaligned with true capability goals.
  • Demonstrates dynamic benchmarking on four NLP tasks, showing the approach yields more robust evaluation and continuous improvement loops.
  • Proposes dynamic benchmarking as a new community standard, directly coupling dataset creation, model training, and model assessment.

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| AI-Human Hybrid Systems | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 83 KB
# Dynabench: Rethinking Benchmarking in NLP

Douwe Kiela†, Max Bartolo‡, Yixin Nie⋆, Divyansh Kaushik§, Atticus Geiger¶, Zhengxuan Wu¶, Bertie Vidgen∥, Grusha Prasad⋆⋆, Amanpreet Singh†, Pratik Ringshia†, Zhiyi Ma†, Tristan Thrush†, Sebastian Riedel†‡, Zeerak Waseem††, Pontus Stenetorp‡, Robin Jia†, Mohit Bansal⋆, Christopher Potts¶ and Adina Williams†

† Facebook AI Research; ‡ UCL; ⋆ UNC Chapel Hill; § CMU; ¶ Stanford University

∥ Alan Turing Institute; ⋆⋆ JHU; †† Simon Fraser University

dynabench@fb.com

###### Abstract

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios.
With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.

## 1 Introduction

While it used to take decades for machine learning models to surpass estimates of human performance on benchmark tasks, that milestone is now routinely reached within just a few years for newer datasets (see Figure [1](https://ar5iv.labs.arxiv.org/html/2104.14337#S1.F1 "Figure 1 ‣ 1 Introduction ‣ Dynabench: Rethinking Benchmarking in NLP")). As with the rest of AI, NLP has advanced rapidly thanks to improvements in computational power, as well as algorithmic breakthroughs, ranging from attention mechanisms Bahdanau et al. ([2014](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib3 "")); Luong et al. ([2015](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib43 "")), to Transformers Vaswani et al. ([2017](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib86 "")), to pre-trained language models Howard and Ruder ([2018](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib29 "")); Devlin et al. ([2019](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib14 "")); Liu et al. ([2019b](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib42 "")); Radford et al. ([2019](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib62 "")); Brown et al. ([2020](https://ar5iv.labs.arxiv.org/html/2104.14337#bib.bib8 "")). Equally important has been the rise of benchmarks that support the development of ambitious new data-driven models and that encourage apples-to-apples model comparisons. Benchmarks provide a north star goal for researchers, and are part of the reason we can confidently say we have made great st

... (truncated, 83 KB total)