Longterm Wiki

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - ACL Anthology

aclanthology.org/N19-1423

A landmark NLP capabilities paper; understanding BERT is background knowledge for interpreting modern LLM safety research, as many alignment and interpretability studies build directly on the transformer encoder architecture that BERT popularized.

Metadata

Importance: 72/100 · conference paper · primary source

Summary

BERT (Bidirectional Encoder Representations from Transformers) introduces a language-model pre-training approach that uses masked language modeling and next-sentence prediction objectives to learn deep bidirectional representations. Fine-tuned with just a single additional output layer, BERT achieved state-of-the-art results across eleven NLP benchmarks at the time of publication, and it became a foundational architecture underlying many subsequent large language models relevant to AI capabilities and safety research.
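The "single additional output layer" claim can be made concrete: task-specific fine-tuning adds only a linear map (plus softmax) over the final hidden state of the `[CLS]` token. A minimal NumPy sketch, using toy dimensions chosen here for illustration (BERT-base actually uses a hidden size of 768):

```python
import numpy as np

def classify(h_cls, W, b):
    """The entire task head BERT adds for classification:
    one linear layer over the [CLS] representation, then softmax."""
    logits = W @ h_cls + b                  # shape: (num_labels,)
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()                  # class probabilities

rng = np.random.default_rng(0)
hidden_size, num_labels = 8, 3              # toy sizes for illustration
probs = classify(rng.normal(size=hidden_size),
                 rng.normal(size=(num_labels, hidden_size)),
                 np.zeros(num_labels))
```

During fine-tuning, `W` and `b` are trained jointly with all pre-trained transformer weights; the head itself contributes only `num_labels × hidden_size + num_labels` new parameters.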

Key Points

  • Introduces bidirectional pre-training via masked token prediction, contrasting with left-to-right or shallow concatenation approaches used previously.
  • Demonstrates that a single pre-trained model can be fine-tuned for diverse NLP tasks with minimal task-specific architecture changes.
  • Achieved SOTA on GLUE, SQuAD, and other benchmarks, establishing pre-training + fine-tuning as the dominant NLP paradigm.
  • Foundational to understanding the capabilities of large language models and their scaling behavior, relevant to AI safety capability assessments.
  • Spurred extensive interpretability and probing research examining what linguistic and factual knowledge is encoded in transformer representations.

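The masked-token objective in the first point can be sketched as follows. This is a minimal illustration of the 80/10/10 masking scheme described in the paper; the toy vocabulary and token strings are placeholders, not BERT's actual WordPiece tokenizer:

```python
import random

MASK_TOKEN = "[MASK]"
TOY_VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # illustrative only

def mlm_mask(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: ~15% of tokens become prediction targets;
    of those, 80% -> [MASK], 10% -> a random token, 10% kept as-is."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                        # model must recover this
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK_TOKEN)             # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(TOY_VOCAB))  # 10%: random token
            else:
                inputs.append(tok)                    # 10%: keep original
        else:
            labels.append(None)                       # ignored by the MLM loss
            inputs.append(tok)
    return inputs, labels

inputs, labels = mlm_mask(["the", "cat", "sat", "on", "the", "mat"] * 10)
```

Keeping 10% of targets unchanged forces the model to maintain a contextual representation of every input token, since it cannot tell which positions will be scored.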
Cited by 1 page

| Page | Type | Quality |
|------|------|---------|
| Deep Learning Revolution Era | Historical | 44.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 4 KB
## [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://aclanthology.org/N19-1423.pdf)

[Jacob Devlin](https://aclanthology.org/people/jacob-devlin/unverified/),
[Ming-Wei Chang](https://aclanthology.org/people/ming-wei-chang/unverified/),
[Kenton Lee](https://aclanthology.org/people/kenton-lee/unverified/),
[Kristina Toutanova](https://aclanthology.org/people/kristina-toutanova/unverified/)

* * *

##### Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5 (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

- Anthology ID: N19-1423
- Volume: [Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)](https://aclanthology.org/volumes/N19-1/)
- Month: June
- Year: 2019
- Address: Minneapolis, Minnesota
- Editors: [Jill Burstein](https://aclanthology.org/people/jill-burstein/unverified/), [Christy Doran](https://aclanthology.org/people/christy-doran/unverified/), [Thamar Solorio](https://aclanthology.org/people/thamar-solorio/)
- Venue: [NAACL](https://aclanthology.org/venues/naacl/)
- Publisher: Association for Computational Linguistics
- Pages: 4171–4186
- URL: [https://aclanthology.org/N19-1423/](https://aclanthology.org/N19-1423/)
- DOI: [10.18653/v1/N19-1423](https://doi.org/10.18653/v1/N19-1423)
- Award: Best Long Paper
- Bibkey: devlin-etal-2019-bert
- Cite (ACL): Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://aclanthology.org/N19-1423/). In _Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)_, pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Cite (Informal): [BERT: Pre-trai

... (truncated, 4 KB total)
Resource ID: 80b6364ef8bbf595 | Stable ID: MTZhZDlhZm