Longterm Wiki

Radford et al., "Improving Language Understanding by Generative Pre-Training" (OpenAI, 2018).

Source type: web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

GPT-1 is the foundational paper that launched the GPT series; understanding it is essential context for evaluating the capabilities trajectory of large language models and their AI safety implications.

Metadata

Importance: 78/100 · blog post · primary source

Summary

This paper introduces GPT-1, demonstrating that generative pre-training of a language model on large unlabeled text corpora followed by discriminative fine-tuning on specific tasks yields strong performance across diverse NLP benchmarks. It established the foundational paradigm of unsupervised pre-training plus supervised fine-tuning that underpins modern large language models. The work showed that transformer-based models can learn general-purpose language representations transferable to downstream tasks with minimal task-specific architecture changes.

Key Points

  • Introduced GPT-1: a transformer decoder pre-trained with a language modeling objective on BooksCorpus (~7,000 books).
  • Demonstrated that generative pre-training followed by task-specific fine-tuning outperforms task-specific architectures on 9 of 12 NLP benchmarks.
  • Established the foundational pre-train/fine-tune paradigm that directly led to GPT-2, GPT-3, and subsequent large language models.
  • Showed that unsupervised pre-training enables models to learn useful linguistic and world-knowledge representations without labeled data.
  • Highlighted the importance of model scale and training data diversity for transferable language representations.
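The pre-train/fine-tune recipe in the key points above can be sketched numerically. This is an illustrative toy, not OpenAI's code: stage 1 maximizes next-token log-likelihood, and stage 2 optimizes a supervised task loss with the paper's lambda-weighted auxiliary language-modeling term (the paper uses lambda = 0.5). Function names here are placeholders.

```python
# Toy sketch of GPT-1's two training objectives (illustrative names,
# not OpenAI's actual implementation).
import math

def lm_loss(token_probs):
    """Stage 1: negative mean log-likelihood of the probabilities the
    model assigned to each observed next token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def finetune_loss(task_prob, token_probs, lam=0.5):
    """Stage 2: supervised cross-entropy on the task label plus a
    lambda-weighted auxiliary LM loss, mirroring the paper's
    L3 = L2 + lambda * L1 fine-tuning objective."""
    task = -math.log(task_prob)  # cross-entropy on the gold label
    return task + lam * lm_loss(token_probs)

# Example: probabilities the model assigned to three observed tokens.
probs = [0.25, 0.5, 0.125]
print(round(lm_loss(probs), 4))  # → 1.3863
```

The auxiliary LM term during fine-tuning is the paper's reported trick for improving generalization and convergence; dropping `lam` to 0 recovers plain supervised fine-tuning.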

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Large Language Models | Capability | 60.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 14 KB
Improving language understanding with unsupervised learning | OpenAI

June 11, 2018

[Milestone](https://openai.com/research/index/milestone/)

# Improving language understanding with unsupervised learning

[Read paper(opens in a new window)](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) [View code(opens in a new window)](https://github.com/openai/finetune-transformer-lm)

![Language Unsupervised](https://images.ctfassets.net/kftzwdyauwt9/bbab3cf5-a807-49f0-9f93ff442ad7/2cb463c19ae4d17e661a8a18ce574f6b/image-19.webp?w=3840&q=90&fm=webp)

Illustration: Ben Barry


We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: [transformers⁠(opens in a new window)](https://arxiv.org/abs/1706.03762) and [unsupervised pre-training⁠(opens in a new window)](https://arxiv.org/abs/1511.01432). These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the past, and we hope our result motivates further research into applying this idea on larger and more diverse datasets.

| **Dataset** | **Task** | **SOTA** | **Ours** |
| --- | --- | --- | --- |
| SNLI | Textual entailment | 89.3 | 89.9 |
| MNLI matched | Textual entailment | 80.6 | 82.1 |
| MNLI mismatched | Textual entailment | 80.1 | 81.4 |
| SciTail | Textual entailment | 83.3 | 88.3 |
| QNLI | Textual entailment | 82.3 | 88.1 |
| RTE | Textual entailment | 61.7 | 56.0 |
| STS-B | Semantic similarity | 81.0 | 82.0 |
| QQP | Semantic similarity | 66.1 | 70.3 |
| MRPC | Semantic similarity | 86.0 | 82.3 |
| RACE | Reading comprehension | 53.3 | 59.0 |
| ROCStories | Commonsense reasoning | 77.6 | 86.5 |
| COPA | Commonsense reasoning | 71.2 | 78.6 |
| SST-2 | Sentiment analysis | 93.2 | 91.3 |
| CoLA | Linguistic acceptability | 35.0 | 45.4 |
| GLUE | Multi task benchmark | 68.9 | 72.8 |

Our system works in two stages; first we train a transformer model on a very large amount of data in an unsupervised manner—using language modeling as a training signal—then we fine-tune this model on much smaller supervised datasets to help it solve specific tasks. We developed this approach following our [sentiment neuron⁠](https://openai.com/index/unsupervised-sentiment-neuron/) work, in which we noted that unsupervised learning techniques can yield surprisingly discriminative features when trained on enough data. Here, we wanted to further explore this idea: can we develop one model, train it in an unsupervised way on a large amount of data, and then fine-tune the model to achieve good performance on many different tasks? Our results indicate that this approach works surprisingly well; the same core model can be fine-tuned for very different tasks with minimal adaptation.
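The "minimal adaptation" across tasks comes from the paper's traversal-style input transformations: each structured task is serialized into a single token sequence bracketed by special start, delimiter, and extract tokens, so the pre-trained transformer needs no task-specific architecture. A minimal sketch, with placeholder token names (the paper's actual tokens are learned vocabulary entries):

```python
# Illustrative sketch of GPT-1's input transformations. The special
# token strings below are placeholders, not the paper's real vocabulary.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def entailment_input(premise, hypothesis):
    """Entailment: premise and hypothesis joined by a delimiter token."""
    return [START, premise, DELIM, hypothesis, EXTRACT]

def similarity_inputs(text_a, text_b):
    """Similarity has no inherent ordering, so the paper processes
    both orderings and sums the resulting representations."""
    return [entailment_input(text_a, text_b),
            entailment_input(text_b, text_a)]

print(entailment_input("A man is sleeping.", "Someone is awake."))
```

A linear classification head is then applied to the transformer's activation at the extract token, which is the only new task-specific parameter block added at fine-tuning time.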

This work build

... (truncated, 14 KB total)
Resource ID: 8a20a28f94410f17 | Stable ID: MTVhNmE0OT