Radford et al., "Improving Language Understanding by Generative Pre-Training" (OpenAI, 2018).
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
GPT-1 is the foundational paper that launched the GPT series; understanding it is essential context for evaluating the capabilities trajectory of large language models and their AI safety implications.
Summary
This paper introduces GPT-1, demonstrating that generative pre-training of a language model on large unlabeled text corpora followed by discriminative fine-tuning on specific tasks yields strong performance across diverse NLP benchmarks. It established the foundational paradigm of unsupervised pre-training plus supervised fine-tuning that underpins modern large language models. The work showed that transformer-based models can learn general-purpose language representations transferable to downstream tasks with minimal task-specific architecture changes.
Key Points
- Introduced GPT-1: a transformer decoder pre-trained with a language modeling objective on BooksCorpus (~7,000 books).
- Demonstrated that generative pre-training followed by task-specific fine-tuning outperforms task-specific architectures on 9 of 12 NLP benchmarks.
- Established the foundational pre-train/fine-tune paradigm that directly led to GPT-2, GPT-3, and subsequent large language models.
- Showed that unsupervised pre-training enables models to learn useful linguistic and world-knowledge representations without labeled data.
- Highlighted the importance of model scale and training data diversity for transferable language representations.
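
The language modeling objective mentioned above — maximize the likelihood of each next token given the preceding ones — can be sketched as a toy numpy computation. This is an illustrative sketch only; `next_token_loss` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy for predicting each next token.

    logits:    (T, V) array of unnormalized scores, one row per position
    token_ids: (T,)   integer token sequence

    Position t's logits are scored against the token at t+1, which is
    the standard left-to-right language-modeling objective GPT-1 is
    pre-trained with.
    """
    # Shift: logits at position t predict the token at position t+1.
    preds, targets = logits[:-1], token_ids[1:]
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = preds - preds.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of each observed next token, averaged.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: vocabulary of 4 tokens, sequence of 3 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
tokens = np.array([1, 3, 2])
loss = next_token_loss(logits, tokens)
```

A useful sanity check: with all-zero (uniform) logits the loss equals log V, the entropy of a uniform guess over the vocabulary.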
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Capability | 60.0 |
Cached Content Preview
Improving language understanding with unsupervised learning \| OpenAI
June 11, 2018
[Milestone](https://openai.com/research/index/milestone/)
# Improving language understanding with unsupervised learning
[Read paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) · [Code](https://github.com/openai/finetune-transformer-lm)

Illustration: Ben Barry
We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: [transformers](https://arxiv.org/abs/1706.03762) and [unsupervised pre-training](https://arxiv.org/abs/1511.01432). These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the past, and we hope our result motivates further research into applying this idea on larger and more diverse datasets.
| Dataset | Task | SOTA | Ours |
| --- | --- | --- | --- |
| SNLI | Textual entailment | 89.3 | 89.9 |
| MNLI matched | Textual entailment | 80.6 | 82.1 |
| MNLI mismatched | Textual entailment | 80.1 | 81.4 |
| SciTail | Textual entailment | 83.3 | 88.3 |
| QNLI | Textual entailment | 82.3 | 88.1 |
| RTE | Textual entailment | 61.7 | 56.0 |
| STS-B | Semantic similarity | 81.0 | 82.0 |
| QQP | Semantic similarity | 66.1 | 70.3 |
| MRPC | Semantic similarity | 86.0 | 82.3 |
| RACE | Reading comprehension | 53.3 | 59.0 |
| ROCStories | Commonsense reasoning | 77.6 | 86.5 |
| COPA | Commonsense reasoning | 71.2 | 78.6 |
| SST-2 | Sentiment analysis | 93.2 | 91.3 |
| CoLA | Linguistic acceptability | 35.0 | 45.4 |
| GLUE | Multi-task benchmark | 68.9 | 72.8 |
Our system works in two stages; first we train a transformer model on a very large amount of data in an unsupervised manner—using language modeling as a training signal—then we fine-tune this model on much smaller supervised datasets to help it solve specific tasks. We developed this approach following our [sentiment neuron](https://openai.com/index/unsupervised-sentiment-neuron/) work, in which we noted that unsupervised learning techniques can yield surprisingly discriminative features when trained on enough data. Here, we wanted to further explore this idea: can we develop one model, train it in an unsupervised way on a large amount of data, and then fine-tune the model to achieve good performance on many different tasks? Our results indicate that this approach works surprisingly well; the same core model can be fine-tuned for very different tasks with minimal adaptation.
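
The two-stage recipe above can be sketched minimally. During fine-tuning, the paper keeps the language-modeling loss as an auxiliary objective alongside the supervised task loss; the function name `finetune_objective` and the weight value below are illustrative assumptions, not the released code:

```python
def finetune_objective(task_loss, lm_loss, lm_weight=0.5):
    """Fine-tuning objective: supervised task loss plus a weighted
    auxiliary language-modeling term. The paper reports that retaining
    the LM objective during fine-tuning aids generalization; the weight
    shown here is an illustrative default, not the paper's tuned value."""
    return task_loss + lm_weight * lm_loss

# Stage 1: unsupervised pre-training on unlabeled text (LM loss only).
# Stage 2: supervised fine-tuning on a small labeled dataset, where the
# model is trained on the combined objective below.
combined = finetune_objective(task_loss=0.9, lm_loss=2.0)
```

Setting `lm_weight=0` recovers plain supervised fine-tuning, so the auxiliary term can be ablated with one parameter.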
This work build
... (truncated, 14 KB total)