Transformer-XL
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Transformer-XL introduces architectural improvements that let language models capture longer-term dependencies. Understanding these transformer capabilities and limitations informs AI safety considerations around model behavior and interpretability.
Paper Details
Metadata
Abstract
Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.
Summary
Transformer-XL addresses the fixed-length context limitation of standard Transformers in language modeling by introducing segment-level recurrence and a novel relative positional encoding scheme. This architecture enables models to learn dependencies 450% longer than vanilla Transformers while maintaining temporal coherence and avoiding context fragmentation. The approach achieves state-of-the-art results across multiple benchmarks (enwik8, text8, WikiText-103, One Billion Word, Penn Treebank), evaluates up to 1,800× faster than vanilla Transformers, and can generate coherent text spanning thousands of tokens.
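The core of segment-level recurrence can be illustrated with a minimal NumPy sketch: hidden states from the previous segment are cached and prepended to the keys and values of the current segment, so attention reaches across the segment boundary. This is a single-head, single-layer toy with hypothetical random weights (`w_q`, `w_k`, `w_v` are illustrative names, not from the paper); it omits the relative positional encoding, causal masking, multi-layer stacking, and the stop-gradient on the memory that the actual model uses.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention (single head; causal masking
    # omitted for brevity).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def segment_step(x, mem, w_q, w_k, w_v):
    """Process one segment `x`, reusing cached states `mem` from the
    previous segment. In the real model, gradients are stopped through
    `mem` and each layer caches its own hidden states."""
    # Keys/values range over cached memory + current segment, so
    # queries can attend beyond the fixed segment boundary.
    ctx = x if mem is None else np.concatenate([mem, x], axis=0)
    q = x @ w_q
    k = ctx @ w_k
    v = ctx @ w_v
    h = attention(q, k, v)
    new_mem = x  # cache this segment for the next step
    return h, new_mem

# Toy setup: hypothetical random projections, two segments of length 4.
rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
mem = None
for _ in range(2):
    seg = rng.normal(size=(4, d))
    h, mem = segment_step(seg, mem, w_q, w_k, w_v)
```

On the second segment the attention context has length 8 (4 cached + 4 current), which is how effective context grows linearly with the number of layers and memory length in the full model; caching also explains the evaluation speedup, since previous segments are never recomputed.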
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
Cached Content Preview
# Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai∗12, Zhilin Yang∗12, Yiming Yang1, Jaime Carbonell1,
Quoc V. Le2, Ruslan Salakhutdinov1
1Carnegie Mellon University, 2Google Brain
{dzihang,zhiliny,yiming,jgc,rsalakhu}@cs.cmu.edu, qvl@google.com
###### Abstract
Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling.
We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation.
Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning).
When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens.
Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch ([https://github.com/kimiyoung/transformer-xl](https://github.com/kimiyoung/transformer-xl "")).
∗ Equal contribution. Order determined by swapping the one in Yang et al. ([2017](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib62 "")).
## 1 Introduction
Language modeling is among the important problems that require modeling long-term dependency, with successful applications such as unsupervised pretraining (Dai and Le, [2015](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib9 ""); Peters et al., [2018](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib47 ""); Radford et al., [2018](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib50 ""); Devlin et al., [2018](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib11 "")).
However, it has been a challenge to equip neural networks with the capability to model long-term dependency in sequential data.
Recurrent neural networks (RNNs), in particular Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, [1997](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib20 "")), have been a standard solution to language modeling and obtained strong results on multiple benchmarks.
Despite the wide adaption, RNNs are difficult to optimize due to gradient vanishing and explosion (Hochreiter et al., [2001](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib19 "")), and the introduction of gating in LSTMs and the gradient clipping technique (Graves, [2013](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib16 "")) …
... (truncated, 98 KB total)