Longterm Wiki

Transformer-XL

paper

Authors

Zihang Dai·Zhilin Yang·Yiming Yang·Jaime Carbonell·Quoc V. Le·Ruslan Salakhutdinov

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Transformer-XL introduces architectural improvements that let language models handle longer-term dependencies, which is relevant to understanding transformer capabilities and limitations that bear on AI safety concerns around model behavior and interpretability.

Paper Details

Citations
0 · 435 influential
Year
2019

Metadata

arXiv preprint · primary source

Abstract

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.
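The "novel positional encoding scheme" the abstract mentions is a relative encoding: attention logits are built from a content term plus a term that depends only on the offset between query and key positions, each with a learned global bias. A minimal numpy sketch of that score computation follows; the variable names (`r`, `u`, `v_bias`) are illustrative stand-ins, not the paper's exact notation, and the gather-by-offset replaces the paper's efficient "relative shift" trick.

```python
import numpy as np

def rel_pos_scores(q, k, r, u, v_bias):
    """Attention logits with relative positional encodings (sketch).

    q, k: (L, d) content queries/keys.
    r: (2L-1, d) embeddings for relative offsets -(L-1) .. L-1.
    u, v_bias: (d,) learned global biases for content and position terms.
    """
    L = q.shape[0]
    # Content-based term plus global content bias.
    content = (q + u) @ k.T                      # (L, L)
    # Position-based term plus global position bias, for every offset.
    pos_all = (q + v_bias) @ r.T                 # (L, 2L-1)
    # For each (i, j), pick the column for relative offset i - j,
    # shifted so offset -(L-1) maps to column 0.
    idx = (np.arange(L)[:, None] - np.arange(L)[None, :]) + (L - 1)
    position = np.take_along_axis(pos_all, idx, axis=1)
    return (content + position) / np.sqrt(q.shape[-1])
```

Because the positional term depends only on the offset `i - j`, the same pattern of biases applies at every absolute position, which is what lets the cached memory from a previous segment be reused without stale absolute position information.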

Summary

Transformer-XL addresses the fixed-length context limitation of standard Transformers in language modeling by introducing segment-level recurrence and a relative positional encoding scheme. This architecture lets models learn dependencies 450% longer than vanilla Transformers (and 80% longer than RNNs) while maintaining temporal coherence and avoiding context fragmentation. The approach achieves state-of-the-art results across multiple benchmarks (enwik8, text8, WikiText-103, One Billion Word, Penn Treebank), evaluates up to 1,800+ times faster than vanilla Transformers, and can generate coherent text spanning thousands of tokens.
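The segment-level recurrence described above can be sketched in a few lines: hidden states from the previous segment are cached (with no gradient flowing into them) and concatenated in front of the current segment, so queries from the current segment attend over the extended context. This is an illustrative numpy sketch under assumed shapes, not the paper's implementation.

```python
import numpy as np

def attend_with_memory(h_prev_mem, h_curr, W_q, W_k, W_v):
    """One attention step with segment-level recurrence (sketch).

    h_prev_mem: (mem_len, d) cached hidden states from the previous
        segment; during training these are treated as constants.
    h_curr: (seg_len, d) hidden states of the current segment.
    """
    mem_len, seg_len = h_prev_mem.shape[0], h_curr.shape[0]
    # Extended context: cached memory concatenated with the current segment.
    h_ext = np.concatenate([h_prev_mem, h_curr], axis=0)
    q = h_curr @ W_q    # queries come only from the current segment
    k = h_ext @ W_k     # keys/values span memory + current segment
    v = h_ext @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Causal mask: position i may attend to all memory plus positions <= i.
    mask = np.triu(np.ones((seg_len, mem_len + seg_len)), k=mem_len + 1)
    scores = np.where(mask.astype(bool), -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

At evaluation time this is where the 1,800+× speedup comes from: instead of recomputing a full window from scratch for every new position, the model advances one segment at a time and reuses the cached states.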

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Long-Horizon Autonomous Tasks | Capability | 65.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai\*¹², Zhilin Yang\*¹², Yiming Yang¹, Jaime Carbonell¹,

Quoc V. Le², Ruslan Salakhutdinov¹

¹Carnegie Mellon University, ²Google Brain

{dzihang,zhiliny,yiming,jgc,rsalakhu}@cs.cmu.edu, qvl@google.com

###### Abstract

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling.
We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation.
Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning).
When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens.
Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch¹ [https://github.com/kimiyoung/transformer-xl](https://github.com/kimiyoung/transformer-xl "").

¹ Equal contribution. Order determined by swapping the one in Yang et al. ([2017](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib62 "")).

## 1 Introduction

Language modeling is among the important problems that require modeling long-term dependency, with successful applications such as unsupervised pretraining (Dai and Le, [2015](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib9 ""); Peters et al., [2018](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib47 ""); Radford et al., [2018](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib50 ""); Devlin et al., [2018](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib11 "")).
However, it has been a challenge to equip neural networks with the capability to model long-term dependency in sequential data.
Recurrent neural networks (RNNs), in particular Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, [1997](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib20 "")), have been a standard solution to language modeling and obtained strong results on multiple benchmarks.
Despite the wide adoption, RNNs are difficult to optimize due to gradient vanishing and explosion (Hochreiter et al., [2001](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib19 "")), and the introduction of gating in LSTMs and the gradient clipping technique (Graves, [2013](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib16 

... (truncated, 98 KB total)
Resource ID: 5b39694ffd7eee39 | Stable ID: N2I2OWZiYT