Transformer-XL
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Transformer-XL introduces architectural improvements that let language models capture longer-term dependencies. Understanding these transformer capabilities and limitations informs AI safety considerations around model behavior and interpretability.
Paper Details
Metadata
Abstract
Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.
Summary
Transformer-XL addresses the fixed-length context limitation of standard Transformers in language modeling by introducing segment-level recurrence and a novel relative positional encoding scheme. This architecture enables models to learn dependencies 450% longer than vanilla Transformers while maintaining temporal coherence and avoiding context fragmentation. The approach achieves state-of-the-art results across multiple benchmarks (enwik8, text8, WikiText-103, One Billion Word, Penn Treebank), evaluates up to 1,800× faster than vanilla Transformers, and can generate coherent text spanning thousands of tokens.
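The core of segment-level recurrence can be illustrated with a minimal NumPy sketch: hidden states from the previous segment are cached and prepended to the keys and values of the current segment, so attention reaches across the segment boundary. This is a single-head, single-layer toy with hypothetical random weights (`w_q`, `w_k`, `w_v` are illustrative names, not from the paper); it omits the relative positional encoding, causal masking, multi-layer stacking, and the stop-gradient on the memory that the actual model uses.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention (single head; causal masking
    # omitted for brevity).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def segment_step(x, mem, w_q, w_k, w_v):
    """Process one segment `x`, reusing cached states `mem` from the
    previous segment. In the real model, gradients are stopped through
    `mem` and each layer caches its own hidden states."""
    # Keys/values range over cached memory + current segment, so
    # queries can attend beyond the fixed segment boundary.
    ctx = x if mem is None else np.concatenate([mem, x], axis=0)
    q = x @ w_q
    k = ctx @ w_k
    v = ctx @ w_v
    h = attention(q, k, v)
    new_mem = x  # cache this segment for the next step
    return h, new_mem

# Toy setup: hypothetical random projections, two segments of length 4.
rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
mem = None
for _ in range(2):
    seg = rng.normal(size=(4, d))
    h, mem = segment_step(seg, mem, w_q, w_k, w_v)
```

On the second segment the attention context has length 8 (4 cached + 4 current), which is how effective context grows linearly with the number of layers and memory length in the full model; caching also explains the evaluation speedup, since previous segments are never recomputed.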
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
Cached Content Preview
# Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai∗12, Zhilin Yang∗12, Yiming Yang1, Jaime Carbonell1,
Quoc V. Le2, Ruslan Salakhutdinov1
1Carnegie Mellon University, 2Google Brain
{dzihang,zhiliny,yiming,jgc,rsalakhu}@cs.cmu.edu, qvl@google.com
###### Abstract
Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling.
We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation.
Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning).
When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens.
Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch ([https://github.com/kimiyoung/transformer-xl](https://github.com/kimiyoung/transformer-xl "")).
∗ Equal contribution. Order determined by swapping the one in Yang et al. ([2017](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib62 "")).
## 1 Introduction
Language modeling is among the important problems that require modeling long-term dependency, with successful applications such as unsupervised pretraining (Dai and Le, [2015](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib9 ""); Peters et al., [2018](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib47 ""); Radford et al., [2018](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib50 ""); Devlin et al., [2018](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib11 "")).
However, it has been a challenge to equip neural networks with the capability to model long-term dependency in sequential data.
Recurrent neural networks (RNNs), in particular Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, [1997](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib20 "")), have been a standard solution to language modeling and obtained strong results on multiple benchmarks.
Despite the wide adaption, RNNs are difficult to optimize due to gradient vanishing and explosion (Hochreiter et al., [2001](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib19 "")), and the introduction of gating in LSTMs and the gradient clipping technique (Graves, [2013](https://ar5iv.labs.arxiv.org/html/1901.02860#bib.bib16 "")) …
... (truncated, 98 KB total)