Attention mechanisms papers

paper

2014·arXiv·arxiv.org/abs/1409.0473

Authors

Dzmitry Bahdanau·Kyunghyun Cho·Yoshua Bengio

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Seminal paper introducing attention mechanisms in neural machine translation, a foundational technique that enables models to focus on relevant input parts and is critical to understanding modern large language models and their capabilities.

Paper Details

Citations

29,135

2,585 influential

Year

2014

arXiv:1409.0473 DOI:10.7717/peerjcs.2169/fig-5 Semantic Scholar

Metadata

arxiv preprint · primary source

Abstract

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

Summary

This paper introduces an attention mechanism for neural machine translation, addressing a key limitation of encoder-decoder architectures: the fixed-length vector bottleneck. The authors propose letting the model automatically (soft-)search for the parts of the source sentence that are relevant to each target word, rather than compressing all information into a single fixed-length representation. The approach achieves English-to-French translation performance comparable to the existing state-of-the-art phrase-based system, and qualitative analysis shows that the learned soft alignments agree well with intuition.
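The soft-search idea summarized above can be illustrated with a minimal sketch of additive attention. It assumes the commonly cited scoring form, score_j = v · tanh(W s + U h_j), followed by a softmax over source positions. The function names, dimensions, and random parameters below are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(s_prev, annotations, W, U, v):
    """Return (weights, context) for one decoder step.

    s_prev      -- previous decoder state (assumed name)
    annotations -- one encoder vector per source position
    W, U, v     -- alignment parameters (random here, learned in practice)
    """
    projected_state = matvec(W, s_prev)
    scores = []
    for h in annotations:
        projected_h = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)  # soft alignment over source positions
    dim = len(annotations[0])
    # The context vector is the weighted average of the annotations.
    context = [sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(dim)]
    return weights, context


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, assumed for illustration only.
rng = random.Random(0)
src_len, enc_dim, dec_dim, att_dim = 5, 4, 3, 6
annotations = random_matrix(src_len, enc_dim, rng)
s_prev = [rng.uniform(-0.5, 0.5) for _ in range(dec_dim)]
W = random_matrix(att_dim, dec_dim, rng)
U = random_matrix(att_dim, enc_dim, rng)
v = [rng.uniform(-0.5, 0.5) for _ in range(att_dim)]

weights, context = additive_attention(s_prev, annotations, W, U, v)
```

Because the context vector is recomputed at every decoder step, the model is not forced to squeeze the whole source sentence into one fixed-length vector; the weights play the role of the soft alignments mentioned in the summary.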

Cited by 1 page

Page	Type	Quality
Yoshua Bengio	Person	39.0

Cached Content Preview

HTTP 200 · Fetched May 4, 2026 · 64 KB

Neural Machine Translation by Jointly Learning to Align and Translate

 
 
 
Dzmitry Bahdanau 
 Jacobs University Bremen, Germany
KyungHyun Cho, Yoshua Bengio
 Université de Montréal
 CIFAR Senior Fellow 
 

 
 Abstract

 Neural machine translation is a recently proposed approach to machine
translation. Unlike the traditional statistical machine translation, the
neural machine translation aims at building a single neural network that
can be jointly tuned to maximize the translation performance. The models
proposed recently for neural machine translation often belong to a family
of encoder–decoders and encode a source sentence into a fixed-length
vector from which a decoder generates a translation. In this paper, we
conjecture that the use of a fixed-length vector is a bottleneck in
improving the performance of this basic encoder–decoder architecture, and
propose to extend this by allowing a model to automatically
 (soft-)search for parts of a source sentence that are relevant to
predicting a target word, without having to form these parts as a hard
segment explicitly. With this new approach, we achieve a translation
performance comparable to the existing state-of-the-art phrase-based system
on the task of English-to-French translation. Furthermore, qualitative
analysis reveals that the (soft-)alignments found by the model agree
well with our intuition.

 
 
 
 1 Introduction

 
 Neural machine translation is a newly emerging approach to machine
translation, recently proposed by Kalchbrenner and Blunsom (2013),
Sutskever et al. (2014) and Cho et al. (2014b). Unlike the traditional phrase-based
translation system (see, e.g., Koehn et al., 2003) which consists of many small
sub-components that are tuned separately, neural machine translation attempts
to build and train a single, large neural network that reads a sentence and
outputs a correct translation.

 
 
 Most of the proposed neural machine translation models belong to a family of
encoder–decoders (Sutskever et al., 2014; Cho et al., 2014a), with an encoder and a
decoder for each language, or involve a language-specific encoder applied to
each sentence whose outputs are then compared (Hermann and Blunsom, 2014). An encoder
neural network reads and encodes a source sentence into a fixed-length vector.
A decoder then outputs a translation from the encoded vector. The whole
encoder–decoder system, which consists of the encoder and the decoder for a
language pair, is jointly trained to maximize the probability of a correct
translation given a source sentence.
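The joint training objective described in the paragraph above can be written out as a short sketch. The notation is an assumption made for illustration and is not taken from this excerpt: $\theta$ stands for all encoder and decoder parameters, $(x^{(n)}, y^{(n)})$ for the $n$-th of $N$ sentence pairs, $c$ for the fixed-length vector, and $\mathrm{enc}_\theta$ for the encoder.

```latex
\begin{align}
  \theta^{\star} &= \arg\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N}
      \log p_{\theta}\!\left(y^{(n)} \mid x^{(n)}\right), \\
  p_{\theta}(y \mid x) &= \prod_{t=1}^{T_y}
      p_{\theta}\!\left(y_t \mid y_1, \dots, y_{t-1},\, c\right),
  \qquad c = \mathrm{enc}_{\theta}(x).
\end{align}
```

In this form every target word $y_t$ is conditioned on the same vector $c$, whatever the length of the source sentence $x$.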

 
 
 A potential issue with this encoder–decoder approach is that a neural network
needs to be able to compress all the necessary information of a source sentence
into a fixed-length vector. This may make it difficult for the neural network
to cope with long sentences, especially those that are longer than the
sentences in the training corpus. Cho et al. (2014b) showed that indeed the
performance of a basic 

... (truncated, 64 KB total)

Resource ID: c23bf2a3bc33cb7a | Stable ID: sid_LaXZj2fJt9