Bahdanau attention, introduced in Bahdanau, Cho, and Bengio (2014), was the first formulation of attention in a neural network. It was referenced almost immediately as an emerging direction in LeCun, Bengio, and Hinton (2015). (This is unsurprising, as Bengio was an author of both.)

Bahdanau attention is called “additive” because the scalar attention score is learned via a feedforward neural network that takes an encoder hidden state and a decoder hidden state as input:

$$
e_{ij} = v_a^\top \tanh\!\left(W_a s_{i-1} + U_a h_j + b_a\right)
$$

where $W_a$ and $U_a$ are learned matrices, $v_a$ and $b_a$ are learned vectors, $s_{i-1}$ is the previous decoder hidden state, and $h_j$ is the $j$-th encoder hidden state.
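To make the score function concrete, here is a minimal NumPy sketch (not from the paper; the names $W_a$, $U_a$, $v_a$, $b_a$ follow the equation above, and the shapes are illustrative):

```python
import numpy as np

def additive_score(s_prev, h_j, W_a, U_a, v_a, b_a):
    """Additive (Bahdanau) attention score e_ij for one encoder state.

    s_prev : previous decoder hidden state s_{i-1}, shape (d_dec,)
    h_j    : encoder hidden state h_j,              shape (d_enc,)
    W_a    : (d_att, d_dec)   U_a : (d_att, d_enc)
    v_a    : (d_att,)         b_a : (d_att,)
    """
    # One tanh hidden layer, projected down to a scalar score by v_a.
    return v_a @ np.tanh(W_a @ s_prev + U_a @ h_j + b_a)
```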

The set of all $e_{ij}$ for the $i$-th decoder state is then softmaxed so that we can compute a weighted sum of the encoder states $h_j$. The resulting vector $c_i$ is the context vector for the following decoder state $s_i$:

$$
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},
\qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
$$

where $T_x$ is the length of the input sequence (and hence the number of encoder hidden states).
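Putting the two steps together, a single decoder step's attention might be sketched as follows in NumPy, vectorized over all $T_x$ encoder states. The parameter names mirror the equations above and are illustrative rather than tied to any particular implementation:

```python
import numpy as np

def bahdanau_attention(s_prev, H, W_a, U_a, v_a, b_a):
    """Attention weights and context vector for one decoder step.

    s_prev : previous decoder hidden state s_{i-1}, shape (d_dec,)
    H      : all encoder hidden states,             shape (T_x, d_enc)
    Returns (alpha, c): weights of shape (T_x,) and context of shape (d_enc,).
    """
    # Scores e_{ij} = v_a^T tanh(W_a s_{i-1} + U_a h_j + b_a), for all j at once.
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T + b_a) @ v_a      # (T_x,)
    # Softmax over the T_x input positions gives the weights alpha_{ij}.
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Context vector c_i is the alpha-weighted sum of encoder states.
    c = alpha @ H                                             # (d_enc,)
    return alpha, c
```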