Bahdanau attention is called “additive” because the scalar attention score is learned via a feedforward neural network that takes an encoder hidden state and a decoder hidden state as input:

$$e_{t,j} = v_a^\top \tanh\left(W_a s_{t-1} + U_a h_j + b_a\right)$$

where $W_a$, $U_a$, $b_a$, and $v_a$ are learned matrices/vectors.
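To make the shapes concrete, here is a minimal NumPy sketch of the additive score. The parameter names ($W_a$, $U_a$, $b_a$, $v_a$), the hidden size `n`, and the helper `additive_score` are illustrative assumptions, not a reference implementation.

```python
import numpy as np

# Hypothetical setup: hidden size n for both the encoder states h_j
# and the previous decoder state s_{t-1}.
n = 8
rng = np.random.default_rng(0)

W_a = rng.normal(size=(n, n))   # projects the decoder state s_{t-1}
U_a = rng.normal(size=(n, n))   # projects an encoder state h_j
b_a = rng.normal(size=n)        # bias term
v_a = rng.normal(size=n)        # maps the tanh output to a scalar score

def additive_score(s_prev, h_j):
    """Scalar alignment score e_{t,j} = v_a^T tanh(W_a s_{t-1} + U_a h_j + b_a)."""
    return v_a @ np.tanh(W_a @ s_prev + U_a @ h_j + b_a)
```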
The set of all $e_{t,j}$ for the $t$-th decoder state is then softmaxed so that we can compute a weighted sum of the encoder states $h_j$. The resulting vector is the context vector $c_t$ for the following decoder state $s_t$:

$$\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{T_x} \exp(e_{t,k})}, \qquad c_t = \sum_{j=1}^{T_x} \alpha_{t,j} h_j$$

where $T_x$ is the length of the input sequence (and hence the number of encoder hidden states).
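Continuing the sketch above, the following computes the attention weights and the context vector for a single decoder step. The function name `attention_context` and the toy dimensions are again assumptions for illustration.

```python
def attention_context(s_prev, H):
    """Context vector c_t for one decoder step.

    H has shape (T_x, n): one row per encoder hidden state h_j.
    """
    # Alignment scores e_{t,j} for every encoder position j.
    e = np.array([additive_score(s_prev, h_j) for h_j in H])
    # Softmax over the T_x scores gives the attention weights alpha_{t,j}.
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Weighted sum of the encoder states is the context vector c_t.
    return alpha @ H

# Example usage with random states:
H = rng.normal(size=(5, n))        # T_x = 5 encoder hidden states
s_prev = rng.normal(size=n)        # previous decoder state s_{t-1}
c_t = attention_context(s_prev, H) # shape (n,)
```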