Dot-product attention, introduced in Luong, Pham, and Manning (2015), is a computationally simpler formulation of attention than the additive attention that preceded it. Rather than learning a feedforward network and applying it separately to the pairing of a given decoder state with each encoder hidden state, it simply takes the dot product of the relevant encoder and decoder states:

$$\text{score}(h_t, \bar{h}_s) = h_t^\top \bar{h}_s$$
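
To make this concrete, here is a minimal NumPy sketch of the scoring step for a single decoder state; the function name and shapes are illustrative rather than taken from the paper:

```python
import numpy as np

def dot_attention(decoder_state, encoder_states):
    """Dot-product attention for one decoder state over all encoder states.

    decoder_state:  shape (d,)         -- the current decoder hidden state h_t
    encoder_states: shape (src_len, d) -- the encoder hidden states h_s
    """
    # score(h_t, h_s) = h_t . h_s for every source position s
    scores = encoder_states @ decoder_state        # (src_len,)
    # Softmax over source positions gives the attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The context vector is the weighted sum of the encoder states.
    context = weights @ encoder_states             # (d,)
    return context, weights
```
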
Like additive attention, it requires a comparison with every encoder hidden state for each decoder state, resulting in quadratic time complexity if we assume the source and target sequences are roughly the same length. However, because dot-product attention can be computed in batches on a GPU using only matrix operations, it is vastly faster in practice.
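
The batched form is just two matrix multiplications. The sketch below (sizes are made up for illustration) computes the scores for every decoder/encoder pair at once rather than looping over decoder states:

```python
import numpy as np

# Illustrative sizes; in practice these come from the model and the batch.
src_len, tgt_len, d = 5, 7, 8
rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((src_len, d))   # all encoder hidden states
decoder_states = rng.standard_normal((tgt_len, d))   # all decoder hidden states

# Every decoder/encoder dot product at once: one matrix multiplication
# replaces the per-state loop.
scores = decoder_states @ encoder_states.T            # (tgt_len, src_len)

# Row-wise softmax: one distribution over source positions per decoder state.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# A second matrix multiplication yields every context vector.
contexts = weights @ encoder_states                   # (tgt_len, d)
```
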

Notably, this version of attention had not yet decoupled the “key” and “value” representations of the encoder hidden states; that separation came in Vaswani et al. (2017).