Dot-product attention, introduced in Luong, Pham, and Manning (2015), is a computationally simpler formulation of attention than the additive attention introduced before it. Rather than learning a feedforward network and then applying it to score a given decoder state against each encoder hidden state, it scores the pair directly as the dot product of the two vectors.
Like additive attention, it requires scoring every encoder hidden state against the current decoder state at each decoding step, normalizing the scores with a softmax, and taking the resulting weighted sum of encoder states as the context vector.
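As a rough sketch of the mechanism, here is a minimal NumPy version; the names `decoder_state` and `encoder_states` are illustrative rather than taken from the paper, and shapes are for a single decoding step:

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Luong-style dot-product attention for one decoding step.

    decoder_state:  (d,)   current decoder hidden state
    encoder_states: (n, d) one encoder hidden state per source position
    Returns the context vector (d,) and the attention weights (n,).
    """
    # Score each encoder state with a plain dot product against the decoder state.
    scores = encoder_states @ decoder_state          # (n,)
    # Softmax over source positions to get attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # (n,)
    # Context vector: attention-weighted sum of the encoder states.
    context = weights @ encoder_states               # (d,)
    return context, weights

# Example usage with random vectors.
rng = np.random.default_rng(0)
h_t = rng.normal(size=8)           # decoder state, d = 8
h_bar = rng.normal(size=(5, 8))    # 5 encoder states
context, weights = dot_product_attention(h_t, h_bar)
```

Note that the encoder states do double duty in this sketch: they are both what gets scored against the decoder state and what gets averaged into the context vector.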
Notably, this version of attention had not yet decoupled the “key” and “value” representations of the encoder hidden states; that separation came with Vaswani et al. (2017).