Cross-attention (also called “encoder-decoder attention”) is one of two applications of scaled dot-product attention in the Vaswani transformer model, the other being self-attention. It is used only in the decoder.
Recall that scaled dot-product attention is defined as

\[
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,
\]

where \(Q\), \(K\), and \(V\) are matrices whose rows are the query, key, and value vectors of the individual positions, and \(d_k\) is the dimensionality of the queries and keys.
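As a minimal sketch, the definition above can be written in NumPy as follows (the function name, shapes, and single-matrix formulation are illustrative; real implementations are batched, multi-headed, and use learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values.
    Returns an (m, d_v) matrix of context vectors, one per query.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (m, n): similarity of each query to each key
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # weighted sum of the value vectors
```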
In self-attention, the queries, keys, and values are all derived from the same sequence: each position attends to the representations of every position produced by the previous layer of the same stack.
By contrast, in cross-attention, the queries come from the decoder, while the keys and values correspond to positions in the output of (the last layer of) the encoder. The resulting context vector for each decoder position is a weighted sum of encoder representations, with the weights reflecting how relevant each input position is to that decoder position.
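A sketch of how such a block is wired, reusing `scaled_dot_product_attention` from above (the projection matrices, toy dimensions, and single head are assumptions made for brevity, not the paper's exact setup):

```python
rng = np.random.default_rng(0)

d_model, d_k, d_v = 16, 8, 8      # toy sizes, chosen only for the example
n_enc, n_dec = 6, 4               # lengths of the input and output sequences

encoder_output = rng.normal(size=(n_enc, d_model))   # output of the encoder's last layer
decoder_state  = rng.normal(size=(n_dec, d_model))   # current decoder-layer representations

# Illustrative single-head projections (a real block learns these and uses multiple heads).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

# Queries come from the decoder; keys and values come from the encoder output.
Q = decoder_state @ W_Q            # (n_dec, d_k)
K = encoder_output @ W_K           # (n_enc, d_k)
V = encoder_output @ W_V           # (n_enc, d_v)

context = scaled_dot_product_attention(Q, K, V)      # (n_dec, d_v): one context vector per output position
```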
In machine translation, this correspondence would serve the purpose of sequence alignment even with a single attention block in each of the encoder and the decoder. However, each attention block learns a more abstract feature set than the one before it (up to a point). As a result, the context vector can do much more than simple sequence alignment: it allows the model to associate the semantics of the input sequence with the current token in the output sequence. This property is foundational to the remarkable generative capabilities of large language models.
Because cross-attention attends to the encoder output rather than to the decoder's own states, every position of the input sequence is already available at every decoding step, so no causal mask is applied (unlike in decoder self-attention).
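For contrast, a sketch of the causal mask used in decoder self-attention, continuing the example above (the mask construction shown here is one common way to implement it, not the paper's literal code):

```python
# Causal mask for decoder *self*-attention: position i may only attend to positions <= i.
# (Projections are reused from the sketch above purely for brevity; in practice each
#  attention block has its own learned parameters.)
Q_self = decoder_state @ W_Q
K_self = decoder_state @ W_K
V_self = decoder_state @ W_V

scores = Q_self @ K_self.T / np.sqrt(d_k)                          # (n_dec, n_dec)
causal_mask = np.triu(np.ones((n_dec, n_dec), dtype=bool), k=1)    # True above the diagonal
scores[causal_mask] = -np.inf                                      # hide "future" output positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
self_context = weights @ V_self

# Cross-attention needs no such mask: every encoder position is already known
# at every decoding step, so nothing has to be hidden.
```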