Cross-attention (also called “encoder-decoder attention”) is one of the two applications of scaled dot-product attention in the original transformer model of Vaswani et al., the other being self-attention. It is used only in the decoder.

Recall that scaled dot-product attention is defined as

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, and $V$ represent matrices of “query” vectors, “key” vectors, and “value” vectors respectively, and $d_k$ is the dimensionality of the key vectors.
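For concreteness, here is a minimal sketch of this formula in PyTorch (the function name, tensor shapes, and the optional `mask` argument are illustrative assumptions, not part of the original definition):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., L_q, d_k), K: (..., L_k, d_k), V: (..., L_k, d_v)."""
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Positions where the boolean mask is False are excluded from the softmax.
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # attention weights; each row sums to 1
    return weights @ V                       # context vectors: weighted sums of the values
```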

In self-attention, $Q$, $K$, and $V$ are all projections of the same sequence. As such, a context vector (“attention vector”) $c_i$ represents the expected value of the features at the $i$-th position of that sequence.
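As a sketch (single head, PyTorch assumed, reusing the `scaled_dot_product_attention` function above; module and dimension names such as `d_model` and `d_k` are illustrative):

```python
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: Q, K, and V are all projections of the same sequence x."""
    def __init__(self, d_model, d_k):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k)
        self.w_k = nn.Linear(d_model, d_k)
        self.w_v = nn.Linear(d_model, d_k)

    def forward(self, x, mask=None):  # x: (batch, seq_len, d_model)
        Q, K, V = self.w_q(x), self.w_k(x), self.w_v(x)
        # Row i of the output is the context vector c_i for position i of x.
        return scaled_dot_product_attention(Q, K, V, mask)
```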

By contrast, in cross-attention, the keys and values correspond to positions in the output of (the last layer of) the encoder, while the queries are projections of the decoder state. The resulting context vector $c_i$ represents the expected value of the encoder output, given the decoder state at the $i$-th position.
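A corresponding cross-attention sketch (same assumptions as above, again reusing the `scaled_dot_product_attention` function) differs only in where the projections come from: queries from the decoder state, keys and values from the encoder output:

```python
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: queries from the decoder, keys/values from the encoder."""
    def __init__(self, d_model, d_k):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k)
        self.w_k = nn.Linear(d_model, d_k)
        self.w_v = nn.Linear(d_model, d_k)

    def forward(self, decoder_state, encoder_output):
        Q = self.w_q(decoder_state)   # (batch, tgt_len, d_k)
        K = self.w_k(encoder_output)  # (batch, src_len, d_k)
        V = self.w_v(encoder_output)  # (batch, src_len, d_k)
        # Row i of the output is the expected value of the (projected) encoder
        # output, weighted by its relevance to decoder position i.
        return scaled_dot_product_attention(Q, K, V)  # unmasked; see below
```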

In machine translation, this correspondence would suffice for sequence alignment even with a single attention block in the encoder and a single one in the decoder. However, each attention block learns a more abstract feature set than the last (up to a point). As a result, such a context vector can do much more than simple sequence alignment: it allows the model to associate the semantics of the input sequence with the current token in the output sequence. This property is foundational to the remarkable generative capabilities of large language models.

Because cross-attention does not attend to the decoder state, it is not masked: every decoder position is free to attend to every position of the encoder output, since the entire input sequence is already available.
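To make the contrast concrete, a hypothetical usage of the sketches above would pass a causal (lower-triangular) mask to the decoder’s self-attention but no mask at all to cross-attention:

```python
import torch

batch, src_len, tgt_len, d_model, d_k = 2, 7, 5, 32, 32
encoder_output = torch.randn(batch, src_len, d_model)
decoder_state = torch.randn(batch, tgt_len, d_model)

# Decoder self-attention: the causal mask keeps position i from attending to j > i.
causal_mask = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
out_self = SelfAttention(d_model, d_k)(decoder_state, mask=causal_mask)  # (batch, tgt_len, d_k)

# Cross-attention: no mask; every decoder position attends to every encoder position.
out_cross = CrossAttention(d_model, d_k)(decoder_state, encoder_output)  # (batch, tgt_len, d_k)
```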