Self-attention is one of the two applications of scaled dot-product attention in Vaswani, et al. (2017), the other being cross-attention. Recall that scaled dot-product attention is defined as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are matrices of the "query" vectors $q_i$, the "key" vectors $k_i$, and the "value" vectors $v_i$ respectively. (This is much more than a mere scaling of dot-product attention; see the discussion.)
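To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The shapes (n queries, m key/value pairs, dimensions d_k and d_v) and the function name are illustrative choices, not anything specified above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> context vectors of shape (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, m) scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of value vectors

# Example: 4 queries attending over 6 key/value pairs (all values made up)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 16)
```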

In self-attention, $Q$, $K$, and $V$ are all derived from the same sequence. They have different values because they are projected through different learned weight matrices $W^Q$, $W^K$, and $W^V$ respectively.
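As a sketch of how one sequence yields all three matrices, the snippet below (continuing from the `scaled_dot_product_attention` sketch above) projects a made-up input `X` through stand-in weight matrices; in a real model `W_Q`, `W_K`, and `W_V` would be learned, not random, and the dimensions here are arbitrary.

```python
# Continues from the scaled_dot_product_attention sketch above.
seq_len, d_model, d_k, d_v = 5, 32, 8, 8
rng = np.random.default_rng(1)

X = rng.normal(size=(seq_len, d_model))       # one input sequence
W_Q = rng.normal(size=(d_model, d_k))         # learned parameters in practice;
W_K = rng.normal(size=(d_model, d_k))         # random stand-ins here
W_V = rng.normal(size=(d_model, d_v))

# Q, K, and V all come from the same X, but through different projections
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
context = scaled_dot_product_attention(Q, K, V)   # (seq_len, d_v)
print(context.shape)
```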

The resulting context vectors (one for each query vector $q_i$) represent the expected semantic content at each position in the sequence, given the feature space learned by the weight matrices $W^Q$, $W^K$, and $W^V$. (In practice, there are multiple such feature spaces, each learned separately; this is the multi-head attention of Vaswani, et al., sketched below.)
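A rough sketch of that multi-head arrangement, continuing the illustrative dimensions from the previous snippet: each head gets its own projections, and the per-head context vectors are concatenated and passed through an output projection (here a random stand-in `W_O`), as in Vaswani, et al. (2017).

```python
# Continues from the previous snippet (reuses X, d_model, scaled_dot_product_attention).
num_heads, d_head = 4, 8
rng = np.random.default_rng(2)

heads = []
for _ in range(num_heads):
    W_Q = rng.normal(size=(d_model, d_head))  # each head has its own feature space
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))
    heads.append(scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V))

multi_head = np.concatenate(heads, axis=-1)   # (seq_len, num_heads * d_head)
W_O = rng.normal(size=(num_heads * d_head, d_model))
print((multi_head @ W_O).shape)               # back to (seq_len, d_model)
```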