Scaled dot-product attention is the attention mechanism introduced for transformer models in Vaswani et al. (2017). The name, I think, sells the strategy quite short; it is much more than just dot-product attention with a scaling factor.

Recall that dot-product attention is defined as:

$$\alpha_i = \frac{\exp\left(s^\top h_i\right)}{\sum_j \exp\left(s^\top h_j\right)}, \qquad c = \sum_i \alpha_i\, h_i,$$

where $s$ is the decoder state, $h_i$ are the encoder states, and $c$ is the resulting context vector.

The encoder state and decoder state have no direct equivalent in a Transformer, since attention is applied many times in both the encoder and the decoder. But you could imagine taking the dot product of the query embedding and each value embedding and doing something analogous to dot-product attention. Scaled dot-product attention is not that.
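
For reference, here is a minimal NumPy sketch of that classic (unscaled, projection-free) dot-product attention between a decoder state and a set of encoder states. The function and variable names are my own, not taken from any particular codebase:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(decoder_state, encoder_states):
    """Classic dot-product attention: no learned projections, no scaling.

    decoder_state:  shape (d,)   -- the current decoder hidden state
    encoder_states: shape (n, d) -- one hidden state per source token
    returns:        shape (d,)   -- the context vector
    """
    scores = encoder_states @ decoder_state   # (n,) raw dot-product scores
    weights = softmax(scores)                 # (n,) attention distribution
    return weights @ encoder_states           # (d,) weighted sum of encoder states

# toy usage
rng = np.random.default_rng(0)
context = dot_product_attention(rng.normal(size=64), rng.normal(size=(10, 64)))
print(context.shape)  # (64,)
```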

To compute a context vector using SDPA, we first project our query into a “query space” via a learned matrix $W^Q$, and our input sequence elements into both a “key space” (via a learned matrix $W^K$) and a “value space” (via a learned matrix $W^V$). We end up with a query vector $q$, a key vector $k$, and a value vector $v$. We pack all such vectors for the entire input sequence into matrices $Q$, $K$, and $V$ respectively. Then we compute

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

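To make the projections concrete, here is a minimal single-head NumPy sketch of the formula above; the randomly initialized `W_q`, `W_k`, and `W_v` are stand-ins for the learned projection matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Single-head SDPA over an input sequence X of shape (n, d_model)."""
    Q = X @ W_q                         # (n, d_k) queries
    K = X @ W_k                         # (n, d_k) keys
    V = X @ W_v                         # (n, d_v) values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) scaled dot products
    weights = softmax(scores, axis=-1)  # each row is an attention distribution
    return weights @ V                  # (n, d_v) one context vector per position

# toy usage: 10 tokens, d_model = 512, d_k = d_v = 64
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 512))
W_q, W_k, W_v = (rng.normal(size=(512, 64)) * 0.02 for _ in range(3))
print(scaled_dot_product_attention(X, W_q, W_k, W_v).shape)  # (10, 64)
```
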
The scaling factor $\frac{1}{\sqrt{d_k}}$ (where $d_k$ is the dimensionality of the key space) is there because the magnitude of the dot products grows with the dimensionality of the model. In high dimensions, the unscaled dot products push the softmax into regions where it saturates and its gradients become extremely small. The scaling factor fixes this. Still, this hardly seems like the key innovation here. To quote “Attention is All You Need”:

Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$.
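
As a quick numeric check of that scaling argument (my own toy experiment, not from the paper): for vectors with zero-mean, unit-variance components, the dot product has standard deviation of roughly $\sqrt{d_k}$, so unscaled logits grow with dimension while the scaled ones stay roughly constant:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=-1)  # raw dot products for 10,000 random pairs
    print(f"d_k={d_k:4d}  std(q.k)={dots.std():7.2f}  "
          f"std(q.k / sqrt(d_k))={(dots / np.sqrt(d_k)).std():4.2f}")
# Unscaled dot products grow like sqrt(d_k); the scaled versions stay O(1),
# keeping the softmax out of its saturated, vanishing-gradient regime.
```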

But where is the idea of a “key vector” in dot-product attention? By introducing this mechanism of indirection, they make it possible to compute attention on multiple independent feature spaces derived from the same sequence (called multi-head attention). They also recast the input sequence as a kind of key-value database, and attention as a kind of query over it. By doing so, they open the door to retrieval-augmented attention models.
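
To illustrate that point, here is a sketch of multi-head attention under my own simplifications (omitting the output projection $W^O$ and the usual reshaping tricks): it is just scaled dot-product attention run in several independently learned query/key/value spaces, with the per-head context vectors concatenated.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads):
    """heads: a list of (W_q, W_k, W_v) tuples, one independent feature space per head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])       # scaled dot products in this head's subspace
        outputs.append(softmax(scores, axis=-1) @ V)  # this head's context vectors
    return np.concatenate(outputs, axis=-1)           # (n, num_heads * d_v)

# toy usage: 8 heads of size 64 over a 512-dimensional model
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 512))
heads = [tuple(rng.normal(size=(512, 64)) * 0.02 for _ in range(3)) for _ in range(8)]
print(multi_head_attention(X, heads).shape)  # (10, 512)
```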

All of this is 100% novel in so-called “scaled dot-product attention.”

See implementation of scaled dot-product attention