where $Q$, $K$, and $V$ are the matrices whose rows are the "query" vectors $q_i$, the "key" vectors $k_i$, and the "value" vectors $v_i$, respectively. (This is much more than a mere scaling of dot-product attention; see the discussion).
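For concreteness, here is a minimal NumPy sketch of scaled dot-product attention as defined above; the function and variable names are illustrative, not taken from the text:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n_queries, d_v)
```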
In self-attention, $Q$, $K$, and $V$ are all derived from the same sequence. They take different values because the sequence is projected through three different learned weight matrices, $W^Q$, $W^K$, and $W^V$ respectively.
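Continuing the sketch, self-attention obtains $Q$, $K$, and $V$ by projecting the same input sequence through three separate matrices. The weight matrices below are random placeholders purely for illustration; in a real model $W^Q$, $W^K$, and $W^V$ are learned parameters:

```python
rng = np.random.default_rng(0)

# Illustrative dimensions: sequence length n, model width d_model,
# key/query width d_k, value width d_v.
n, d_model, d_k, d_v = 5, 16, 8, 8
X = rng.normal(size=(n, d_model))        # one sequence of n token embeddings

# Three distinct projections of the same sequence X.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # queries, keys, values all come from X
out = scaled_dot_product_attention(Q, K, V)   # (n, d_v): one output vector per token
```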