In the context of self-attention in a transformer decoder block, “masking” refers to setting the attention scores to $-\infty$ for positions corresponding to future states.

Recall that language models are trained by comparing the model’s output sequence to an expected (ground-truth) output sequence. The loss function for such a model is typically the sum, over each position of the sequence, of the cross-entropy loss between the ground-truth label and the predicted probability distribution at that position.
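One common way to write this, with $y_t$ denoting the ground-truth token at position $t$ and $\hat{p}_t$ the predicted distribution over the vocabulary at that position, for a sequence of length $T$:

$$\mathcal{L} \;=\; \sum_{t=1}^{T} \operatorname{CE}\bigl(y_t, \hat{p}_t\bigr) \;=\; -\sum_{t=1}^{T} \log \hat{p}_t(y_t)$$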

In an encoder-decoder model with a unidirectional recurrent decoder architecture, there is no way for information from the $(t+1)$-th output token to leak into the model’s prediction for the $t$-th token. However, as self-attention attends to the entire sequence at once, the gradient of the loss with respect to the decoder’s final-layer outputs would, by default, incorporate information from the entire sequence, including future tokens.

By manually setting the attention scores to $-\infty$ for positions corresponding to future states, we ensure that their softmaxed weights will be $0$. Hence these positions will not be factored into the context vector for any position corresponding to a state before them.
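A minimal sketch of this masking step (PyTorch, with a single head and no batch dimension assumed for brevity; the function and variable names here are illustrative, not from any particular library):

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    # q, k, v: (seq_len, d_model) -- single head, unbatched, for brevity.
    seq_len, d_model = q.shape

    # Raw attention scores for every (query, key) pair.
    scores = (q @ k.transpose(-2, -1)) / d_model ** 0.5

    # Look-ahead mask: True wherever the key position lies strictly after
    # the query position (i.e. above the diagonal -- the "future" states).
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

    # Set those scores to -inf so that, after the softmax, their weights are 0.
    scores = scores.masked_fill(future, float("-inf"))

    weights = F.softmax(scores, dim=-1)  # each row sums to 1 over non-future positions
    return weights @ v                   # context vectors: position i mixes only v[0..i]

# Example: each output position depends only on itself and earlier positions.
q = k = v = torch.randn(5, 8)
context = causal_self_attention(q, k, v)
```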

See also: Autoregression (“auto-regression”)