In the context of self-attention in a transformer decoder block, “masking” refers to setting the attention score to −∞ (negative infinity) for every position that comes later in the sequence than the position currently being predicted.
Recall that language models are trained by comparing the model’s predicted output sequence to an expected (ground-truth) output sequence. The loss function for such a model is typically the sum, over the elements of the sequence, of the cross-entropy loss between the ground-truth label at each position and the predicted probability distribution for that position.
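As a concrete illustration, here is a minimal NumPy sketch of that summed per-token loss for a single sequence; the function name, array shapes, and the use of explicit probability distributions (rather than logits) are assumptions made for this example, not part of any particular library’s API:

```python
import numpy as np

def sequence_loss(pred_probs, targets):
    """Summed per-token cross-entropy for one sequence (illustrative sketch).

    pred_probs: (seq_len, vocab_size) predicted probability distribution
                over the vocabulary at each position.
    targets:    (seq_len,) integer index of the ground-truth token at
                each position.
    """
    # With a one-hot ground-truth label, the cross-entropy at a position
    # reduces to the negative log-probability assigned to the correct token.
    per_token = -np.log(pred_probs[np.arange(len(targets)), targets])

    # The loss for the whole sequence is the sum of the per-token losses.
    return per_token.sum()

# Tiny usage example with a 3-token vocabulary and a 2-token sequence:
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
print(sequence_loss(probs, np.array([0, 1])))  # ≈ 0.357 + 0.223 = 0.580
```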
In an encoder-decoder model with a unidirectional recurrent decoder architecture, there is no way for information from later positions in the output sequence to leak into the prediction at an earlier position: the recurrent decoder consumes the sequence strictly one token at a time, left to right. A transformer decoder, by contrast, computes self-attention over every position in the sequence at once, so without intervention each position could attend to the tokens that come after it and “cheat” during training.
By manually setting the attention score to −∞ for every position after the current one, the softmax assigns those positions an attention weight of exactly zero, so the decoder cannot draw on future tokens even though all positions are processed in parallel.
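A minimal NumPy sketch of this causal masking for a single attention head follows; the function name, the shapes, and the choice of plain NumPy rather than a deep-learning framework are assumptions made for illustration:

```python
import numpy as np

def masked_self_attention(q, k, v):
    """Scaled dot-product self-attention with a causal (look-ahead) mask.

    q, k, v: arrays of shape (seq_len, d) for one head and one sequence.
    """
    seq_len, d = q.shape

    # Raw attention scores between every pair of positions.
    scores = q @ k.T / np.sqrt(d)                      # (seq_len, seq_len)

    # Causal mask: position i may only attend to positions j <= i.
    # Scores above the diagonal are set to -inf so that the softmax
    # assigns those positions an attention weight of exactly zero.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax over the key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ v                                  # (seq_len, d)
```

Because the masked positions receive zero weight, each output row mixes information only from the current and earlier positions, which is exactly the constraint a unidirectional recurrent decoder enforces by construction.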
See also: Autoregression (“auto-regression”)