The transformer model uses masking in two different contexts:
- Padded positions are masked to prevent the model from learning from positions after the end-of-sequence token, since they are not meaningful.
- During training, the decoder uses masking for self-attention (but not for cross-attention) in order to prevent data leakage.
Both of these issues arise from the fact that the transformer, unlike the RNNs that came before it, attends to the entire sequence at once.
The “source” mask is a padding mask: it marks which positions of the source sequence are real tokens and which are padding, so that attention ignores the padded positions.
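As a rough illustration, here is a minimal PyTorch-style sketch of such a padding mask. The pad index `PAD_IDX` and the helper name `make_src_mask` are placeholders for this example, not names from the code discussed here.

```python
import torch

# Hypothetical padding index; the real value depends on the vocabulary used.
PAD_IDX = 0

def make_src_mask(src, pad_idx=PAD_IDX):
    # src: (batch, src_len) tensor of token ids.
    # Returns a boolean mask of shape (batch, 1, src_len) that broadcasts
    # over the query dimension of the attention scores.
    return (src != pad_idx).unsqueeze(-2)

src = torch.tensor([[5, 7, 9, PAD_IDX, PAD_IDX]])
print(make_src_mask(src))
# tensor([[[ True,  True,  True, False, False]]])
```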
The “target” mask is a combination of two masks, built as follows (see the sketch after this list):
- We create a padding mask just as for the source mask, except based on the padding of the target sequence.
- We then create an upper-triangular matrix. Since the self-attention weight at position (i, j) represents the relevance of the j-th sequence element to the i-th sequence element, an upper-triangular mask tells the model that nothing after the current position is relevant.
- We bitwise-AND (1) and (2), i.e., the padding mask and the upper-triangular mask, to obtain the target mask.
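Putting the steps together, a minimal sketch of the target mask might look like the following. Again, `PAD_IDX`, `subsequent_mask`, and `make_tgt_mask` are illustrative names, not the exact code being described.

```python
import torch

PAD_IDX = 0  # hypothetical padding index

def subsequent_mask(size):
    # Boolean lower-triangular matrix: position i may attend to positions <= i,
    # which is equivalent to masking out the strictly upper triangle.
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

def make_tgt_mask(tgt, pad_idx=PAD_IDX):
    pad_mask = (tgt != pad_idx).unsqueeze(-2)  # (1) padding mask, (batch, 1, tgt_len)
    sub_mask = subsequent_mask(tgt.size(-1))   # (2) subsequent-position mask, (tgt_len, tgt_len)
    return pad_mask & sub_mask                 # bitwise-AND of (1) and (2), (batch, tgt_len, tgt_len)

tgt = torch.tensor([[2, 4, 6, PAD_IDX]])
print(make_tgt_mask(tgt))
```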
Note that there is no target mask at inference time, which is why the attention() method and the MultiHeadedAttention class’ forward() method must handle the situation where the mask is not present.
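For illustration, a scaled dot-product attention in the spirit of the attention() referenced above might guard against a missing mask like this. This is a sketch under those assumptions, not the exact implementation.

```python
import math
import torch

def attention(query, key, value, mask=None, dropout=None):
    # Scaled dot-product attention; query/key/value have shape (..., seq_len, d_k).
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:  # e.g. no target mask is supplied at inference time
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
```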