The transformer model uses masking in two different contexts: to keep padding tokens from influencing attention, and to keep the decoder from attending to positions that come after the one it is currently predicting.

Both of these issues arise from the fact that the transformer, unlike the RNNs that came before it, attends to the entire sequence at once.

The “source” mask is a binary tensor that, for each source sequence in the batch, holds a 1 for every non-pad token and a 0 for every pad token. The additional “unsqueezed” dimension ensures that, after broadcasting, it has the same rank as the “target” mask (because the same code handles all forms of attention). It is initialized at the start of an inference and remains unchanged for the lifetime of that inference. In the encoder it is used for self-attention; in the decoder it is used for cross-attention.
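
For concreteness, here is a minimal PyTorch sketch of how such a source mask might be built; the function name `make_src_mask` and the `pad_idx` argument are illustrative, not part of the original code:

```python
import torch

def make_src_mask(src: torch.Tensor, pad_idx: int) -> torch.Tensor:
    # src: (batch_size, src_len) of token indices.
    # 1 (True) for real tokens, 0 (False) for padding; unsqueeze(-2) adds the
    # extra dimension so the mask has the same rank as the target mask and
    # broadcasts across query positions in the attention scores.
    return (src != pad_idx).unsqueeze(-2)  # shape: (batch_size, 1, src_len)
```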

The “target” mask is a binary tensor, which is likewise constant throughout the decoding process. To construct the mask for each target sequence in the batch (a sketch follows the list):

  1. We create a padding mask just as for the source mask, except based on the padding of the target sequence.
  2. We then create a causal mask: the inverse of a strictly upper triangular matrix, i.e. ones on and below the diagonal. Since the self-attention weight at position (i, j) represents the relevance of the j-th sequence element to the i-th sequence element, this mask tells the model that nothing after the current position is relevant.
  3. We bitwise-AND (1) and (2) to obtain the target mask.
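
Putting the three steps together, a minimal PyTorch sketch might look like this (again, `make_tgt_mask` and `pad_idx` are illustrative names, not the original code):

```python
import torch

def make_tgt_mask(tgt: torch.Tensor, pad_idx: int) -> torch.Tensor:
    # tgt: (batch_size, tgt_len) of token indices.
    tgt_len = tgt.size(-1)
    # (1) Padding mask, as for the source: (batch_size, 1, tgt_len).
    pad_mask = (tgt != pad_idx).unsqueeze(-2)
    # (2) Causal mask: invert a strictly upper triangular matrix so that
    #     only positions j <= i are marked attendable: (1, tgt_len, tgt_len).
    causal_mask = ~torch.triu(
        torch.ones(1, tgt_len, tgt_len, dtype=torch.bool, device=tgt.device),
        diagonal=1,
    )
    # (3) Bitwise-AND the two: position (i, j) is attendable only if token j
    #     is not padding and j <= i. Result: (batch_size, tgt_len, tgt_len).
    return pad_mask & causal_mask
```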

Note that there is no target mask at inference time, which is why the attention() method and the MultiHeadedAttention class’s forward() method must handle the case where no mask is present.
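
For illustration, a scaled dot-product attention() in this style might guard against a missing mask as follows; this is a sketch under the usual conventions, not the exact implementation:

```python
import math

import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    # query, key, value: (batch_size, heads, seq_len, d_k).
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Wherever the mask is 0, push the score to a large negative value
        # so that softmax assigns it (effectively) zero weight.
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    if dropout is not None:
        weights = dropout(weights)
    return torch.matmul(weights, value), weights
```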