In the Annotated Transformer’s implementation of multi-head attention, the mask argument is “unsqueezed” to add a length-1 dimension, making it a (B, 1, 1, L) tensor, where B is the batch size and L is the length of the input sequence. Why do we need this extra dimension?
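
As a rough sketch of the shapes involved (the token values and padding index below are made up, but the two unsqueeze calls mirror what the Annotated Transformer does when building the source mask and again inside multi-head attention):

import torch

B, L = 2, 5                              # batch size, sequence length
src = torch.tensor([[4, 7, 9, 1, 1],
                    [3, 8, 1, 1, 1]])    # 1 is the (made-up) padding index
src_mask = (src != 1).unsqueeze(-2)      # (B, 1, L) -- built once per batch
print(src_mask.shape)                    # torch.Size([2, 1, 5])
mask = src_mask.unsqueeze(1)             # (B, 1, 1, L) -- done inside multi-head attention
print(mask.shape)                        # torch.Size([2, 1, 1, 5])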

Inside the attention function, the mask is applied to the raw attention scores:

# inside attention(): query and key have shape (B, h, L, d_k); mask is (B, 1, 1, L)
d_k: int = query.size(-1)
scores: Tensor = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)  # (B, h, L, L)
if mask is not None:
    scores = scores.masked_fill(mask == 0, -1e9)  # hide positions where mask == 0

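As an aside on what the masked_fill line accomplishes: positions where the mask is 0 get a score of -1e9, so after the subsequent softmax they receive essentially zero attention weight. A toy example (the score and mask values are made up):

import torch

scores = torch.tensor([[2.0, 1.0, 0.5, 3.0]])   # raw scores for a single query
mask = torch.tensor([[1, 1, 1, 0]])             # last position is padding
masked = scores.masked_fill(mask == 0, -1e9)
print(masked)                # last score becomes -1e9
print(masked.softmax(-1))    # last weight is ~0; the rest renormalize
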
query and key are both tensors with dimensions (B, h, L, d_k), where h is the number of attention heads and d_k is the dimension of each head. We then transpose the last two dimensions of key, giving a (B, h, d_k, L) tensor. torch.matmul does batched matrix multiplication, where all dimensions except the last two are treated as “batch dimensions,” so we end up with a (B, h, L, L) scores tensor.
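
We can check this shape arithmetic with random tensors (B, h, L, and d_k below are just illustrative sizes; in the real model d_k would be d_model / h):

import math
import torch

B, h, L, d_k = 2, 8, 5, 64
query = torch.randn(B, h, L, d_k)
key = torch.randn(B, h, L, d_k)
scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
print(scores.shape)          # torch.Size([2, 8, 5, 5]) -- i.e. (B, h, L, L)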

We need to perform an element-wise operation between scores and mask. Broadcasting aligns shapes starting from the last dimension, so if mask were still (B, 1, L), it would line up against the (B, h, L, L) scores as (1, B, 1, L), and the batch dimension B would be matched against the head dimension h, which is not compatible (unless B happened to equal h). With the extra dimension, the (B, 1, 1, L) mask lines up correctly: the batch dimensions match, and the two size-1 dimensions broadcast across the h heads and the L query positions, so the same mask is applied to every head.
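
To see the difference concretely, we can try both mask shapes against a random scores tensor (same illustrative sizes as above):

import torch

B, h, L = 2, 8, 5
scores = torch.randn(B, h, L, L)
mask_3d = torch.ones(B, 1, L, dtype=torch.bool)   # the mask before the extra unsqueeze
mask_4d = mask_3d.unsqueeze(1)                    # (B, 1, 1, L)

try:
    scores.masked_fill(mask_3d == 0, -1e9)        # (B, 1, L) lines up as (1, B, 1, L): B meets h
except RuntimeError as err:
    print("3-D mask fails to broadcast:", err)

print(scores.masked_fill(mask_4d == 0, -1e9).shape)   # broadcasts to torch.Size([2, 8, 5, 5])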