In the Annotated Transformer’s implementation of multi-head attention, the mask is given an extra dimension with mask = mask.unsqueeze(1) before it is handed to the attention function, so that the same mask can be applied to all h heads.
Inside the attention function, the mask is applied to the raw attention scores:
d_k: int = query.size(-1)
scores: Tensor = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
if mask is not None:
    scores = scores.masked_fill(mask == 0, -1e9)

query and key are both tensors with dimensions (nbatches, h, seq_len, d_k), and key.transpose(-2, -1) swaps the last two dimensions of key. torch.matmul does batched matrix multiplication: the leading dimensions are treated as batch dimensions and the matrix product is taken over the last two, so the result is a (nbatches, h, seq_len, seq_len) scores tensor.
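As a quick sanity check on those shapes, here is a minimal sketch with dummy tensors (the sizes are illustrative assumptions, not taken from the original model):

```python
import torch

nbatches, h, seq_len, d_k = 2, 8, 10, 64  # illustrative sizes
query = torch.randn(nbatches, h, seq_len, d_k)
key = torch.randn(nbatches, h, seq_len, d_k)

# Batched matmul: the leading (nbatches, h) dims act as batch dims,
# the matrix product runs over the trailing two dims.
scores = torch.matmul(query, key.transpose(-2, -1))
print(scores.shape)  # torch.Size([2, 8, 10, 10])
```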
We need to perform an element-wise operation between scores and mask. If mask were still a three-dimensional tensor, it could not be broadcast against the four-dimensional scores: PyTorch right-aligns shapes for broadcasting, so the mask's batch dimension would end up matched against the head dimension of scores, and the two are not compatible. But with the extra size-one dimension inserted by unsqueeze(1), the batch dimensions line up, the size-one dimension is expanded across all h heads, and masked_fill can set every masked position's score to -1e9 before the softmax.
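A minimal sketch of that broadcasting behaviour, assuming a padding-style mask of shape (nbatches, 1, seq_len) (the shapes and values here are illustrative assumptions, not the article's exact tensors):

```python
import torch

nbatches, h, seq_len = 2, 8, 10  # illustrative sizes
scores = torch.randn(nbatches, h, seq_len, seq_len)

# Padding mask: 1 for real tokens, 0 for padding positions.
mask = torch.ones(nbatches, 1, seq_len, dtype=torch.long)
mask[:, :, -3:] = 0  # pretend the last three positions are padding

# With these sizes, masked_fill on the 3-D mask would fail: right-aligned
# broadcasting would match nbatches (2) against the head dim h (8).
mask = mask.unsqueeze(1)                       # (nbatches, 1, 1, seq_len)
masked = scores.masked_fill(mask == 0, -1e9)   # broadcast over all h heads
print(masked.shape)                            # torch.Size([2, 8, 10, 10])
print(bool((masked[..., -3:] == -1e9).all()))  # True
```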