Multi-head attention refers to running multiple independent attention “blocks” in each layer of a Transformer and then combining their outputs. Each block has its own parameters and is trained independently of the others; consequently, the blocks end up learning different hidden feature spaces.
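To make the “independent blocks, then combine” idea concrete, here is a minimal NumPy sketch of multi-head self-attention. It is an illustration only: the shapes and the names `W_q`, `W_k`, `W_v`, `W_o`, `n_heads`, and `d_head` are assumptions for this sketch, not anything defined elsewhere in the text.

```python
# Minimal NumPy sketch of multi-head self-attention (illustrative, not a reference implementation).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (seq_len, d_model); each W_*: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # Project the inputs, then split each projection into independent heads.
    Q = (X @ W_q).reshape(seq_len, n_heads, d_head)
    K = (X @ W_k).reshape(seq_len, n_heads, d_head)
    V = (X @ W_v).reshape(seq_len, n_heads, d_head)

    heads = []
    for h in range(n_heads):
        # Each head computes scaled dot-product attention in its own subspace.
        scores = Q[:, h, :] @ K[:, h, :].T / np.sqrt(d_head)  # (seq_len, seq_len)
        weights = softmax(scores, axis=-1)                    # each row sums to 1
        heads.append(weights @ V[:, h, :])                    # per-head context vectors

    # Concatenate the per-head context vectors and mix them with an output projection.
    return np.concatenate(heads, axis=-1) @ W_o               # (seq_len, d_model)

# Example usage with random weights.
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 8, 10
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)  # (10, 64)
```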

Recall that a context vector represents the expected value of the feature vector for a position in a sequence. Hence, having multiple feature spaces allows each attention head to capture different aspects of the input sequence’s latent semantics.
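In symbols, with the superscript $(h)$ indexing heads (notation assumed here rather than fixed above), each head’s context vector is an attention-weighted average, i.e. an expected value, of that head’s value (feature) vectors:

$$
c_i^{(h)} \;=\; \sum_{j} \alpha_{ij}^{(h)}\, v_j^{(h)},
\qquad \alpha_{ij}^{(h)} \ge 0,\quad \sum_{j} \alpha_{ij}^{(h)} = 1 .
$$

Because each head has its own values $v_j^{(h)}$ and its own weights $\alpha_{ij}^{(h)}$, the heads compute this expectation in different feature spaces.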

As an informative oversimplification, we could imagine that one head would attend only to the tokens around the $i$-th token, and another would attend only to the $i$-th token itself. The former would impute a meaning for any token from its surroundings, including an unknown token; the latter would capture something close to a context-insensitive word embedding. (In practice, neither such representation is likely to come about.)
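In the notation above, these two hypothetical heads would correspond to two extreme attention patterns (again, an idealization rather than anything a trained model is claimed to produce), with $w$ a hypothetical window size:

$$
\alpha_{ij}^{(1)} \approx 0 \ \text{ unless } \ 0 < |i - j| \le w,
\qquad
\alpha_{ij}^{(2)} \approx \mathbb{1}[\, j = i \,],
$$

so that $c_i^{(1)}$ is built entirely from neighbouring tokens (and is therefore available even when token $i$ itself is unknown), while $c_i^{(2)} \approx v_i^{(2)}$ depends only on token $i$, much like an ordinary word embedding.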

Vaswani et al. (2017) implement multi-head attention in conjunction with both ordinary and masked self-attention. In that paper, every attention sublayer of each encoder and decoder layer is split into independent self-attention “heads.”
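For reference, the paper combines the heads by concatenating their outputs and applying a learned output projection:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\right),
$$

with $h = 8$ heads and per-head dimensions $d_k = d_v = d_{\text{model}}/h = 64$ in the base model.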