Multi-head attention runs multiple independent attention “blocks,” or heads, in each layer of a Transformer and then combines their outputs. Because each head has its own projection parameters, the heads end up learning different hidden feature spaces.
Recall that a context vector represents the expected value of the feature vector for a position in a sequence. Hence, having multiple feature spaces allows each attention head to capture a different aspect of the input sequence’s latent semantics.
As an informative oversimplification, we could imagine that one head would attend only to the tokens around the current position, capturing local context, while another head attends to related tokens elsewhere in the sequence.
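To make the combination of heads concrete, here is a minimal NumPy sketch of multi-head self-attention. It assumes the scaled dot-product formulation of Vaswani et al. (2017); the function and parameter names (`multi_head_attention`, `num_heads`, the projection matrices `W_q`, `W_k`, `W_v`, `W_o`) are illustrative rather than taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention over a sequence X of shape (seq_len, d_model).

    W_q, W_k, W_v: (d_model, d_model) projection matrices, split across heads.
    W_o:           (d_model, d_model) output projection that mixes the heads.
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Project the inputs, then split the feature dimension into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def split_heads(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    context = weights @ V                                 # (num_heads, seq_len, d_head)

    # Concatenate the heads back into (seq_len, d_model) and mix them.
    concat = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o
```

Each head sees only a `d_model / num_heads` slice of the projected features, which is how the per-head feature spaces stay separate; the final `W_o` projection then mixes the heads’ context vectors back into a single representation per position.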
Vaswani et al. (2017) implement multi-head attention in conjunction with either ordinary or masked self-attention. In that paper, each encoder layer consists of a multi-head self-attention sublayer followed by a position-wise feed-forward network, while each decoder layer consists of a masked multi-head self-attention sublayer, a multi-head attention sublayer over the encoder’s output, and a position-wise feed-forward network; every sublayer is wrapped in a residual connection followed by layer normalization.
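As a rough sketch of that layer structure, an encoder layer in the post-norm arrangement of the original paper could look like the following. This reuses the hypothetical `multi_head_attention` function above, and it omits the learned gain and bias of layer normalization as well as dropout; the names are again illustrative.

```python
def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: two linear maps with a ReLU in between.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(X, attn_params, ffn_params, num_heads):
    # Sublayer 1: multi-head self-attention, wrapped in a residual connection
    # and layer normalization.
    X = layer_norm(X + multi_head_attention(X, *attn_params, num_heads))
    # Sublayer 2: position-wise feed-forward network, wrapped the same way.
    X = layer_norm(X + feed_forward(X, *ffn_params))
    return X
```

A decoder layer would additionally apply a causal mask to the self-attention scores (setting disallowed positions to negative infinity before the softmax) and insert an encoder–decoder attention sublayer between the two sublayers shown here.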