Like basically all neural NLP models, the Transformer embeds its tokens into a space of much lower dimension than the vocabulary size. Interestingly, the model eschews semantically rich pretrained embeddings like word2vec in favor of a simple feedforward layer with a linear activation, learned from scratch along with the rest of the network. Let’s go through the code as it appears in the Annotated Transformer.

import math
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        # A learned lookup table mapping each of `vocab` token ids to a d_model-dimensional vector
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        # Look up embeddings for the integer token indices in x, then scale them by sqrt(d_model)
        return self.lut(x) * math.sqrt(self.d_model)
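
As a quick sanity check, here is a minimal usage sketch; the d_model, vocab, and batch sizes below are arbitrary values chosen for illustration, not anything prescribed by the Annotated Transformer.

import torch

emb = Embeddings(d_model=512, vocab=10000)  # hypothetical sizes, for illustration only
tokens = torch.randint(0, 10000, (2, 7))    # a batch of 2 sequences of 7 token ids each
vectors = emb(tokens)
print(vectors.shape)                        # torch.Size([2, 7, 512])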

The nn.Embedding class encapsulates a simple matrix whose elements are learned during training. Although conceptually similar to nn.Linear, it is optimized for lookups against a fixed set of integer inputs. As such, instead of expecting the input sequence as a stack of vocab-sized one-hot vectors, nn.Embedding expects it as a dense tensor of integer token indices.
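
To make the comparison with nn.Linear concrete, here is a small sketch (with arbitrary toy sizes) showing that looking up an index in nn.Embedding produces the same vector as multiplying the corresponding one-hot row by the same weight matrix.

import torch
import torch.nn.functional as F

vocab, d_model = 10, 4                         # toy sizes, for illustration only
lut = torch.nn.Embedding(vocab, d_model)

token_id = torch.tensor([3])                   # dense integer index
one_hot = F.one_hot(token_id, vocab).float()   # the equivalent vocab-sized one-hot vector

via_lookup = lut(token_id)                     # direct lookup, shape (1, d_model)
via_matmul = one_hot @ lut.weight              # the same result, computed as a matrix product
print(torch.allclose(via_lookup, via_matmul))  # True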

Notice that the forward pass scales the looked-up embeddings by $\sqrt{d_{\text{model}}}$. Theoretically, this accomplishes nothing, because the model could counteract any arbitrary scaling during training. However, in practice, there is an interesting consequence.

Consider an arbitrary transformation matrix $W$ whose weights are initialized i.i.d. according to some unknown distribution. A one-hot input simply selects one row of $W$; call that row $\mathbf{e}$, with components $w_1, \dots, w_{d_{\text{model}}}$. Then the average squared magnitude of a projected vector would be

$$\mathbb{E}\big[\lVert \mathbf{e} \rVert^2\big] = \mathbb{E}\left[\sum_{i=1}^{d_{\text{model}}} w_i^2\right].$$

By linearity of expectation, we can rewrite the righthand side as

$$\mathbb{E}\left[\sum_{i=1}^{d_{\text{model}}} w_i^2\right] = \sum_{i=1}^{d_{\text{model}}} \mathbb{E}\big[w_i^2\big] = d_{\text{model}} \cdot \mathbb{E}\big[w^2\big].$$

In the Transformer, these embedding weights are initialized according to the Gaussian $\mathcal{N}\!\big(0, \frac{1}{d_{\text{model}}}\big)$, so definitionally they have variance $\frac{1}{d_{\text{model}}}$. Now, variance is defined as

$$\operatorname{Var}(w) = \mathbb{E}\big[w^2\big] - \mathbb{E}[w]^2.$$

We can show that this leads to

$$\mathbb{E}\big[w^2\big] = \operatorname{Var}(w) + \mathbb{E}[w]^2,$$

and since $\mathbb{E}[w] = 0$, we see that $\mathbb{E}\big[w^2\big] = \operatorname{Var}(w) = \frac{1}{d_{\text{model}}}$. Hence

$$\mathbb{E}\big[\lVert \mathbf{e} \rVert^2\big] = d_{\text{model}} \cdot \frac{1}{d_{\text{model}}} = 1,$$
implying that the average magnitude of a projected vector would be roughly $1$, no matter how wide the model. Spread across $d_{\text{model}}$ dimensions, each individual feature value is then only on the order of $\frac{1}{\sqrt{d_{\text{model}}}}$. This means that, as we work with ever larger models, the initial feature values would be smaller and smaller on average. This, in turn, would mean smaller loss gradients, and hence slower convergence.
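
If you want to see this numerically, the following sketch (which simply assumes the $\mathcal{N}\!\big(0, \frac{1}{d_{\text{model}}}\big)$ initialization described above, with an arbitrary vocabulary size) estimates the average embedding norm and the typical per-feature value for two model widths.

import torch

def embedding_stats(d_model, vocab=10000):            # vocab size is arbitrary here
    # Draw weights i.i.d. from N(0, 1/d_model), as described above
    weight = torch.randn(vocab, d_model) / d_model ** 0.5
    avg_norm = weight.norm(dim=1).mean().item()        # average magnitude of one embedding vector
    feature_std = weight.std().item()                  # typical size of a single feature value
    return avg_norm, feature_std

for d_model in (64, 1024):
    avg_norm, feature_std = embedding_stats(d_model)
    print(d_model, round(avg_norm, 3), round(feature_std, 3))
# The norms stay near 1.0, while per-feature values shrink like 1/sqrt(d_model)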

By scaling these initial weights by $\sqrt{d_{\text{model}}}$, we instead ensure that $\operatorname{Var}\big(\sqrt{d_{\text{model}}}\, w\big) = d_{\text{model}} \cdot \frac{1}{d_{\text{model}}} = 1$, which is right where we want them for stable gradients.
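
A short continuation of the same sketch confirms that the $\sqrt{d_{\text{model}}}$ factor restores unit-scale features regardless of width, again assuming the initialization described above.

import torch

for d_model in (64, 1024):
    weight = torch.randn(10000, d_model) / d_model ** 0.5  # N(0, 1/d_model) init, as above
    scaled = weight * d_model ** 0.5                       # the sqrt(d_model) factor from forward()
    print(d_model, round(weight.std().item(), 3), round(scaled.std().item(), 3))
# Before scaling, the per-feature std shrinks with d_model; after scaling, it sits near 1.0 for both widths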