Like basically all neural NLP models, the Transformer embeds its tokens into a space of much lower dimension than the vocabulary size. Interestingly, the model eschews semantically rich pretrained embeddings like word2vec in favor of a simple linear projection whose weights are learned jointly with the rest of the network. Let’s go through the code as it appears in the Annotated Transformer.
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        # Lookup table mapping each vocabulary index to a d_model-dimensional vector.
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        # Look up the embeddings and scale them by sqrt(d_model).
        return self.lut(x) * math.sqrt(self.d_model)

The nn.Embedding class encapsulates a simple matrix whose elements are learned during training. Although conceptually similar to nn.Linear, it is optimized for lookups against a fixed set of integer inputs. As such, instead of expecting a sparse matrix of one-hot vectors, nn.Embedding expects a dense tensor of integer indices.
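To make that distinction concrete, here is a small sketch (the vocabulary size, model dimension, and token ids are made up for illustration) showing that indexing into an nn.Embedding returns the same vectors as multiplying one-hot rows against the same weight matrix, which is what an equivalent bias-free nn.Linear would compute:

import torch
import torch.nn as nn

vocab, d_model = 10, 4                     # toy sizes for illustration
emb = nn.Embedding(vocab, d_model)

ids = torch.tensor([3, 7])                 # dense integer indices, one per token
via_lookup = emb(ids)                      # shape (2, d_model)

# The same computation with explicit one-hot vectors and a matrix product.
one_hot = nn.functional.one_hot(ids, num_classes=vocab).float()  # shape (2, vocab)
via_matmul = one_hot @ emb.weight                                # shape (2, d_model)

print(torch.allclose(via_lookup, via_matmul))  # True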
Notice that the forward pass effectively scales the embedding matrix by $\sqrt{d_{\text{model}}}$.
Consider an arbitrary transformation matrix $W \in \mathbb{R}^{V \times d_{\text{model}}}$ whose weights $w_{ij}$ are initialized i.i.d. according to an unknown distribution function. Then the average (squared) magnitude of a projected vector is

$$\mathbb{E}\big[\lVert e_i^{\top} W \rVert^{2}\big] = \mathbb{E}\Big[\sum_{j=1}^{d_{\text{model}}} w_{ij}^{2}\Big],$$

where $e_i$ is the one-hot vector selecting token $i$, so that $e_i^{\top} W$ is simply row $i$ of $W$.
By linearity of expectation, we can rewrite the righthand side as

$$\sum_{j=1}^{d_{\text{model}}} \mathbb{E}\big[w_{ij}^{2}\big] = d_{\text{model}} \cdot \mathbb{E}\big[w^{2}\big],$$

where $w$ denotes a single weight drawn from the initialization distribution.
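As a quick sanity check on this identity, the following sketch (the sizes and the scale sigma are arbitrary choices for illustration) estimates the average squared row norm of a randomly initialized matrix and compares it against $d_{\text{model}} \cdot \mathbb{E}[w^{2}]$:

import torch

vocab, d_model, sigma = 1000, 512, 0.3     # arbitrary sizes and scale
W = sigma * torch.randn(vocab, d_model)    # i.i.d. weights with E[w^2] = sigma^2

# Each row of W is what projecting a one-hot vector e_i produces.
avg_sq_norm = W.pow(2).sum(dim=1).mean()
print(avg_sq_norm.item())                  # roughly 46.1
print(d_model * sigma**2)                  # 46.08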
The weights of nn.Embedding are initialized according to the Gaussian $\mathcal{N}(0, 1)$. We can show that this leads to

$$\mathbb{E}\big[\lVert e_i^{\top} W \rVert^{2}\big] = d_{\text{model}} \cdot \big(\operatorname{Var}(w) + \mathbb{E}[w]^{2}\big) = d_{\text{model}},$$

and since the norm of a high-dimensional random vector concentrates around the square root of its expected squared norm,

$$\mathbb{E}\big[\lVert e_i^{\top} W \rVert\big] \approx \sqrt{\mathbb{E}\big[\lVert e_i^{\top} W \rVert^{2}\big]},$$

it follows that the average magnitude of a projected vector is

$$\mathbb{E}\big[\lVert e_i^{\top} W \rVert\big] \approx \sqrt{d_{\text{model}}}.$$
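The same kind of empirical check (again with made-up sizes) confirms that, under the default $\mathcal{N}(0, 1)$ initialization, the rows of nn.Embedding have an average norm close to $\sqrt{d_{\text{model}}}$:

import math
import torch
import torch.nn as nn

vocab, d_model = 1000, 512
emb = nn.Embedding(vocab, d_model)           # default init: N(0, 1)

print(emb.weight.norm(dim=1).mean().item())  # roughly 22.6
print(math.sqrt(d_model))                    # 22.63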
By scaling these initial weights by $\sqrt{d_{\text{model}}}$, the model increases the expected magnitude of each embedded token from roughly $\sqrt{d_{\text{model}}}$ to roughly $d_{\text{model}}$.
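Putting it together with the Embeddings module from the top of the section (toy sizes again, and assuming its imports of math and torch.nn are already in scope as in the Annotated Transformer), the scaled output indeed has an average magnitude of roughly $d_{\text{model}}$:

import torch

vocab, d_model = 1000, 512
embed = Embeddings(d_model, vocab)         # the class defined above

ids = torch.randint(0, vocab, (8,))        # a batch of 8 random token ids
out = embed(ids)                           # lookup followed by * sqrt(d_model)
print(out.norm(dim=1).mean().item())       # roughly 512, i.e. about d_model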