Positional encoding is a strategy for injecting information about the position of elements in a sequence into their representations. Transformers require this information because attention is permutation-invariant: on its own, it does not encode order at all.
In Vaswani et al. (2017), the encoding alternates between a sequence based on sine and a sequence based on cosine across the embedding dimensions, both scaled to the dimension of the model (which, for the paper, is $d_{\text{model}} = 512$).
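Concretely, the paper defines the encoding for position $pos$ and dimension pair index $i$ as:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```

The code below computes exactly this, with the exponentiation done in log space.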
```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
```

## Understanding through simplification
This struck me as rather mysterious, so I chose just the sine sequence and simplified it to its essence. Apologies for abusing PEP-8 conventions, and please refer to the constants glossary. Let $L$ be the sequence length, $D_m$ the model dimension, and $\kappa$ the scaling factor ($10000$ in the paper).
```python
def simple_pe(L: int, D_m: int, kappa: float) -> torch.Tensor:
    position = torch.arange(0, L).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, D_m) * -(math.log(kappa) / D_m))
    pe = torch.sin(position * div_term)
    return pe
```

Let's now break this down.
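First, a quick sanity check of what `simple_pe` actually returns, shapes and ranges only (a minimal sketch, with $\kappa = 10000$ and small illustrative values for $L$ and $D_m$):

```python
import math

import torch


def simple_pe(L: int, D_m: int, kappa: float) -> torch.Tensor:
    position = torch.arange(0, L).unsqueeze(1)          # (L, 1) column of positions
    div_term = torch.exp(torch.arange(0, D_m) * -(math.log(kappa) / D_m))  # (D_m,)
    return torch.sin(position * div_term)               # broadcasts to (L, D_m)


pe = simple_pe(50, 16, 10000.0)
print(pe.shape)  # torch.Size([50, 16])
```

One row per position, one column per dimension, every entry in $[-1, 1]$, and the row for position 0 is all zeros (it is $\sin 0$).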
```python
position = torch.arange(0, L).unsqueeze(1)
```

We're going to want a column of position indices, one per element of the sequence.

Now, PyTorch has no concept of row and column vectors per se. So if you want a row vector, you're going to have to take your flat 1-D tensor and `unsqueeze(0)` it into shape $(1, L)$; for a column vector, `unsqueeze(1)` gives shape $(L, 1)$.

Hence, this line gives us a column vector [[0], [1], [2], ..., [L-1]].
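A minimal illustration of the `unsqueeze` trick; the outer-product broadcast at the end is exactly what `position * div_term` relies on later:

```python
import torch

v = torch.arange(0, 5)   # shape (5,): just a flat 1-D tensor
col = v.unsqueeze(1)     # shape (5, 1): a column vector
row = v.unsqueeze(0)     # shape (1, 5): a row vector

print(col.shape)         # torch.Size([5, 1])
print((col * row).shape) # torch.Size([5, 5]): broadcasting yields an outer product
```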
Now let’s consider the next two lines together.
```python
div_term = torch.exp(torch.arange(0, D_m) * -(math.log(kappa) / D_m))
pe = torch.sin(position * div_term)
```

Let's re-express this symbolically so that we can reason about it. Let $p$ be the sequence position and $i$ the dimension index. Then

$$\mathrm{pe}[p, i] = \sin\!\left(p \cdot e^{-i \ln \kappa / D_m}\right)$$
Recall that $e^{a \ln b} = b^{a}$.

So we can rewrite this as

$$\mathrm{pe}[p, i] = \sin\!\left(p \cdot \kappa^{-i / D_m}\right)$$
Obviously, a negative fractional exponent is awkward to reason about. We can express the negative exponent as the reciprocal of the positive exponent:

$$\kappa^{-i / D_m} = \frac{1}{\kappa^{i / D_m}}$$

Substituting this in, we obtain a (somewhat) friendlier expression:

$$\mathrm{pe}[p, i] = \sin\!\left(\frac{p}{\kappa^{i / D_m}}\right)$$
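We can confirm numerically that the log-space form the code computes and the friendlier form are the same quantity (a quick check, using the paper's $\kappa = 10000$ and an illustrative $D_m = 16$):

```python
import math

kappa, D_m = 10000.0, 16
for i in range(D_m):
    as_computed = math.exp(i * -(math.log(kappa) / D_m))  # the code's log-space form
    friendlier = 1.0 / kappa ** (i / D_m)                 # the rewritten form
    assert math.isclose(as_computed, friendlier)
```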
Now let's get a handle on the period of this function. Recall that the period of a sine or cosine is $2\pi$ divided by its angular frequency: $\sin(\omega p)$ repeats every $T = 2\pi / \omega$. Here the frequency in dimension $i$ is $\omega = \kappa^{-i / D_m}$.

Solving for $T$:

$$T = \frac{2\pi}{\kappa^{-i / D_m}} = 2\pi \, \kappa^{i / D_m}$$

In other words, holding the dimension $i$ fixed, the sine in that dimension oscillates along the sequence with period $2\pi \kappa^{i / D_m}$: the periods grow geometrically with $i$, from $2\pi$ at $i = 0$ up to nearly $2\pi \kappa$ at $i = D_m - 1$.
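A quick numeric look at that geometric growth (a sketch using $\kappa = 10000$ and a small, illustrative $D_m = 8$):

```python
import math

kappa, D_m = 10000.0, 8
periods = [2 * math.pi * kappa ** (i / D_m) for i in range(D_m)]
for i, T in enumerate(periods):
    print(f"dim {i}: period = {T:.2f}")
# Each dimension's period is a constant factor kappa ** (1 / D_m) times the
# previous one -- geometric growth from 2*pi up toward 2*pi*kappa.
```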
This indicates that for any given sequence position