Positional encoding is a strategy for injecting information about the relative position of elements in a sequence into their representation. Transformers require this information because attention does not inherently encode it (other than for the first and last tokens).

Vaswani et al. (2017) alternate between a sequence based on sine and a sequence based on cosine, both scaled to the dimension of the model (which, for the base model in the paper, is $d_{\text{model}} = 512$). The Annotated Transformer has it as

class PositionalEncoding(nn.Module):
    "Implement the PE function."
 
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
 
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)
 
    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
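
To see how this module is meant to be used, here is a minimal sketch (the vocabulary size, model dimension, batch size, and sequence length are arbitrary illustrative values, not choices from the paper): we embed a batch of token indices and pass the result through PositionalEncoding, which adds the precomputed encodings and applies dropout.

import torch
import torch.nn as nn

d_model, vocab_size, batch_size, seq_len = 16, 100, 2, 10     # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)                     # token embedding layer
pos_enc = PositionalEncoding(d_model, dropout=0.1)            # the module defined above
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))  # a batch of token indices
x = pos_enc(embed(tokens))                                    # shape: (batch_size, seq_len, d_model)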

Understanding through simplification

This struck me as rather mysterious, so I chose just the sine sequence and simplified it to its essence. Apologies for abusing PEP-8 conventions, and please refer to the constants glossary. Let $\kappa$ be our scaling factor (the original uses $\kappa = 10000$). Then we can distill the above to

def simple_pe(L: int, D_m: int, kappa: float) -> torch.Tensor:
    position = torch.arange(0, L).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, D_m) * -(math.log(kappa) / D_m))
    pe = torch.sin(position * div_term)
 
    return pe
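
As a quick sanity check (the arguments here are arbitrary illustrative values), we can call the simplified function, confirm the output shape, and confirm that the row for position 0 is all zeros, since $\sin(0) = 0$:

pe = simple_pe(L=8, D_m=4, kappa=10000.0)
print(pe.shape)  # torch.Size([8, 4]): one row per position, one column per dimension
print(pe[0])     # tensor([0., 0., 0., 0.]), because sin(0) = 0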

Let’s now break this down.

position = torch.arange(0, L).unsqueeze(1)

We’re going to want an $L \times D_m$ matrix, because we’re going to add these positional encodings element-wise to each of our $D_m$-dimensional hidden states. This implies that we’re going to want to take some kind of outer product of a row vector and a column vector.

Now, PyTorch has no concept of row and column vectors per se. So if you want a column vector, you’re going to have to take your $L$-dimensional tensor and turn it into an $L \times 1$ matrix, which is all a column vector actually is anyway. tensor.unsqueeze adds a trivial dimension at the specified position in a tensor’s shape.

Hence, this line gives us an $L \times 1$ matrix of sequential integers, [[0], [1], [2], ..., [L-1]].
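
Here is a tiny illustration (using $L = 4$ purely as an example):

position = torch.arange(0, 4)     # tensor([0, 1, 2, 3]), shape (4,)
position = position.unsqueeze(1)  # shape (4, 1), i.e. a column vector:
                                  # tensor([[0],
                                  #         [1],
                                  #         [2],
                                  #         [3]])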

Now let’s consider the next two lines together.

div_term = torch.exp(torch.arange(0, D_m) * -(math.log(kappa) / D_m))
pe = torch.sin(position * div_term)
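
Before unpacking the math, it’s worth noting what the shapes are doing: position has shape $(L, 1)$ and div_term has shape $(D_m,)$, so position * div_term broadcasts to an $L \times D_m$ matrix whose $(p, d)$ entry is position $p$ times the $d$-th scaling term, exactly the outer-product structure we wanted. A small shape check, with arbitrary illustrative sizes:

L, D_m, kappa = 6, 4, 10000.0                # illustrative values
position = torch.arange(0, L).unsqueeze(1)   # shape (6, 1)
div_term = torch.exp(torch.arange(0, D_m) * -(math.log(kappa) / D_m))  # shape (4,)
print((position * div_term).shape)           # torch.Size([6, 4])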

Let’s re-express this symbolically so that we can reason about it. Let $\mathrm{PE}_{p,d}$ be the positional encoding for a position $p$ and an embedding dimension $d$. Then the above is equivalent to

$$\mathrm{PE}_{p,d} = \sin\!\left(p \cdot e^{-d \ln(\kappa) / D_m}\right)$$

Recall that

$$e^{xy} = \left(e^{x}\right)^{y}$$

So we can rewrite this as

$$\mathrm{PE}_{p,d} = \sin\!\left(p \cdot \left(e^{\ln(\kappa)}\right)^{-d / D_m}\right)$$

Obviously, $e^{\ln(\kappa)} = \kappa$. So we can simplify to

$$\mathrm{PE}_{p,d} = \sin\!\left(p \cdot \kappa^{-d / D_m}\right)$$

We can express the negative exponent as the reciprocal of the positive exponent:

$$\kappa^{-d / D_m} = \frac{1}{\kappa^{d / D_m}}$$

Substituting this in, we obtain a (somewhat) friendlier expression:

$$\mathrm{PE}_{p,d} = \sin\!\left(\frac{p}{\kappa^{d / D_m}}\right)$$
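
To convince yourself that the closed form really is what the code computes, here is a quick numerical comparison (the sizes are again arbitrary illustrative values):

L, D_m, kappa = 6, 4, 10000.0
p = torch.arange(0, L).unsqueeze(1).float()      # positions, shape (6, 1)
d = torch.arange(0, D_m).float()                 # dimension indices, shape (4,)
from_code = simple_pe(L, D_m, kappa)             # the simplified function above
closed_form = torch.sin(p / kappa ** (d / D_m))  # sin(p / kappa^(d / D_m))
print(torch.allclose(from_code, closed_form, atol=1e-6))  # True
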
Now let’s get a handle on the period of this function. Recall that the period of a sine or cosine is $2\pi$. So the period of $\mathrm{PE}_{p,d}$ is the value of $p$ at which the quantity inside the parentheses reaches $2\pi$:

$$\frac{p}{\kappa^{d / D_m}} = 2\pi$$

Solving for $p$, we see that the period is

$$p = 2\pi \kappa^{d / D_m}$$

In other words, holding $d$ constant, the period of $\mathrm{PE}_{p,d}$ with respect to $p$ is

$$2\pi \kappa^{d / D_m}$$
This indicates that, for any given sequence position $p$, the period of its positional encoding grows geometrically with the dimension index $d$: each dimension’s period is a constant factor of $\kappa^{1/D_m}$ longer than the dimension before, so the periods range from $2\pi$ at $d = 0$ up to nearly $2\pi\kappa$ at the highest dimension.
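
To make the growth concrete, here is a small computation of the period for each dimension (with arbitrary illustrative sizes); each successive period is longer than the last by the same constant factor:

D_m, kappa = 8, 10000.0
d = torch.arange(0, D_m).float()
periods = 2 * math.pi * kappa ** (d / D_m)  # period of PE_{p,d} along p, for each d
print(periods)                     # grows geometrically, starting at 2*pi ~= 6.28
print(periods[1:] / periods[:-1])  # constant ratio kappa^(1/D_m) ~= 3.16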