Between the layers of each block in the transformer encoder and decoder, Vaswani et al. (2017) employ a few bells and whistles. In The Annotated Transformer, these are captured in a class called SublayerConnection.
```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        # LayerNorm is the custom module defined earlier in The Annotated Transformer.
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Normalize, apply the sublayer, drop out, then add the residual.
        return x + self.dropout(sublayer(self.norm(x)))
```

Let’s unwrap this.
- Layer-normalize the input to the connection, x.
- Apply the sublayer to the normalized input.
- Apply dropout to the sublayer’s output.
- Add the input (i.e., the residual) back to the result.
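To make the flow concrete, here is a minimal, self-contained sketch that unrolls the one-liner into those four steps and runs it on a toy feed-forward sublayer. It substitutes PyTorch’s built-in nn.LayerNorm for the custom LayerNorm class in The Annotated Transformer, and the size, dropout rate, and feed_forward module are illustrative assumptions, not part of the original code.

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super().__init__()
        # nn.LayerNorm stands in for the notebook's custom LayerNorm class.
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        normed = self.norm(x)    # 1. layer-normalize the input
        out = sublayer(normed)   # 2. apply the sublayer
        out = self.dropout(out)  # 3. apply dropout to its output
        return x + out           # 4. add the input (residual) back

# Toy usage: wrap a feed-forward sublayer around a random batch.
size = 8
connection = SublayerConnection(size, dropout=0.1)
feed_forward = nn.Sequential(
    nn.Linear(size, size), nn.ReLU(), nn.Linear(size, size)
)
x = torch.randn(2, 5, size)       # (batch, sequence, features)
y = connection(x, feed_forward)
print(y.shape)                    # torch.Size([2, 5, 8])
```

Because forward takes the sublayer as an argument, the same connection logic can wrap either a self-attention module or a feed-forward module without duplication.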