Around each sublayer in every block of the transformer encoder and decoder, Vaswani et al. (2017) employ a few bells and whistles: a residual connection, layer normalization, and dropout. In The Annotated Transformer, these are captured in a class called SublayerConnection.

import torch.nn as nn

class SublayerConnection(nn.Module):
    """A residual connection around a sublayer, with layer normalization
    applied to the sublayer's input and dropout applied to its output."""
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)         # LayerNorm as in The Annotated Transformer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # normalize -> sublayer -> dropout -> residual add
        return x + self.dropout(sublayer(self.norm(x)))
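The snippet assumes a LayerNorm module is already in scope. In The Annotated Transformer it is a small custom module rather than PyTorch's built-in nn.LayerNorm (which would work just as well here); a minimal sketch along those lines:

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Normalize over the last dimension with a learned per-feature
    scale (a_2) and shift (b_2). A sketch; nn.LayerNorm(size) is a
    drop-in substitute."""
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2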

Let’s unwrap this.

  1. Layer-normalize the input to the connection, x.
  2. Apply the sublayer (e.g., self-attention or the feed-forward network) to the normalized input.
  3. Apply dropout to the sublayer’s output.
  4. Add the original input x back to the result; this is the residual connection (see the sketch after this list).
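To make the order of operations concrete, here is the forward pass unrolled into those four steps, followed by a hypothetical usage that wraps the position-wise feed-forward network (the sizes and names below are illustrative, not from the original post):

import torch
import torch.nn as nn

def sublayer_connection_unrolled(x, sublayer, norm, dropout):
    normed = norm(x)        # 1. layer-normalize the input
    out = sublayer(normed)  # 2. apply the sublayer
    out = dropout(out)      # 3. apply dropout
    return x + out          # 4. add the input back (residual)

# Illustrative usage: wrap a feed-forward sublayer.
d_model = 512
conn = SublayerConnection(size=d_model, dropout=0.1)
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
x = torch.randn(2, 10, d_model)  # (batch, sequence, d_model)
y = conn(x, ffn)                 # shape preserved: (2, 10, 512)

One subtlety worth noting: this is a “pre-norm” arrangement, normalizing before the sublayer, whereas Vaswani et al. (2017) describe the “post-norm” form LayerNorm(x + Sublayer(x)). The Annotated Transformer puts the norm first for code simplicity.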