In The Annotated Transformer, the encoder is invoked as part of the EncoderDecoder module:
class EncoderDecoder(nn.Module):
    ...
    def forward(self, src, tgt, src_mask, tgt_mask):
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)
    ...
    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

The encoding process involves first embedding the source (input) sequence, then running it through the encoder stack. Notice also that the encoder is passed a src_mask. This deals with padding rather than preventing data leakage; see “Uses of masking in the encoder and decoder”.
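For concreteness, here is a minimal sketch of how such a padding mask can be built (in the same spirit as the notebook’s Batch class); the pad id and the example tensor are illustrative assumptions, not something fixed by the excerpt above:

    import torch

    pad = 0  # assumed padding token id (illustrative)
    src = torch.tensor([[5, 7, 2, pad, pad]])  # (batch, seq_len) of token ids
    src_mask = (src != pad).unsqueeze(-2)      # (batch, 1, seq_len), True at real tokens
    # Positions where src_mask is False have their attention scores pushed to -inf
    # before the softmax, so padded tokens end up with zero attention weight.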
The encoder itself is, at a high level, quite simple:
class Encoder(nn.Module):
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

As its name suggests, clones makes N copies of layer. So really all we do here is pass state from one encoder “layer” (really a block) to the next. The mask does not change, as it’s just used to zero out attention to the padded elements. At the end of the process, we do one more layer norm before handing the output off to the decoder.
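If you’re curious, clones is a tiny helper; a sketch along the lines of the notebook’s version (comments are mine):

    import copy
    import torch.nn as nn

    def clones(module, N):
        # Produce N independent copies of a module. deepcopy duplicates the
        # parameters, so the stacked layers do not share weights.
        return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])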
So what are these layers/blocks? So glad you asked!
class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

We never use size within this class itself — it’s only stored so that Encoder can construct LayerNorm(layer.size) — so you can mostly ignore it. What actually matters here is:
- Each layer consists of two steps: self-attention and then positionwise feed-forward.
- These two sublayers are each wrapped in a SublayerConnection, which supplies the residual connection, layer norm, and dropout.
That’s it! All the complexity is encapsulated in its dependencies (and in make_model).
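For reference, SublayerConnection is roughly the following (a sketch in the spirit of the notebook; it assumes the notebook’s LayerNorm class is in scope):

    class SublayerConnection(nn.Module):
        def __init__(self, size, dropout):
            super(SublayerConnection, self).__init__()
            self.norm = LayerNorm(size)          # the notebook's LayerNorm
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, sublayer):
            # Pre-norm residual: normalize, run the sublayer (self-attention or
            # feed-forward), apply dropout, and add the input back in.
            return x + self.dropout(sublayer(self.norm(x)))

Note that the norm is applied before the sublayer rather than after it, a simplification the notebook itself points out relative to the paper.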