The term transformer block is a somewhat loose piece of jargon that has emerged since Vaswani et al. (2017) introduced the architecture. That original transformer consists of two “stacks” of such blocks: an encoder and a decoder. The blocks in the two stacks differ in that the former are bidirectional and the latter unidirectional, but they otherwise have much in common. Today, the term refers to the subunits of any conceptually similar model.

The canonical encoder and decoder transformer blocks are shown in Figure 1 of the Vaswani paper.

Both blocks accept an input embedding. For each attention head, this embedding is projected into three vector spaces: query (Q), key (K), and value (V). Attention is computed within each head, and the results from all heads are concatenated. This concatenated feature vector is then fed into a position-wise feed-forward network that learns non-linear dependencies within and between these features.
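As a rough sketch of that path (assuming PyTorch, with illustrative hyperparameters d_model=512 and n_heads=8 that are not specified above), the per-head projections, attention, and concatenation might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    """Illustrative sketch: project, attend within each head, concatenate."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One projection each for query, key, and value; split into heads below.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # applied after concatenation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention within each head.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        per_head = weights @ v  # (batch, heads, seq, d_head)

        # Concatenate the heads and mix them with a final projection.
        concat = per_head.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.w_o(concat)
```

In the actual blocks, the position-wise feed-forward network follows as its own sublayer rather than being folded into the attention module.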

Because these blocks tend to be stacked quite deep, they employ multiple mechanisms to prevent gradient instability. Figure 1 clearly shows skip connections and layer normalization (together, “Add & Norm”). Not shown are dropout and attention scaling, though both are used as well.
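A minimal sketch of the “Add & Norm” wrapper, assuming PyTorch and the post-norm ordering of the original paper (the dropout rate of 0.1 is an illustrative choice):

```python
import torch
import torch.nn as nn


class SublayerConnection(nn.Module):
    """Residual skip connection around a sublayer, with dropout applied to the
    sublayer's output and layer normalization applied to the sum."""

    def __init__(self, d_model: int = 512, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # x + Dropout(Sublayer(x)), then LayerNorm over the sum ("Add & Norm").
        return self.norm(x + self.dropout(sublayer(x)))
```

Attention scaling is simply the division of the dot-product scores by the square root of the per-head dimension, which already appears in the attention sketch above.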

The encoder and decoder blocks differ in their attention mechanisms.

The encoder block allows every non-padding position in the input sequence to attend to every other position (self-attention without masking). The resulting global context makes it much easier for the encoder to learn interdependencies, though not sequential dependencies. BERT’s masked language modeling task leverages this contextual understanding to greatly outperform its conceptual predecessor, the continuous bag-of-words model.
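To make the “non-padding” qualifier concrete, here is a small sketch (PyTorch, with 0 assumed as the padding id purely for illustration) of a padding mask: padding keys are excluded from attention, but every real position can still attend to every other position in either direction:

```python
import torch

# Hypothetical batch of token ids; 0 is assumed to be the padding id.
tokens = torch.tensor([[5, 9, 2, 0, 0],
                       [7, 3, 8, 4, 1]])
pad_mask = tokens.eq(0)                 # True where the position is padding
attn_mask = pad_mask[:, None, None, :]  # broadcast over heads and query positions

scores = torch.randn(2, 8, 5, 5)        # stand-in scores: (batch, heads, query, key)
scores = scores.masked_fill(attn_mask, float("-inf"))
weights = scores.softmax(dim=-1)        # every real position weights every other
```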

The decoder block masks all target positions subsequent to the query. (In encoder-decoder models such as the original Vaswani architecture, though, every position has unmasked access to the entire input sequence via cross-attention.) This forces the decoder to discover sequential dependencies between the preceding elements of a sequence and the next one. Decoder-only models such as GPT do not require cross-attention at all; during training, the causal mask lets them compute attention for every position simultaneously with a single matrix operation rather than decoding sequentially.
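As a sketch of what that single matrix operation looks like in practice (PyTorch, with an arbitrary sequence length), the causal mask is just an upper-triangular boolean matrix applied to the score matrix for all positions at once:

```python
import torch

seq_len = 5
# True above the diagonal: the "future" positions each query must not see.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(seq_len, seq_len)  # stand-in scores: (query, key)
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = scores.softmax(dim=-1)        # row i only weights positions 0..i
```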