The Vaswani et al. (2017) Transformer is an encoder-decoder architecture that replaces recurrent neural networks with three different applications of attention: self-attention in the encoder, masked self-attention in the decoder, and cross-attention between the two. At a high level, the mechanism works as follows:

  1. As with most encoder-decoder models, the Transformer first maps token representations into embeddings.

  2. The embeddings are summed with a deterministic positional encoding, which allows the model to incorporate word order into its features (steps 1–2 are sketched in code after this list).

  3. The embeddings are fed into a stack of identical encoder blocks. Each block establishes long-range associations across the sequence using multi-head self-attention, then converts these into features using a simple position-wise feed-forward network (see the encoder-block sketch after this list).

  4. After each sub-layer, the model adds a residual connection and then applies layer normalization to the output. The same pattern applies in the decoder.

  5. The output of the last encoder block is fed into each of a series of decoder blocks. Like encoder blocks, the decoder blocks use attention to establish long-range associations, then use a feed-forward network to create features. However, each decoder block uses two forms of attention sequentially: a modified version of multi-head self-attention, called masked self-attention, over the decoder’s own inputs; and then cross-attention, also called “encoder-decoder attention,” over the encoder output (sketched after this list).

  6. The output of the last decoder block is linearly projected into the output vocabulary space. This projection is softmaxed to obtain a probability distribution over the vocabulary, from which the next token is sampled (greedy or beam search are common alternatives).

  7. Steps 5 and 6 are then repeated, with the decoder fed the sequence of all previously generated output tokens (the encoder output from steps 3–4 is computed once and reused at every step). The process ends when the end-of-sequence token is generated; a toy generation loop is sketched after this list.
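
To make the steps above concrete, here are a few minimal sketches in PyTorch. PyTorch is an illustrative choice, not something the paper prescribes; sizes such as `d_model = 512`, 8 heads, and a 2048-unit feed-forward layer match the base model of Vaswani et al. (2017), while names like `sinusoidal_positional_encoding`, `EncoderBlock`, `DecoderBlock`, and `generate` are hypothetical. First, steps 1–2: token embeddings summed with the deterministic sinusoidal positional encoding.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Deterministic sin/cos position encodings from Vaswani et al. (2017)."""
    position = torch.arange(seq_len).unsqueeze(1)                 # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                  # odd dimensions
    return pe

vocab_size, d_model = 10_000, 512                                 # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)                         # step 1: token -> vector

src_tokens = torch.tensor([[5, 42, 7, 9]])                        # (batch=1, seq_len=4)
x = embed(src_tokens) * math.sqrt(d_model)                        # scaling used in the paper
x = x + sinusoidal_positional_encoding(src_tokens.size(1), d_model)  # step 2: add word order
```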
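
Next, one encoder block (steps 3–4), continuing from the imports above and leaning on PyTorch’s built-in `nn.MultiheadAttention`. The ordering (sub-layer, then residual add, then LayerNorm) follows the original paper.

```python
class EncoderBlock(nn.Module):
    """One encoder block: multi-head self-attention + feed-forward network,
    each wrapped in a residual connection followed by layer normalization."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.self_attn(x, x, x)      # every position attends to every other
        x = self.norm1(x + attn_out)               # step 4: residual + layer norm
        x = self.norm2(x + self.ffn(x))            # feed-forward, then residual + layer norm
        return x

# The full encoder is a stack of identical blocks (6 in the base model).
encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])
```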
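
A decoder block (step 5) is the same pattern with one extra attention sub-layer: masked (causal) self-attention over the decoder’s own inputs, then cross-attention whose queries come from the decoder and whose keys and values come from the encoder output.

```python
class DecoderBlock(nn.Module):
    """One decoder block: masked self-attention, cross-attention over the
    encoder output, then a feed-forward network (residual + norm after each)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        seq_len = y.size(1)
        # Causal mask: position i may only attend to positions <= i.
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        attn_out, _ = self.self_attn(y, y, y, attn_mask=causal_mask)
        y = self.norm1(y + attn_out)
        # Cross-attention ("encoder-decoder attention"): keys/values from the encoder.
        attn_out, _ = self.cross_attn(y, enc_out, enc_out)
        y = self.norm2(y + attn_out)
        y = self.norm3(y + self.ffn(y))
        return y
```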
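
Finally, steps 6–7 as a toy generation loop that reuses the pieces above. The same embedding table is used for input and output tokens, as the paper ties these weights; the `BOS`/`EOS` ids and the sampling step are assumptions for illustration, and greedy or beam search could replace the `torch.multinomial` line.

```python
decoder_blocks = nn.ModuleList([DecoderBlock() for _ in range(6)])
project = nn.Linear(d_model, vocab_size)          # step 6: linear map into vocab space
BOS, EOS = 1, 2                                   # assumed special-token ids

@torch.no_grad()
def generate(src_tokens: torch.Tensor, max_len: int = 50) -> list[int]:
    # Steps 1-4 run once on the source sequence; the result is reused every step.
    src = embed(src_tokens) * math.sqrt(d_model)
    src = src + sinusoidal_positional_encoding(src_tokens.size(1), d_model)
    enc_out = encoder(src)

    out = [BOS]
    for _ in range(max_len):
        y = embed(torch.tensor([out]))                            # embed generated prefix
        y = y + sinusoidal_positional_encoding(len(out), d_model)
        for block in decoder_blocks:                              # step 5: decoder stack
            y = block(y, enc_out)
        probs = torch.softmax(project(y[:, -1]), dim=-1)          # step 6: softmax over vocab
        next_token = torch.multinomial(probs, 1).item()           # sample the next token
        out.append(next_token)
        if next_token == EOS:                                     # step 7: stop at EOS
            break
    return out
```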