Encoder-decoder architectures, as the name suggests, consist of two major subnetworks. The first “encodes” an input sequence into a series of representations, at least one of which is then supplied to a “decoder” that generates output tokens.
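That split can be made concrete with a short sketch. Below is a minimal PyTorch rendering of the contract, assuming the encoder and decoder are supplied as separate modules; the class name, argument names, and the tensor shapes in the comments are illustrative assumptions, not anything fixed by the architecture itself.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Wires an arbitrary encoder to an arbitrary decoder (illustrative sketch)."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # input tokens -> sequence of representations
        self.decoder = decoder  # representations + previously generated tokens -> next-token logits

    def forward(self, src_tokens: torch.Tensor, tgt_prefix: torch.Tensor) -> torch.Tensor:
        representations = self.encoder(src_tokens)        # (batch, src_len, d_model)
        return self.decoder(tgt_prefix, representations)  # (batch, tgt_len, vocab_size)
```

Concrete choices for the two submodules are sketched in the paragraphs that follow.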
The original encoder-decoder model consisted of two recurrent neural networks, coupled only by the final hidden state of the encoder network. This single, fixed-length context vector contained everything the decoder knew about the input. That is less than ideal for machine translation: an entire source sentence must be compressed into one vector, and direct word correspondences between input and output become impossible to detect.
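A sketch of that coupling, assuming GRU cells and PyTorch (the cell type, layer sizes, and names are assumptions chosen for illustration): the encoder's final hidden state is the only thing the decoder ever sees about the source sentence.

```python
import torch
import torch.nn as nn

class RnnSeq2Seq(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int, hidden: int = 256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_tokens: torch.Tensor, tgt_prefix: torch.Tensor) -> torch.Tensor:
        # Encode, keeping only the final hidden state: the single context vector.
        _, context = self.encoder(self.src_embed(src_tokens))       # (1, batch, hidden)
        # Decode, using that vector as the initial state; the decoder has no other view of the source.
        states, _ = self.decoder(self.tgt_embed(tgt_prefix), context)
        return self.out(states)                                      # (batch, tgt_len, tgt_vocab)

model = RnnSeq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```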
To solve this alignment problem, Bahdanau, Cho, and Bengio (2014) introduced an attention layer while retaining the two RNNs. Their primary goal was to provide the decoder with position-wise context: at each output step the decoder computes a weighted combination of all encoder states, so it simultaneously translates and aligns.
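The core of that attention layer can be sketched as follows, using additive scoring in the style of the Bahdanau et al. formulation; the dimensions and parameter names are illustrative assumptions, and the surrounding RNN plumbing is omitted. The current decoder state is scored against every encoder state, the scores are normalized with a softmax, and the resulting weights produce the context vector while also acting as a soft alignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.w_query = nn.Linear(hidden, hidden, bias=False)  # transforms the decoder state
        self.w_key = nn.Linear(hidden, hidden, bias=False)    # transforms the encoder states
        self.v = nn.Linear(hidden, 1, bias=False)             # produces one score per source position

    def forward(self, dec_state: torch.Tensor, enc_states: torch.Tensor):
        # dec_state:  (batch, hidden), the decoder state at the current output step
        # enc_states: (batch, src_len, hidden), one representation per input token
        scores = self.v(torch.tanh(
            self.w_query(dec_state).unsqueeze(1) + self.w_key(enc_states)
        )).squeeze(-1)                                   # (batch, src_len)
        weights = F.softmax(scores, dim=-1)              # soft alignment over source positions
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)  # (batch, hidden)
        return context, weights

attn = AdditiveAttention()
context, weights = attn(torch.randn(2, 256), torch.randn(2, 9, 256))
print(context.shape, weights.shape)  # torch.Size([2, 256]) torch.Size([2, 9])
```

Because the weights are explicit, they can be inspected directly, which is what makes the learned word alignments visible.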
Vaswani et al. (2017) recognized that this same mechanism could encode an arbitrarily rich set of associations between input tokens, and introduced the Transformer model, which replaces the RNNs with “blocks” consisting of an attention mechanism plus a feed-forward network, but retains the overall encoder-decoder architecture.
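One such block, in its encoder form, can be sketched as below (a post-norm arrangement with hypothetical sizes). The decoder block additionally inserts a cross-attention sublayer over the encoder's outputs, which is how the encoder-decoder coupling is preserved.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq_len, d_model)
        # Self-attention: every token attends to every other token in the sequence.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network, applied to each token independently.
        return self.norm2(x + self.ff(x))

block = TransformerEncoderBlock()
print(block(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```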