Basic idea

Cho et al. (2014) introduced the Encoder-Decoder Model, an autoregressive neural network with three layers: an encoder (whose hidden states are n-dimensional), an n-to-m projection layer, and a decoder (whose hidden states are m-dimensional).
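
A minimal sketch of these three layers in PyTorch (an assumption of this example, as is the use of GRU cells) might look like the following; vocab_size, emb_dim, n, and m are placeholder names, and the embedding matrix is discussed next:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Sketch of the three layers described above (all names are placeholders)."""
    def __init__(self, vocab_size, emb_dim, n, m):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # learned embedding matrix
        self.encoder = nn.GRUCell(emb_dim, n)     # encoder: n-dimensional hidden states
        self.projection = nn.Linear(n, m)         # n-to-m projection layer
        self.decoder = nn.GRUCell(emb_dim, m)     # decoder: m-dimensional hidden states
        self.output = nn.Linear(m, vocab_size)    # hidden state -> token logits
```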

During training, an embedding matrix is also learned. (In practical terms, the model includes this matrix, though the network itself does not.) The embedding matrix transforms one-hot-encoded V-dimensional vectors (where V is the vocabulary size) into dense e-dimensional embeddings.
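
As a rough illustration (with hypothetical sizes V = 10,000 and e = 256), looking up an embedding is equivalent to multiplying a one-hot vector by the embedding matrix:

```python
import torch
import torch.nn.functional as F

V, e = 10_000, 256                       # hypothetical vocabulary and embedding sizes
embedding = torch.nn.Embedding(V, e)     # the V x e embedding matrix

token_id = torch.tensor([42])
one_hot = F.one_hot(token_id, num_classes=V).float()  # shape (1, V)
dense = one_hot @ embedding.weight                    # shape (1, e): selects one row
assert torch.allclose(dense, embedding(token_id))     # same result as a direct lookup
```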

The encoder and the decoder are both recurrent; i.e., at each step they take both an exogenous input and the preceding hidden state as inputs. The external inputs to the encoder are the token embeddings.
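
Continuing the sketch above, the encoder recurrence might look like this, with hypothetical token ids standing in for a real input sequence:

```python
import torch

model = EncoderDecoder(vocab_size=10_000, emb_dim=256, n=512, m=512)
src_ids = torch.tensor([5, 17, 903, 2])    # hypothetical input token ids

h = torch.zeros(1, 512)                    # initial n-dimensional encoder state
for t in range(len(src_ids)):
    x_t = model.embedding(src_ids[t:t+1])  # embedding of the t-th input token
    h = model.encoder(x_t, h)              # new state from (external input, previous state)
# `h` now summarizes the entire input sequence.
```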

The initial external input to the decoder is a special “start-of-sequence” token, together with an m-dimensional projection of the final encoder state. Each decoder state is used to generate an “output token”: the hidden state is converted to a probability distribution over the vocabulary (typically using softmax), and a token is sampled from that distribution. This token is encoded into a one-hot vector, which becomes the external input for computing the next hidden state.
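
A single decoder step, continuing the sketch and assuming the projected final encoder state initializes the decoder’s hidden state, might look like:

```python
import torch

SOS = 1                                          # placeholder start-of-sequence id
s = torch.tanh(model.projection(h))              # m-dimensional projection of the final encoder state
x = model.embedding(torch.tensor([SOS]))         # start-of-sequence token as the first external input
s = model.decoder(x, s)                          # first decoder hidden state
probs = torch.softmax(model.output(s), dim=-1)   # hidden state -> probability distribution
token = torch.multinomial(probs, num_samples=1)  # sample the first output token
```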

The decoder keeps generating states until one of them produces a special “end-of-sequence” token, at which point the process terminates.
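
The full generation loop, continuing from the first sampled token above and assuming a placeholder EOS id plus a maximum-length safeguard, might look like:

```python
import torch

EOS, MAX_LEN = 2, 50                      # placeholder end-of-sequence id and a length cap
outputs = []
while token.item() != EOS and len(outputs) < MAX_LEN:
    outputs.append(token.item())
    x = model.embedding(token.view(1))    # sampled token (i.e., its one-hot passed through the embedding matrix)
    s = model.decoder(x, s)               # next decoder hidden state
    probs = torch.softmax(model.output(s), dim=-1)
    token = torch.multinomial(probs, num_samples=1)   # sample the next output token
print(outputs)                            # the generated token ids
```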

Implications

The entire input sequence has been compressed into a single hidden state by the time the decoder begins generating tokens. As a result, the model struggles to capture relationships between individual tokens, especially over long sequences. Bahdanau, Cho, and Bengio (2014) introduced attention specifically to address this problem.