Summary: The Transformer model uses self-attention to understand relationships between words in a sentence and encode them effectively. Multi-headed attention enhances the model by creating different representation subspaces. Positional encoding helps the model understand the order of words in the input sequence.
The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. (View Highlight)
Note: In a convolutional network, I have an intuition about what the successive layers mean: very roughly, they represent higher-order features (such as shapes in an image). I am having trouble forming such an intuition about transformers.
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position. (View Highlight)
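Note: A minimal numpy sketch of that position-wise feed-forward network (layer sizes 512 -> 2048 -> 512 as in the paper's base model; the weights here are random placeholders, not trained values):

```python
import numpy as np

d_model, d_ff, seq_len = 512, 2048, 3          # 2048 is the paper's inner FFN size
x = np.random.randn(seq_len, d_model)          # one 512-dim vector per position

# Two linear layers with a ReLU in between: FFN(x) = max(0, x W1 + b1) W2 + b2
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

# The exact same weights are applied independently at every position.
ffn_out = np.maximum(0, x @ W1 + b1) @ W2 + b2   # shape: (seq_len, d_model)
```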
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models). (View Highlight)
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512 – in the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. (View Highlight)
each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. (View Highlight)
Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing. (View Highlight)
for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process. (View Highlight)
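Note: To make the Q/K/V step concrete, a rough numpy sketch for a single word (random matrices stand in for the trained WQ, WK, WV; d_k = 64 as in the article):

```python
import numpy as np

d_model, d_k = 512, 64                 # embedding size, per-head Q/K/V size
x = np.random.randn(d_model)           # embedding of one word

# Random stand-ins for the three trained projection matrices
WQ = np.random.randn(d_model, d_k)
WK = np.random.randn(d_model, d_k)
WV = np.random.randn(d_model, d_k)

q, k, v = x @ WQ, x @ WK, x @ WV       # each is a 64-dim vector for this word
```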
this is an architecture choice to make the computation of multiheaded attention (mostly) constant. (View Highlight)
Note: How does this work?
What are the “query”, “key”, and “value” vectors?
They’re abstractions that are useful for calculating and thinking about attention. (View Highlight)
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. (View Highlight)
The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example). (View Highlight)
The sixth step is to sum up the weighted value vectors. (View Highlight)
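Note: Putting the six steps together for a whole (tiny) sequence, a hedged numpy sketch with the same kind of random placeholder weights, including the scaling by sqrt(d_k) described in the article:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_k, seq_len = 512, 64, 4
X = np.random.randn(seq_len, d_model)      # embeddings for a 4-word sentence
WQ = np.random.randn(d_model, d_k)         # random stand-ins for the trained matrices
WK = np.random.randn(d_model, d_k)
WV = np.random.randn(d_model, d_k)

Q, K, V = X @ WQ, X @ WK, X @ WV           # step 1: a q, k, v vector per word
scores = Q @ K.T / np.sqrt(d_k)            # steps 2-3: dot products, scaled by sqrt(64)
weights = softmax(scores)                  # step 4: softmax over each row
Z = weights @ V                            # steps 5-6: weight the value vectors and sum
# Z[i] is the self-attention output for the word at position i
```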
Note: The value vector at each position i represents the “meaning” of that word, as learned by that attention head. The attention output at position i is the average of these “meanings,” weighted by a measure of their relevance to position i. Hence the attention output at position i can be understood as the expected value of the “meaning” of the token at position i.
Note: i.e., having multiple attention heads means that each head can learn a different aspect of the semantics/meaning of input sequences.
we need a way to condense these eight down into a single matrix.
How do we do that? We concatenate the matrices, then multiply them by an additional weight matrix WO. (View Highlight)
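Note: A sketch of that condensing step (8 heads of 64 dims each; WO is a random placeholder for the jointly trained weight matrix):

```python
import numpy as np

n_heads, d_k, d_model, seq_len = 8, 64, 512, 4

# Pretend each head has already produced its own Z matrix (random stand-ins here)
Z_heads = [np.random.randn(seq_len, d_k) for _ in range(n_heads)]

WO = np.random.randn(n_heads * d_k, d_model)    # trained jointly with the model

Z_concat = np.concatenate(Z_heads, axis=-1)     # (seq_len, 8 * 64) = (seq_len, 512)
Z = Z_concat @ WO                               # (seq_len, 512), fed to the feed-forward layer
```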
To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. (View Highlight)
Note: This is too vague for me to understand. I’ll need to see it in the actual paper.
New highlights added April 3, 2024 at 9:11 PM
To give the model a sense of the order of the words, we add positional encoding vectors — the values of which follow a specific pattern. (View Highlight)
Note: The positional vector is element-wise added to the embedding. This doesn’t ruin the embedding because the embedding matrix is learned on data to which this encoding is being added.
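Note: A sketch of one common sinusoidal variant of this encoding (the paper's formula; exact implementations differ slightly), added element-wise to placeholder embeddings:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions get cosine
    return pe

embeddings = np.random.randn(4, 512)                # placeholder word embeddings
x = embeddings + positional_encoding(4, 512)        # element-wise add, same shape out
```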
The output of the top encoder is then transformed into a set of attention vectors K and V. (View Highlight)
Note: I think this is wrong. K and V are projections of the input vectors, but there is only one set of attention vectors, which is the matrix Z. (Technically, there is a Z_h matrix for each attention head; these are concatenated and then projected through WO to get the final Z.)
But what I think he means is that every layer of the decoder uses the same K and V matrices (from the last layer of the encoder), but uses the query vectors Q from the layer before.
This “cross-attention” (or “Encoder-Decoder Attention”) differs from the “self-attention” discussed before mainly in where its inputs come from: the Queries are projected from the decoder layer below, while the Keys and Values are linear projections of the encoder’s output rather than of the decoder’s own prior output.
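Note: A hedged numpy sketch of that reading (all weights are random placeholders): the Queries come from the decoder's vectors, while the Keys and Values come from the top encoder's output.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_k = 512, 64
enc_out = np.random.randn(5, d_model)       # output of the top encoder (5 source words)
dec_x   = np.random.randn(3, d_model)       # decoder vectors so far (3 target words)

WQ = np.random.randn(d_model, d_k)
WK = np.random.randn(d_model, d_k)
WV = np.random.randn(d_model, d_k)

Q = dec_x   @ WQ                            # queries come from the decoder
K = enc_out @ WK                            # keys and values come from the encoder output
V = enc_out @ WV
Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V     # (3, 64): one output vector per target position
```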
Note: There’s an animation present in the article that Reader has stripped out.
In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation. (View Highlight)
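Note: A sketch of that masking step, applied to the raw scores before the softmax (random Q and K stand in for the decoder's projections):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_k = 4, 64
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)

scores = Q @ K.T / np.sqrt(d_k)                    # (4, 4) raw attention scores
mask = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal mark future positions
scores = np.where(mask == 1, -np.inf, scores)      # set future positions to -inf
weights = softmax(scores)                          # each row attends only to itself and earlier positions
```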
The Linear layer is a simple fully connected neural network (View Highlight)
Note: i.e., with a linear activation function. This is how the model learns projection matrices (including the ones used in the self-attention blocks).
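Note: A sketch of that final step (the vocabulary size here is a made-up placeholder): the decoder's output vector is multiplied by one learned matrix to get a logit per vocabulary word, and a softmax turns those logits into probabilities.

```python
import numpy as np

d_model, vocab_size = 512, 10000                  # vocab size is a placeholder

dec_out = np.random.randn(d_model)                # vector produced by the decoder stack
W_vocab = np.random.randn(d_model, vocab_size)    # the "Linear" layer: one learned weight matrix

logits = dec_out @ W_vocab                        # one score per word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax: probabilities over the vocabulary
next_word_id = int(np.argmax(probs))              # the highest-probability word
```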