Summary: The text provides a comprehensive overview of the developments in modeling attention within the Artificial Intelligence community, particularly in applications such as Natural Language Processing (NLP). It discusses the evolution of attention models, the taxonomy of attention types, key neural architectures utilizing attention, and various application domains benefiting from attention models. The text also touches upon the interpretability of neural networks facilitated by attention mechanisms and explores future research directions in this area.
The intuition behind attention can be best explained using human biological systems. For example, our visual processing system tends to focus selectively on some parts of the image, while ignoring other irrelevant information in a manner that can assist in perception [Xu et al. 2015]. Similarly, in several problems involving language, speech or vision, some parts of the input are more important than others. (View Highlight)
AM incorporates this notion of relevance by allowing the model to dynamically pay attention to only certain parts of the input that help in performing the task at hand effectively. (View Highlight)
They have been extensively used for improving interpretability of neural networks, which are otherwise considered as black-box models. (View Highlight)
they help overcome some challenges with Recurrent Neural Networks (RNNs), such as performance degradation as input length increases and the computational inefficiencies resulting from sequential processing of the input (View Highlight)
Note: The attention model inserts a neural layer that feeds into the S layer (what is that?)
the incorporation of attention in neural networks has led to significant gains in performance, provided greater insight into neural network’s inner working by facilitating interpretability, and also improved computational efficiency by eliminating sequential processing of input. (View Highlight)
The question is whether attention can be a stand-alone primitive for vision models instead of serving as an augmentation on top of convolutions. (View Highlight)
The field of model distillation aims to compress an existing large, complex model into a simpler model while retaining its accuracy (View Highlight)
The automated design of neural network architectures using neural architecture search (NAS) has outperformed human designs on various tasks. (View Highlight)
Existing attention mechanisms attend to individual items in the memory with a fixed granularity, e.g., a word token or a pixel in an image grid. Multi-instance attention is a generalization that allows attending to structurally adjacent group of items, e.g., 2D areas in images, or subsequences in natural language sentences. (View Highlight)
, training and deploying these models can be prohibitively costly for long sequences (such as in bioinformatics) as the standard self-attention mechanism of the Transformer uses 𝑂(𝑛²) time and space with respect to sequence length. (View Highlight)
New highlights added March 25, 2024 at 3:04 PM
the estimator uses a weighted average where the weights correspond to the relevance of each training instance to the query: ŷ = ∑_{i=1}^{n} α(x, x_i) y_i. Here the weighting function α(x, x_i) encodes the relevance of instance x_i for predicting at x. (View Highlight)
Note: The key difference between this and attention models is that the kernel is applied to the other elements in the input sequence, rather than to the training examples (as in this transductive classifier). That is, “attention models” exhibit SELF-attention, whereas the Nadaraya-Watson model (which must memorize its training data) exhibits attention to a query with respect to the training examples.
Note: From Claude:
In general, a “kernel” in machine learning is a function that measures the similarity between two inputs. It’s often used in the context of kernel methods, such as support vector machines (SVMs) or kernel regression, where the kernel function implicitly maps inputs into a higher-dimensional space.
More formally, a kernel function K(x, y) satisfies the following properties:
Symmetry: K(x, y) = K(y, x) for all x and y.
Positive semi-definiteness: For any finite set of inputs {x_1, …, x_n}, the n × n matrix K with elements K_ij = K(x_i, x_j) is positive semi-definite. This means that for any vector z, z^T * K * z ≥ 0.
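Note: a minimal sketch of the Nadaraya-Watson estimator above, using a Gaussian kernel (my choice for illustration; it satisfies the symmetry and positive semi-definiteness properties listed above). The weights α(x, x_i) are just normalized kernel similarities, nothing is learned; all names and the toy data are illustrative.

```python
import numpy as np

def gaussian_kernel(x, xi, bandwidth=1.0):
    """Similarity between query x and training point xi (symmetric, PSD)."""
    return np.exp(-np.sum((x - xi) ** 2) / (2 * bandwidth ** 2))

def nadaraya_watson_predict(x, X_train, y_train, bandwidth=1.0):
    """y_hat = sum_i alpha(x, x_i) * y_i, with kernel weights normalized to sum to 1."""
    scores = np.array([gaussian_kernel(x, xi, bandwidth) for xi in X_train])
    alpha = scores / scores.sum()          # relevance weights alpha(x, x_i)
    return alpha @ y_train                 # weighted average of training labels

# toy 1-D regression: y = sin(x) + noise
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 6, size=(50, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=50)
print(nadaraya_watson_predict(np.array([1.5]), X_train, y_train))
```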
attention mechanism in deep models can be viewed as a generalization that also allows learning the weighting function. (View Highlight)
A sequence-to-sequence model consists of an encoder-decoder architecture (View Highlight)
Note: My understanding of this is as follows:
0. During training, the model learns an embedding matrix of size V x D, where V is the size of the vocabulary and D << V is the desired size of the embedding. (This can also be one or more additional layers in the network, as with BERT and ELMo.)
1. The input sequence is tokenized. Each token has a pre-assigned numerical mapping. (Not sure what you do about unknown tokens.) The tokens are then encoded as one-hot vectors of length V, where V is the size of the vocabulary. These vectors are multiplied by the embedding matrix to produce N dense vectors of dimensionality D << V.
2. The N dense vectors are passed through the encoder, which takes as inputs both the next input vector and a hidden state representing the prior vector. This implies a “memory” because h_t is affected by h_{t-1}. The weights are the same for each pass; the encoder just has two inputs (prior hidden state and current vector).
3. After the last vector is processed by the encoder, the final hidden state is passed as an input to the decoder. The decoder generates length-V vectors that can be softmaxed to word probabilities. At some point, it generates a vector whose largest value corresponds to a stop token, and it ends. (A rough code sketch of these steps follows below.)
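Note: a rough NumPy sketch of steps 0 to 3 above. The matrix names, sizes, and the plain tanh recurrence are illustrative assumptions (a vanilla RNN cell rather than an LSTM/GRU), and the weights are random instead of trained.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H = 1000, 64, 128          # vocab size, embedding dim (D << V), hidden dim

# step 0: embedding matrix learned during training (random here for illustration)
E = rng.normal(scale=0.1, size=(V, D))

# step 1: tokenized input as vocabulary indices; indexing E is equivalent to
# multiplying a one-hot vector of length V by E
token_ids = np.array([12, 7, 305, 42])
embedded = E[token_ids]                   # shape (N, D)

# step 2: recurrent encoder: each step sees the current vector and the
# previous hidden state, which is where the "memory" comes from
W_x = rng.normal(scale=0.1, size=(D, H))
W_h = rng.normal(scale=0.1, size=(H, H))
h = np.zeros(H)
encoder_states = []
for x_t in embedded:
    h = np.tanh(x_t @ W_x + h @ W_h)
    encoder_states.append(h)

# step 3: the classic seq2seq decoder is initialized from the final hidden
# state h_T; an attention decoder instead gets to look at all encoder_states
h_T = encoder_states[-1]
print(h_T.shape)                          # (128,)
```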
New highlights added March 25, 2024 at 3:15 PM
the encoder has to compress all the input information into a single fixed length vector ℎ𝑇 that is passed to the decoder. (View Highlight)
it is unable to model alignment between input and output sequences, which is an essential aspect of structured output tasks such as translation or summarization (View Highlight)
Note: “Alignment” here refers to the correspondence of individual words in, e.g., a translation task (gato → cat, el → the, etc.)
AM aims at mitigating these challenges by allowing the decoder to access the entire encoded input sequence {ℎ1, ℎ2, …, ℎ𝑇}. The central idea is to induce attention weights 𝛼 over the input sequence to prioritize the set of positions where relevant information is present for generating the next output token. (View Highlight)
Note: The AM introduces an additional “block” of nodes that assigns relevance weights to each embedding in the input sequence.
New highlights added March 25, 2024 at 4:15 PM
The attention block in the architecture is responsible for automatically learning the attention weights 𝛼𝑖𝑗, which capture the relevance between ℎ𝑖 (the encoder hidden state) and 𝑠𝑗−1 (the decoder hidden state) (View Highlight)
Note: c_j is a vector input to the decoder when it is computing hidden state s_j. It is calculated by performing an elementwise summation of each hidden state h_i multiplied by its relevance coefficient α_ij.
The attention block applies the same scoring function to every encoder state, so it can generate T outputs from an arbitrary number of inputs T. (It is a feed-forward network rather than another recurrent network; see the highlight below.)
Note: e_ij is just an intermediate placeholder to make the expression for α_ij shorter. We’re saying that α_ij is the (softmax-normalized) probability of the score given by aligning decoder state s_{j-1} with encoder state h_i.
In the traditional framework, the context vector is just the last hidden state of the encoder, ℎ𝑇. In the attention based framework, the context at a given decoding step 𝑗 is a combination of all 𝑇 hidden states of the encoder and their corresponding attention weights (View Highlight)
New highlights added March 25, 2024 at 5:15 PM
The attention weights are learned by incorporating an additional feed forward neural network within the architecture. (View Highlight)
Note: In contrast to a recurrent network. This is a “vanilla” network. Each time it’s invoked (for each <h_i, s_j> pair), it does a separate inference. It’s basically just a function that happens to have been trained as a NN.
Note: It learns at training time; this is not something that happens during inference.
When the functions 𝑎 and 𝑝 are differentiable, the whole attention based encoder-decoder model becomes one large differentiable function and can be trained jointly with encoder-decoder components of the architecture using simple backpropagation. (View Highlight)
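Note: a minimal forward-pass sketch of the attention block described above, in the spirit of Bahdanau et al. [2015]: a small feed-forward network scores each encoder state h_i against the previous decoder state s_{j-1}, a softmax turns the scores into weights α_ij, and the context c_j is the weighted sum of the encoder states. The weight matrices W_h, W_s and vector v are random placeholders here; in a real model they are learned jointly with the encoder-decoder by backpropagation, as the highlight says.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def additive_attention(encoder_states, s_prev, W_h, W_s, v):
    """Return (alpha, context) for one decoding step.
    score e_ij = v^T tanh(W_h h_i + W_s s_{j-1})   (feed-forward alignment net)
    alpha_ij  = softmax_i(e_ij)
    c_j       = sum_i alpha_ij * h_i
    """
    scores = np.array([v @ np.tanh(W_h @ h_i + W_s @ s_prev)
                       for h_i in encoder_states])
    alpha = softmax(scores)                        # attention weights over inputs
    context = alpha @ np.stack(encoder_states)     # weighted sum of the h_i
    return alpha, context

# toy shapes: T=5 encoder states of size H=8, decoder state of size H
rng = np.random.default_rng(0)
H, A, T = 8, 16, 5
encoder_states = [rng.normal(size=H) for _ in range(T)]
s_prev = rng.normal(size=H)
W_h, W_s, v = rng.normal(size=(A, H)), rng.normal(size=(A, H)), rng.normal(size=A)

alpha, c_j = additive_attention(encoder_states, s_prev, W_h, W_s, v)
print(alpha.round(3), c_j.shape)                   # weights sum to 1, context is (8,)
```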
a mapping of sequence of keys 𝐾 to an attention distribution 𝛼 according to query (View Highlight)
Note: i.e., you can think of the attention block as a database that fetches the most relevant information for the current decoding step.
So the decoder “queries” the “database” with its current hidden state, and gets back a ranked (or at least rankable) list of relevant terms.
The resulting weights are used to combine the values into the “retrieved” context that is used to compute the next hidden state.
This is closely related to RAG, in that RAG augments this with calls to an external store.
Note: For each value vector (which is either just a hidden state from the encoder or that plus some additional information from elsewhere), we multiply it by its computed weight. The sum of such vectors is the context. Hence, for each hidden state, we encode scored “search results” that can be used as an input to the decoder.
Here the instance 𝑥 is the query, the training data points 𝑥𝑖 are keys and their labels 𝑦𝑖 are values. (View Highlight)
New highlights added March 25, 2024 at 6:15 PM
The first major category of alignment functions is based on a notion of comparing query representations with key representations. For example, one approach is to compute either the cosine similarity or the dot product between the key and query representations (see Table 2) (View Highlight)
Note: i.e., they are projected into a common space (if needed) and then subjected to a similarity function.
The second major category of alignment functions combines the keys and query to form a joint representation (View Highlight)
Note: i.e., you just get a vector that’s the concatenation of both, which is the input to a single-vector-valued alignment function. You do this if you want to align using a NN.
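Note: for concreteness, a sketch of the two families of alignment functions: similarity-based scores (dot product, scaled dot product, cosine) that compare the query with a key, and a concatenation-based score that feeds the joint [key; query] representation through a small feed-forward network. The specific shapes and the scaling constant are assumptions for illustration.

```python
import numpy as np

def dot_score(q, k):
    return q @ k

def scaled_dot_score(q, k):
    return (q @ k) / np.sqrt(len(k))       # scaling as used in the Transformer

def cosine_score(q, k):
    return (q @ k) / (np.linalg.norm(q) * np.linalg.norm(k) + 1e-9)

def concat_score(q, k, W, v):
    """Joint representation [k; q] passed through a one-hidden-layer network."""
    return v @ np.tanh(W @ np.concatenate([k, q]))

rng = np.random.default_rng(0)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)
W, v = rng.normal(size=(16, 2 * d)), rng.normal(size=16)
print(dot_score(q, k), scaled_dot_score(q, k), cosine_score(q, k), concat_score(q, k, W, v))
```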
Distribution functions map alignment function scores to attention weights. The most commonly used distribution functions are logistic sigmoid and softmax. (View Highlight)
In case of softmax function, attention weights can be interpreted as the probability that the corresponding element is the most relevant. (View Highlight)
Distribution functions such as sparsemax [Martins and Astudillo 2016] and sparse entmax [Martins et al. 2020; Peters et al. 2019] are able to produce sparse alignments and assign nonzero probability to only a short list of plausible outputs. (View Highlight)
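Note: a small sketch contrasting softmax with sparsemax as distribution functions. The sparsemax routine follows the simplex-projection recipe from Martins and Astudillo [2016] as I understand it; treat it as a rough rendering rather than reference code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Euclidean projection of scores z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]                       # descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum               # coordinates that stay nonzero
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

scores = np.array([1.2, 0.9, 0.1, -1.0])
print(softmax(scores).round(3))     # dense: every element gets some probability
print(sparsemax(scores).round(3))   # sparse: low-scoring elements get exactly zero
```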
New highlights added March 26, 2024 at 11:08 AM
several extensions of attention modeling have been proposed in the literature to solve specific problem formulations (View Highlight)
Most attention models employed for translation [Bahdanau et al. 2015], image captioning [Xu et al. 2015], and speech recognition [Chan et al. 2016] fall within the distinctive type of attention. (View Highlight)
A co-attention model operates on multiple input sequences at the same time and jointly learns their attention weights, to capture interactions between these inputs. (View Highlight)
for tasks such as text classification and recommendation, input is a sequence but the output is not a sequence. In this scenario, attention can be used for learning relevant tokens in the input sequence for every token in the same input sequence. In other words, the key and query states belong to the same sequence for this type of attention (View Highlight)
attention may be applied on multiple levels of abstraction of the input sequence in a sequential manner. The output (context vector) of the lower abstraction level becomes the query state for the higher abstraction level. (View Highlight)
It first builds an attention based representation of sentences with first level attention applied on sequence of word embedding vectors. Then it aggregates these sentence representations using a second level attention to form a representation of the document. This final representation of the document is used as a feature vector for the classification task. (View Highlight)
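Note: a minimal sketch of the two-level idea above (hierarchical attention pooling): word-level attention with a learned query vector pools word vectors into sentence vectors, and sentence-level attention pools those into a document feature vector. Encoders are omitted, all weights are random placeholders, and the function and variable names are mine.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(states, W, b, u):
    """Score each state against a learned query vector u, then take the weighted sum."""
    keys = np.tanh(states @ W + b)           # (n, A)
    alpha = softmax(keys @ u)                # (n,) attention weights
    return alpha @ states                    # pooled representation

rng = np.random.default_rng(0)
D, A = 16, 32
W_w, b_w, u_w = rng.normal(size=(D, A)), np.zeros(A), rng.normal(size=A)  # word level
W_s, b_s, u_s = rng.normal(size=(D, A)), np.zeros(A), rng.normal(size=A)  # sentence level

# document: 3 sentences with 5, 7, and 4 word vectors each (from some encoder)
doc = [rng.normal(size=(n, D)) for n in (5, 7, 4)]

sentence_vecs = np.stack([attention_pool(s, W_w, b_w, u_w) for s in doc])
doc_vec = attention_pool(sentence_vecs, W_s, b_s, u_s)   # feature for the classifier
print(doc_vec.shape)                                     # (16,)
```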
when multiple attention layers are used, higher level attention layers utilize the knowledge from lower level attention layers (visual information) and the refined query vector (question information) to extract more fine-grained and smaller regions within the image. (View Highlight)
it uses a weighted average of all hidden states of the input sequence to build the context vector. (View Highlight)
Note: soft attention
hard attention model in which the context vector is computed from stochastically sampled hidden states in the input sequence (View Highlight)
The hard attention model is beneficial due to decreased computational cost, but making a hard decision at every position of the input renders the resulting framework non-differentiable and difficult to optimize (View Highlight)
The key idea is to first detect an attention point or position within the input sequence and pick a window around that position to create a local soft attention model. The position within input sequence can either be set (monotonic alignment) or learned by a predictive function (predictive alignment). (View Highlight)
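Note: a sketch of the predictive-alignment variant as I understand it (in the spirit of Luong-style local attention): a small network predicts a position p_t in the input, a window is taken around it, and the soft attention inside the window is reweighted by a Gaussian centered at p_t. The prediction network, window size, and Gaussian width are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def local_attention(encoder_states, s_dec, W_p, v_p, window=2):
    """Predictive-alignment local attention (sketch).
    p_t   = S * sigmoid(v_p^T tanh(W_p s_dec))   (predicted focus position)
    alpha = softmax(dot scores inside the window) * Gaussian centered at p_t
    """
    S = len(encoder_states)
    p_t = S * (1.0 / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ s_dec)))))
    lo, hi = max(0, int(p_t) - window), min(S, int(p_t) + window + 1)
    local = encoder_states[lo:hi]                       # (w, H) window of states
    scores = softmax(local @ s_dec)                     # soft attention in the window
    positions = np.arange(lo, hi)
    sigma = window / 2.0
    gauss = np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
    alpha = scores * gauss
    alpha = alpha / alpha.sum()
    return alpha @ local                                # local context vector

rng = np.random.default_rng(0)
H, S = 8, 12
encoder_states = rng.normal(size=(S, H))
s_dec = rng.normal(size=H)
W_p, v_p = rng.normal(size=(H, H)), rng.normal(size=H)
print(local_attention(encoder_states, s_dec, W_p, v_p).shape)   # (8,)
```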
Note: In multi-representational AM, the input sequence is embedded multiple times, and these representations are then weighted as part of the attention process.
Note: Single-representation (ordinary) attention computes a weight for each hidden state of the encoder, where each hidden state corresponds to a single input embedding. In multi-dimensional attention, the embedding dimensions are also weighted.
AM can take any input representation and reduce it to a single fixed length context vector to be used in the decoding step. Thus, it allows one to decouple the input representation from the output. One could exploit this benefit to introduce hybrid encoder-decoders, the most popular being Convolutional Neural Network (CNN) as an encoder, and RNN or Long Short Term Memory (LSTM) as the decoder. This type of architecture is particularly useful for many multi-modal tasks such as Image and Video Captioning, Visual Question Answering and Speech Recognition. (View Highlight)
Note: This discussion is a bit opaque and would benefit from some equations. I’m not going to work too hard to understand it; I’ll get the same information from “Attention is All You Need” and/or a transformer-specific review.
the authors in [Vaswani et al. 2017] proposed the Transformer architecture, which completely eliminates sequential processing and recurrent connections. It relies only on a self attention mechanism to capture global dependencies between input and output. (View Highlight)
rather than only computing the attention once, the multi-head mechanism splits the input into fixed-size segments and then computes the scaled dot-product attention over each segment in parallel. (View Highlight)
The independent attention outputs are then concatenated into expected dimensions. (View Highlight)
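Note: a sketch of multi-head scaled dot-product self-attention as I understand Vaswani et al. [2017]: each head projects the input into its own query/key/value subspace, computes softmax(QKᵀ/√d_k)·V in parallel, and the head outputs are concatenated (a final output projection, omitted here, normally follows). Projection matrices are random placeholders.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))      # (n, n) attention matrix
    return weights @ V

def multi_head_self_attention(X, heads):
    """heads: list of (W_q, W_k, W_v) projections; head outputs are concatenated."""
    outputs = [scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
               for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1)        # (n, num_heads * d_k)

rng = np.random.default_rng(0)
n, d_model, num_heads = 6, 32, 4
d_k = d_model // num_heads
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
         for _ in range(num_heads)]
print(multi_head_self_attention(X, heads).shape)   # (6, 32)
```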
Transformers can capture global/long range dependencies between input and output, support parallel processing, require minimal inductive biases (prior knowledge), demonstrate scalability to large sequences and datasets, and allow domain-agnostic processing of multiple modalities (text, images, speech) using similar processing blocks. (View Highlight)
(i) input text has to be split into a fixed number of segments, resulting in context fragmentation; (ii) high parametric complexity, which results in high computational cost and resource requirements; (iii) large training data requirements due to minimal inductive bias; and (iv) difficulty in interpreting what the self attention mechanism learns and what the contribution of input tokens is towards predictions. (View Highlight)
Note: The quality of the review really seems to be dropping. I’m going to skip down to the applications section.
We can think of memory networks as generally having three components: (i) A process that “reads” the raw database and converts it into distributed representations. (ii) A list of feature vectors storing the output of the reader. This can be understood as a “memory” containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them. (View Highlight)
(iii) A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, each with a different weight). (View Highlight)
In the NLP domain, attention assists in focusing on the relevant parts of the input sequence, aligning input and output sequences, and capturing long-range dependencies for longer sequences. (View Highlight)
Question Answering problems have made use of attention to better understand questions by focusing on relevant parts of the question [Hermann et al. 2015], and to store large amounts of information using memory networks to help find answers [Sukhbaatar et al. 2015]. (View Highlight)
In the Sentiment Analysis task, self attention helps to focus on the words that are important for determining the sentiment of the input. (View Highlight)
Text Classification and Text Representation problems mainly make use of self attention to build more effective sentence or document representations/embeddings. (View Highlight)