I have three different ways that I like to think about attention:
- Attention is a mechanism for dynamically assigning weights to inputs at inference time.
- Given a feature space with spatial structure (such as in a sequence or an image), attention is a means for finding the expected value of the features at a particular location in space.
- Attention is a form of statefulness, which (in its most general form) provides a database-like abstraction within neural networks.
Specifically, attention is a mechanism by which a model can dynamically upweight certain “key” inputs based on the relevance of those inputs to another “query” input. The relevance of a key vector to the query is computed by a scoring (compatibility) function, and the resulting scores are normalized, typically with a softmax, to produce the weights.
This is fundamentally different from the process by which a model learns, during training, that certain features are relevant. Merely learning that some inputs are more important than others is not “attention”; it is model training. This is a fine but critical distinction that is lost even on experienced practitioners.
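To make that distinction concrete, here is a minimal NumPy sketch of attention in its scaled dot-product form (one common instantiation, not the only one); the function names, shapes, and toy data are illustrative rather than taken from any particular library. The point is that the weights are computed from the query and keys at inference time; nothing inside `dot_product_attention` is a trained parameter.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(query, keys, values):
    """query: (d,), keys: (n, d), values: (n, d_v)."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)  # relevance of each key to this particular query
    weights = softmax(scores)           # normalize the scores into a distribution over keys
    return weights @ values, weights    # expected value of the values under that distribution

# Toy example: 4 key/value pairs with 8-dimensional features.
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
query = rng.normal(size=(8,))
output, weights = dot_product_attention(query, keys, values)
print(weights)  # sums to 1; a different query yields different weights from the same keys
```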
Attention is a generalization of Nadaraya-Watson regression¹, defined as

$$\hat{y}(x) = \sum_{i=1}^{n} \alpha(x, x_i)\, y_i, \qquad \alpha(x, x_i) = \frac{K(x, x_i)}{\sum_{j=1}^{n} K(x, x_j)},$$

where $K$ is a kernel measuring the similarity of the query point $x$ to each observed input $x_i$, and the $y_i$ are the corresponding observed values. In this formulation, attention is any mechanism for learning the weighting function $\alpha(x, x_i)$.
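As a rough sketch of that connection, here is Nadaraya-Watson regression with a Gaussian kernel; the kernel is fixed and hand-chosen rather than learned, which is exactly what separates it from attention proper. Names and toy data are illustrative.

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, bandwidth=1.0):
    """Predict y at x_query as a kernel-weighted average of the training targets."""
    # Gaussian kernel weights K(x_query, x_i): nearby training inputs count more
    k = np.exp(-0.5 * ((x_query - x_train) / bandwidth) ** 2)
    alpha = k / k.sum()     # normalized weights, analogous to attention weights
    return alpha @ y_train  # weighted average of the "values"

x_train = np.linspace(0, 10, 50)
y_train = np.sin(x_train)
print(nadaraya_watson(5.0, x_train, y_train, bandwidth=0.5))  # close to sin(5.0)
```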
Attention per se was introduced in Bahdanau, Cho, and Bengio (2014) as a sequence alignment mechanism for recurrent encoder-decoder models. This original formulation used additive attention, in which relevance is scored by a small feed-forward network.
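A minimal sketch of additive scoring in that spirit; the dimensions are arbitrary and the randomly initialized matrices stand in for parameters that would normally be learned jointly with the rest of the model.

```python
import numpy as np

def additive_scores(query, keys, W_q, W_k, v):
    """Additive scoring: score(q, k_i) = v . tanh(q W_q + k_i W_k)."""
    hidden = np.tanh(query @ W_q + keys @ W_k)  # (n, d_att); the query broadcasts over the keys
    return hidden @ v                           # (n,) unnormalized alignment scores, one per key

rng = np.random.default_rng(0)
d_q, d_k, d_att, n = 6, 8, 16, 5
W_q, W_k = rng.normal(size=(d_q, d_att)), rng.normal(size=(d_k, d_att))
v = rng.normal(size=(d_att,))
query, keys = rng.normal(size=(d_q,)), rng.normal(size=(n, d_k))
print(additive_scores(query, keys, W_q, W_k, v))  # softmax these to get attention weights
```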
Three years later, Vaswani et al. (2017) replaced explicit recurrence with a form of attention in which the key, value, and query vectors all refer to the same sequence (self-attention). This retains the essential characteristic of recurrence, i.e., stateful awareness of context.
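A minimal sketch of single-head self-attention in that sense, reusing the scaled dot-product form from above; the projection matrices are illustrative stand-ins for learned parameters.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: queries, keys, and values all come from the same sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                    # (n, d_att) each
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                # every position scores every position
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # each output mixes context from the whole sequence

rng = np.random.default_rng(0)
n, d_model, d_att = 5, 12, 8
X = rng.normal(size=(n, d_model))                          # a toy "sequence" of 5 token vectors
W_q, W_k, W_v = (rng.normal(size=(d_model, d_att)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)              # (5, 8): one context-aware output per position
```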