I have three different ways that I like to think about attention:
- Attention is a mechanism for dynamically assigning weights to inputs at inference time.
- Given a feature space with spatial structure (such as in a sequence or an image), attention is a means for finding the expected value of the features at a particular location in space.
- Attention is a form of statefulness, which (in its most general form) provides a database-like abstraction within neural networks.
Specifically, attention is a mechanism by which a model can dynamically upweight certain “key” inputs based on the relevance of those inputs to another “query” input. The relevance of a key vector to the query is computed by a scoring (compatibility) function, and the resulting scores are normalized, typically with a softmax, to produce the weights.
This is fundamentally different from the process by which a model learns, during training, that certain features are relevant. Merely learning that some inputs are more important than others is not “attention”; it is model training. This is a fine but critical distinction that is lost even on experienced practitioners.
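To make that distinction concrete, here is a minimal NumPy sketch of attention in its scaled dot-product form (one common instantiation, not the only one); the function names, shapes, and toy data are illustrative rather than taken from any particular library. The point is that the weights are computed from the query and keys at inference time; nothing inside `dot_product_attention` is a trained parameter.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(query, keys, values):
    """query: (d,), keys: (n, d), values: (n, d_v)."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)  # relevance of each key to this particular query
    weights = softmax(scores)           # normalize the scores into a distribution over keys
    return weights @ values, weights    # expected value of the values under that distribution

# Toy example: 4 key/value pairs with 8-dimensional features.
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
query = rng.normal(size=(8,))
output, weights = dot_product_attention(query, keys, values)
print(weights)  # sums to 1; a different query yields different weights from the same keys
```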
Attention is a generalization of Nadaraya-Watson regression¹, defined as

$$\hat{y}(x) = \sum_{i=1}^{n} \alpha(x, x_i)\, y_i, \qquad \alpha(x, x_i) = \frac{K(x, x_i)}{\sum_{j=1}^{n} K(x, x_j)},$$

where $K$ is a kernel measuring the similarity of the query point $x$ to each observed input $x_i$, and the $y_i$ are the corresponding observed values. In this formulation, attention is any mechanism for learning the weighting function $\alpha(x, x_i)$.
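As a rough sketch of that connection, here is Nadaraya-Watson regression with a Gaussian kernel; the kernel is fixed and hand-chosen rather than learned, which is exactly what separates it from attention proper. Names and toy data are illustrative.

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, bandwidth=1.0):
    """Predict y at x_query as a kernel-weighted average of the training targets."""
    # Gaussian kernel weights K(x_query, x_i): nearby training inputs count more
    k = np.exp(-0.5 * ((x_query - x_train) / bandwidth) ** 2)
    alpha = k / k.sum()     # normalized weights, analogous to attention weights
    return alpha @ y_train  # weighted average of the "values"

x_train = np.linspace(0, 10, 50)
y_train = np.sin(x_train)
print(nadaraya_watson(5.0, x_train, y_train, bandwidth=0.5))  # close to sin(5.0)
```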
Attention per se was introduced in Bahdanau, Cho, and Bengio (2014) as a sequence alignment mechanism for recurrent encoder-decoder models. This original formulation used additive attention, in which relevance is scored by a small feed-forward network.
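A minimal sketch of additive scoring in that spirit; the dimensions are arbitrary and the randomly initialized matrices stand in for parameters that would normally be learned jointly with the rest of the model.

```python
import numpy as np

def additive_scores(query, keys, W_q, W_k, v):
    """Additive scoring: score(q, k_i) = v . tanh(q W_q + k_i W_k)."""
    hidden = np.tanh(query @ W_q + keys @ W_k)  # (n, d_att); the query broadcasts over the keys
    return hidden @ v                           # (n,) unnormalized alignment scores, one per key

rng = np.random.default_rng(0)
d_q, d_k, d_att, n = 6, 8, 16, 5
W_q, W_k = rng.normal(size=(d_q, d_att)), rng.normal(size=(d_k, d_att))
v = rng.normal(size=(d_att,))
query, keys = rng.normal(size=(d_q,)), rng.normal(size=(n, d_k))
print(additive_scores(query, keys, W_q, W_k, v))  # softmax these to get attention weights
```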
Three years later, Vaswani et al. (2017) replaced explicit recurrence with a form of attention in which the key, value, and query vectors all refer to the same sequence (self-attention). This retains the essential characteristic of recurrence, i.e., stateful awareness of context.
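A minimal sketch of single-head self-attention in that sense, reusing the scaled dot-product form from above; the projection matrices are illustrative stand-ins for learned parameters.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: queries, keys, and values all come from the same sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                    # (n, d_att) each
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                # every position scores every position
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # each output mixes context from the whole sequence

rng = np.random.default_rng(0)
n, d_model, d_att = 5, 12, 8
X = rng.normal(size=(n, d_model))                          # a toy "sequence" of 5 token vectors
W_q, W_k, W_v = (rng.normal(size=(d_model, d_att)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)              # (5, 8): one context-aware output per position
```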