In Inside Deep Learning, Raff builds his entire chapter on attention around a model that finds the image corresponding to the largest value in a bag of MNIST images. To achieve this, he adds a softmax layer after the label predictions: because larger-valued labels get larger softmax scores, the smaller labels are downweighted. Raff claims that this forces the model to “attend to” the right values, and is therefore a toy example of attention.
Needless to say, this was about the time that I started to regret spending so much time with this book. I’ll admit that it’s slightly clever to (ab)use softmax to achieve exponential separation across a sequence of ten equally spaced integers, and I’ll also admit that this encodes a strong prior directly into the model.
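To see why the trick works at all, here is a quick illustration (my own, not the book’s code): softmax over the labels 0–9 turns each unit step into a constant multiplicative factor of e, so the weight mass piles up on the largest label.

```python
import torch

# Softmax over equally spaced integers 0..9: each successive weight is
# e ≈ 2.718 times the previous one, so the largest label dominates.
labels = torch.arange(10, dtype=torch.float32)
weights = torch.softmax(labels, dim=0)

print(weights)
# roughly [7.8e-05, 2.1e-04, ..., 0.23, 0.63]: the top label gets ~63%
# of the total weight, the bottom one essentially none.
```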
But if “attention” merely referred to any mechanism that causes a model to increase the weight given to relevant inputs, then literally any learnable parameter would be an “attention mechanism.” This nullifies the utility of the term.
Attention is a mechanism for computing weights dynamically at inference time. It does this by measuring the relevance of a set of input “keys” to a separate “query” input. The relevance scores become weights over a set of “values” corresponding to the keys, and those weights are used to construct an expected value over the value features for the query. The softmax trick has literally nothing to do with this.
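For contrast, here is a minimal sketch of standard scaled dot-product attention for a single query; the names and shapes are my own illustration, not anything from the book. Note that softmax appears here only as a normalizer of relevance scores, not as a device for ranking label magnitudes.

```python
import torch

def scaled_dot_product_attention(query, keys, values):
    """Single-query attention sketch.

    query:  (d,)    the item we want context for
    keys:   (n, d)  one key per input element
    values: (n, d)  the features to be averaged
    """
    d = query.shape[-1]
    # Relevance of each key to the query, computed at inference time.
    scores = keys @ query / d**0.5            # (n,)
    # Normalize scores into weights that sum to 1.
    weights = torch.softmax(scores, dim=-1)   # (n,)
    # Expected value of the value vectors under those weights.
    return weights @ values                   # (d,)

# Toy usage with random inputs.
q = torch.randn(16)
K = torch.randn(5, 16)
V = torch.randn(5, 16)
out = scaled_dot_product_attention(q, K, V)
print(out.shape)  # torch.Size([16])
```

The point of the contrast: here the weights depend on the interaction between the query and the keys at inference time, not on a prior baked into the label encoding.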