- From Zhuoran et al., 2024
- Based on the observation that the matrix multiplications in SDPA (the linear part) can be reassociated to involve far fewer computations (see the sketch after this list)
- In so doing, though, softmax is applied to the query and key matrices separately instead of to their product
- Hence some fine-grained, position-to-position spatial relationships are lost
- For vision applications, where regional effects are much more important than pixel-by-pixel differences, this change in softmax has limited impact
- Hence in these applications, the vastly improved time and memory scaling are an unequivocal win
- However, for NLP applications, where the exact relationships between elements of a sequence can be determinative, this loss of resolution will likely reduce accuracy
- This is not to say that efficient attention could not be beneficial in certain NLP applications, especially with very long context lengths
- There are, however, other techniques (such as hierarchical or sliding-window attention, the latter sketched below) that address long contexts without this loss of spatial resolution
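
A minimal NumPy sketch of the rearrangement, assuming the efficient-attention formulation in which softmax is applied to the queries (row-wise) and keys (column-wise) separately; the shapes, scaling, and function names here are illustrative, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # O(n^2 * d): softmax over the full n x n score matrix Q K^T
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def efficient_attention(Q, K, V):
    # O(n * d^2): normalise Q and K separately, then compute K^T V first,
    # which is a small d x d "global context" matrix instead of an n x n one
    q = softmax(Q, axis=-1)   # softmax over features, per query position
    k = softmax(K, axis=0)    # softmax over positions, per key feature
    return q @ (k.T @ V)
```

Because the intermediate K^T V matrix is only d x d, time and memory grow linearly with the sequence length n instead of quadratically.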
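For contrast, a naive sketch of sliding-window attention (the window size is an arbitrary illustrative choice): the full query-key softmax is kept, but each position attends only within a local band, so the exact relationships inside that band are preserved:

```python
def sliding_window_attention(Q, K, V, window=4):
    # Dense-mask version for clarity; real implementations compute only the banded scores
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= window  # keep keys within +/- window positions
    scores = np.where(band, scores, -np.inf)              # mask everything outside the band
    return softmax(scores, axis=-1) @ V
```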