- From Zhuoran et al., 2024
- Based on the observation that the matrix multiplications in SDPA (the linear part) can be reassociated to involve far fewer computations (see the sketch after this list)
- In so doing, though, softmax is applied to the query and key matrices separately instead of to their product
- Hence some fine-grained, position-to-position spatial relationships are lost
- For vision applications, where regional effects are much more important than pixel-by-pixel differences, this change in softmax has limited impact
- Hence in these applications, the vastly improved time and memory scaling are an unequivocal win
- However, for NLP applications, where the exact relationships between elements of a sequence can be determinative, this loss of resolution will likely reduce accuracy
- This is not to say that efficient attention could not be beneficial in certain NLP applications, especially with very long context lengths
- There are, however, other techniques (such as hierarchical or sliding-window attention, the latter sketched below) that address long contexts without this loss of spatial resolution
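
A minimal NumPy sketch of the rearrangement, assuming the efficient-attention formulation in which softmax is applied to the queries (row-wise) and keys (column-wise) separately; the shapes, scaling, and function names here are illustrative, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # O(n^2 * d): softmax over the full n x n score matrix Q K^T
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def efficient_attention(Q, K, V):
    # O(n * d^2): normalise Q and K separately, then compute K^T V first,
    # which is a small d x d "global context" matrix instead of an n x n one
    q = softmax(Q, axis=-1)   # softmax over features, per query position
    k = softmax(K, axis=0)    # softmax over positions, per key feature
    return q @ (k.T @ V)
```

Because the intermediate K^T V matrix is only d x d, time and memory grow linearly with the sequence length n instead of quadratically.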
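For contrast, a naive sketch of sliding-window attention (the window size is an arbitrary illustrative choice): the full query-key softmax is kept, but each position attends only within a local band, so the exact relationships inside that band are preserved:

```python
def sliding_window_attention(Q, K, V, window=4):
    # Dense-mask version for clarity; real implementations compute only the banded scores
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= window  # keep keys within +/- window positions
    scores = np.where(band, scores, -np.inf)              # mask everything outside the band
    return softmax(scores, axis=-1) @ V
```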