• From Zhuoran et al., 2024
  • Based on the observation that the linear part of SDPA (scaled dot-product attention) can be rearranged to involve fewer computations (see the first sketch after this list)
  • In so doing, though, softmax is applied to the queries and keys separately instead of to the query–key product
  • Hence some fine-grained spatial relationships between positions are lost
  • For vision applications, where regional effects are much more important than pixel-by-pixel differences, this change in softmax has limited impact
  • Hence, in these applications, the improvement from quadratic to linear scaling in the number of positions, in both time and memory, is an unequivocal win
  • However, for NLP applications, where the exact relationships between elements of a sequence can be determinative, it will likely reduce accuracy
  • This is not to say that efficient attention could not be beneficial in certain NLP applications, especially with very long context lengths
  • Though there are other techniques (such as hierarchical or sliding-window attention; see the second sketch below) that address long contexts without this loss of resolution
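
A minimal PyTorch sketch of the rearrangement described above; the function names and shapes here are illustrative, not taken from the paper's code. The key point is that normalizing Q and K separately lets the matrix multiplication be reassociated so that no n × n attention map is ever formed:

```python
import torch
import torch.nn.functional as F

def standard_sdpa(q, k, v):
    # q, k, v: (batch, n, d). Materializes an (n x n) attention map: O(n^2) cost.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    attn = F.softmax(scores, dim=-1)      # softmax over the query-key products
    return attn @ v

def efficient_attention(q, k, v):
    # Softmax is applied to q and k separately, so the product can be
    # reassociated as q @ (k^T v): a (d x d) intermediate, O(n * d^2) cost.
    q = F.softmax(q, dim=-1)              # normalize each query over the feature dim
    k = F.softmax(k, dim=-2)              # normalize keys over the position dim
    context = k.transpose(-2, -1) @ v     # (batch, d, d) global context
    return q @ context

b, n, d = 2, 4096, 64
q, k, v = (torch.randn(b, n, d) for _ in range(3))
print(standard_sdpa(q, k, v).shape)        # torch.Size([2, 4096, 64])
print(efficient_attention(q, k, v).shape)  # same shape, but no n x n map built
```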
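And a hypothetical sketch of the sliding-window alternative mentioned in the last bullet. The `window` parameter and the naive full-matrix masking are for illustration only; practical implementations score only the banded block, reducing work from n² to roughly n · window pairs while keeping exact token-to-token softmax weights inside the window:

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int = 128):
    # q, k, v: (batch, n, d). Each position attends only to neighbors within
    # `window` steps, so fine-grained relationships are preserved in the band.
    n = q.shape[-2]
    pos = torch.arange(n, device=q.device)
    band = (pos[None, :] - pos[:, None]).abs() <= window  # (n, n) boolean band
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~band, float("-inf"))     # block out-of-window pairs
    return F.softmax(scores, dim=-1) @ v
```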