Algebraically, a given transformer model can operate on sequences of unlimited length. (See the implementations of self-attention and multi-head attention.) In practice, however, there are both a theoretical and a physical limit on sequence length (setting aside physical storage limitations, which are not really the limiting factor for this algorithm).
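
To make the "unlimited length" point concrete, here is a minimal standalone NumPy sketch (not the implementation linked above): the same fixed parameter matrices apply to an input of any length $n$, because nothing in the computation depends on $n$ except the shapes of the intermediate arrays.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention; nothing here constrains the sequence length n."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # each (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n, n) all-by-all comparisons
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over positions
    return weights @ V                               # weighted average of value vectors

d = 16
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# The same fixed parameters work for inputs of any length:
for n in (4, 64, 1024):
    out = self_attention(rng.normal(size=(n, d)), W_q, W_k, W_v)
    print(n, out.shape)   # -> (n, d) regardless of n
```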

Physical limitation: time complexity. Self-attention is a weighted average of feature vectors. The weights form an all-by-all $n \times n$ matrix, which implies a comparison of each position to every other position. In other words, self-attention (at least when computed exactly) has a time complexity of $O(n^2)$ in the sequence length $n$. This is already somewhat limiting for very large $n$, but then recall that we must make $n$ passes through the decoder in order to produce an $n$-length output sequence, so the decoder has a time complexity of $O(n^3)$. The prefactor (the feature dimension $d$) is also no joke.
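
A back-of-the-envelope sketch of that scaling (assuming a single head, a hypothetical feature dimension of $d = 512$, and counting only the two $n \times n$ matrix products): the per-pass cost grows as $n^2$, and a naive full autoregressive decode, one pass per output token, grows as roughly $n^3$.

```python
d = 512  # feature (model) dimension -- the prefactor in the estimates below

def attention_madds(n, d):
    """Rough multiply-add count for one exact single-head self-attention pass over
    n positions: the Q @ K.T score matrix plus the weights @ V weighted average."""
    return n * n * d + n * n * d   # ignores projections, softmax, and constant factors

for n in (1_000, 10_000, 100_000):
    per_pass = attention_madds(n, d)
    # Naive autoregressive decoding: one pass per output token, so ~n passes.
    full_decode = sum(attention_madds(t, d) for t in range(1, n + 1))
    print(f"n = {n:>7,}: one pass ~{per_pass:.1e}, full decode ~{full_decode:.1e} multiply-adds")
```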

Theoretical limitation: entropy. See Entropy of self-attention as a function of sequence length.