Recall that self-attention is computed by first calculating a dot-product similarity score and then applying softmax. The central limit theorem states that, as the number of samples grows, the distribution of the sample mean tends towards a normal distribution whose mean is the population mean and whose variance is the population variance divided by the number of samples. Hence, as the number of elements $n$ in the sequence grows, the mean similarity score concentrates around the population mean $\mu$:

$$\bar{s} = \frac{1}{n}\sum_{j=1}^{n} s_j \;\sim\; \mathcal{N}\!\left(\mu,\, \frac{\sigma^2}{n}\right) \;\xrightarrow{\;n \to \infty\;}\; \mu.$$
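As a quick numerical illustration (a minimal sketch, not from the original text, assuming the raw scores are i.i.d. draws from a fixed population, here a standard normal), the sample mean tightens around the population mean as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0  # assumed population mean/std of the raw scores

for n in [16, 256, 4096, 65536]:
    scores = rng.normal(mu, sigma, size=n)  # one row of similarity scores
    # CLT: the sample mean is ~ N(mu, sigma^2 / n), so its spread shrinks as 1/sqrt(n)
    print(f"n={n:6d}  sample mean={scores.mean():+.4f}  std of mean={sigma / np.sqrt(n):.4f}")
```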
Now, recall that softmax is defined as

$$\mathrm{softmax}(s)_j = \frac{e^{s_j}}{\sum_{k=1}^{n} e^{s_k}}.$$
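In code, this definition looks as follows (a sketch that also applies the standard max-subtraction trick for numerical stability, which cancels in the ratio and leaves the result unchanged):

```python
import numpy as np

def softmax(s: np.ndarray) -> np.ndarray:
    """Softmax over the last axis, matching the definition above."""
    z = s - s.max(axis=-1, keepdims=True)  # stability shift; cancels in the ratio
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```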
We said that, for long sequences, each raw score is a draw from the same underlying distribution and therefore clusters around the shared mean, so that $s_j \approx \mu$ for every position $j$.
Hence, for long sequences,

$$\mathrm{softmax}(s)_j \approx \frac{e^{\mu}}{\sum_{k=1}^{n} e^{\mu}} = \frac{e^{\mu}}{n\, e^{\mu}} = \frac{1}{n}.$$
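A small check of this approximation (a sketch, assuming the scores are tightly clustered, here around an arbitrary mean of 1.5 with spread 0.01): every softmax weight lands very close to $1/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
scores = 1.5 + 0.01 * rng.standard_normal(n)  # s_j ≈ mu for all j

e = np.exp(scores - scores.max())
p = e / e.sum()  # softmax weights

print(p.min(), p.max(), 1 / n)  # min and max weights are both ≈ 1/n
```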
This implies that, for long sequences, the softmax scores will approach a uniform distribution. Now, the definition of Shannon entropy is

$$H(p) = -\sum_{j=1}^{n} p_j \log p_j.$$
So, for a long sequence where $p_j \approx \frac{1}{n}$ for all $j$, the entropy approaches

$$H(p) \approx -\sum_{j=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n.$$
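Putting the pieces together numerically (a sketch under the same clustering assumption as above): the entropy of the attention weights tracks $\log n$ as the sequence length grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [16, 256, 4096, 65536]:
    scores = 1.5 + 0.01 * rng.standard_normal(n)  # tightly clustered raw scores
    e = np.exp(scores - scores.max())
    p = e / e.sum()                                # softmax weights
    H = -(p * np.log(p)).sum()                     # Shannon entropy (in nats)
    print(f"n={n:6d}  H={H:.4f}  log n={np.log(n):.4f}")
```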
For a sequence of length $n$, $\log n$ is the largest entropy any distribution over $n$ positions can have, and it is attained exactly by the uniform distribution. In other words, as the sequence grows, the attention weights approach maximum entropy: attention spreads evenly across positions rather than concentrating on the few that matter.
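For completeness, here is a short standard justification (not from the original text) that $\log n$ is indeed the maximum, using Jensen's inequality on the concave $\log$:

$$H(p) = \sum_{j=1}^{n} p_j \log \frac{1/n}{p_j} + \log n \;\le\; \log\!\left(\sum_{j=1}^{n} p_j \cdot \frac{1/n}{p_j}\right) + \log n = \log 1 + \log n = \log n,$$

with equality exactly when $p_j = \frac{1}{n}$ for all $j$.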