Recall that self-attention is computed by first calculating a dot-product similarity score and then applying softmax. The central limit theorem states that, as the number of samples grows, the distribution of the sample mean tends towards a normal distribution whose mean is the population mean and whose variance is the population variance divided by the number of samples. Hence, as the number of elements $n$ in the sequence grows, the mean similarity score concentrates around the population mean $\mu$:

$$\bar{s} = \frac{1}{n}\sum_{j=1}^{n} s_j \;\sim\; \mathcal{N}\!\left(\mu,\, \frac{\sigma^2}{n}\right) \;\xrightarrow{\;n \to \infty\;}\; \mu.$$
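As a quick numerical illustration (a minimal sketch, not from the original text, assuming the raw scores are i.i.d. draws from a fixed population, here a standard normal), the sample mean tightens around the population mean as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0  # assumed population mean/std of the raw scores

for n in [16, 256, 4096, 65536]:
    scores = rng.normal(mu, sigma, size=n)  # one row of similarity scores
    # CLT: the sample mean is ~ N(mu, sigma^2 / n), so its spread shrinks as 1/sqrt(n)
    print(f"n={n:6d}  sample mean={scores.mean():+.4f}  std of mean={sigma / np.sqrt(n):.4f}")
```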
Now, recall that softmax is defined as

$$\mathrm{softmax}(s)_j = \frac{e^{s_j}}{\sum_{k=1}^{n} e^{s_k}}.$$
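In code, this definition looks as follows (a sketch that also applies the standard max-subtraction trick for numerical stability, which cancels in the ratio and leaves the result unchanged):

```python
import numpy as np

def softmax(s: np.ndarray) -> np.ndarray:
    """Softmax over the last axis, matching the definition above."""
    z = s - s.max(axis=-1, keepdims=True)  # stability shift; cancels in the ratio
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```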
We said that, for long sequences, each raw score is a draw from the same underlying distribution and therefore clusters around the shared mean, so that $s_j \approx \mu$ for every position $j$.
Hence, for long sequences,

$$\mathrm{softmax}(s)_j \approx \frac{e^{\mu}}{\sum_{k=1}^{n} e^{\mu}} = \frac{e^{\mu}}{n\, e^{\mu}} = \frac{1}{n}.$$
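A small check of this approximation (a sketch, assuming the scores are tightly clustered, here around an arbitrary mean of 1.5 with spread 0.01): every softmax weight lands very close to $1/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
scores = 1.5 + 0.01 * rng.standard_normal(n)  # s_j ≈ mu for all j

e = np.exp(scores - scores.max())
p = e / e.sum()  # softmax weights

print(p.min(), p.max(), 1 / n)  # min and max weights are both ≈ 1/n
```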
This implies that, for long sequences, the softmax scores will approach a uniform distribution. Now, the definition of Shannon entropy is

$$H(p) = -\sum_{j=1}^{n} p_j \log p_j.$$
So, for a long sequence where $p_j \approx \frac{1}{n}$ for all $j$, the entropy approaches

$$H(p) \approx -\sum_{j=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n.$$
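Putting the pieces together numerically (a sketch under the same clustering assumption as above): the entropy of the attention weights tracks $\log n$ as the sequence length grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [16, 256, 4096, 65536]:
    scores = 1.5 + 0.01 * rng.standard_normal(n)  # tightly clustered raw scores
    e = np.exp(scores - scores.max())
    p = e / e.sum()                                # softmax weights
    H = -(p * np.log(p)).sum()                     # Shannon entropy (in nats)
    print(f"n={n:6d}  H={H:.4f}  log n={np.log(n):.4f}")
```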
For a sequence of length $n$, $\log n$ is the largest entropy any distribution over $n$ positions can have, and it is attained exactly by the uniform distribution. In other words, as the sequence grows, the attention weights approach maximum entropy: attention spreads evenly across positions rather than concentrating on the few that matter.
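For completeness, here is a short standard justification (not from the original text) that $\log n$ is indeed the maximum, using Jensen's inequality on the concave $\log$:

$$H(p) = \sum_{j=1}^{n} p_j \log \frac{1/n}{p_j} + \log n \;\le\; \log\!\left(\sum_{j=1}^{n} p_j \cdot \frac{1/n}{p_j}\right) + \log n = \log 1 + \log n = \log n,$$

with equality exactly when $p_j = \frac{1}{n}$ for all $j$.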