- A form of sparse attention because positions only attend to certain other positions
- Starting with scaled dot-product attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
- Let $Q = K$, such that the softmax term becomes $\mathrm{softmax}\!\left(\frac{QQ^\top}{\sqrt{d_k}}\right)$
- Observing that the softmax is dominated by its largest terms, uses locality sensitive hashing to cluster the vectors of $Q = K$ by similarity (see the first sketch after this list)
- Split the clustered data into equal-sized chunks
- Compute SDPA over each chunk's own positions and the preceding chunk; subsequent chunks “aren’t there” for the purposes of the local computation (see the second sketch below)
- Merge the results
- Effectively sets the attention weights on future chunks to zero
- Let $L$ be the sequence length; this reduces the cost of the attention layer from $O(L^2)$ to $O(L \log L)$
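
A minimal NumPy sketch of the bucketing step: shared query/key vectors are hashed with random rotations (roughly the angular LSH scheme the paper describes) and then sorted so similar vectors become neighbours, ready to be split into equal-sized chunks. The function name `lsh_buckets`, the bucket count, and the toy data are illustrative assumptions, not code from the Reformer implementation.

```python
import numpy as np

def lsh_buckets(qk, n_buckets, seed=0):
    """Assign each shared query/key vector to an angular-LSH bucket.

    qk: (seq_len, d_model) array of shared query/key vectors.
    Vectors pointing in similar directions tend to land in the same bucket.
    """
    rng = np.random.default_rng(seed)
    # Random projection matrix; n_buckets is assumed to be even.
    r = rng.standard_normal((qk.shape[-1], n_buckets // 2))
    rotated = qk @ r                               # (seq_len, n_buckets // 2)
    # hash(x) = argmax([xR; -xR]): nearby vectors tend to hash alike.
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

# Example: sort positions by bucket so similar vectors sit next to each other,
# ready to be split into equal-sized chunks.
qk = np.random.default_rng(1).standard_normal((16, 8))
buckets = lsh_buckets(qk, n_buckets=4)
order = np.argsort(buckets, kind="stable")
print(buckets[order])                              # bucket ids, now contiguous
```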
This is the most impactful contribution of the Reformer model
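
A second sketch, this time of the chunked local attention: each chunk's queries attend only to keys in its own chunk and the immediately preceding one, and the per-chunk outputs are merged back in order. This is a simplified, hypothetical NumPy version that assumes the bucket-sorted ordering from the previous sketch and omits details such as causal masking within chunks and multi-round hashing.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chunked_attention(qk, v, chunk_size):
    """Attend within each chunk and the chunk before it; later chunks are
    simply excluded, which is equivalent to zero attention weight on all
    future chunks."""
    seq_len, d = qk.shape
    out = np.zeros_like(v)
    for start in range(0, seq_len, chunk_size):
        end = start + chunk_size
        ctx_start = max(0, start - chunk_size)   # own chunk + preceding chunk
        q = qk[start:end]                        # queries from the current chunk
        k = qk[ctx_start:end]                    # keys from current + previous chunk
        scores = q @ k.T / np.sqrt(d)            # scaled dot-product scores
        out[start:end] = softmax(scores) @ v[ctx_start:end]
    return out                                   # results merged back in order

rng = np.random.default_rng(0)
qk, v = rng.standard_normal((16, 8)), rng.standard_normal((16, 8))
print(chunked_attention(qk, v, chunk_size=4).shape)   # (16, 8)
```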