• A form of sparse attention because positions only attend to certain other positions
  • Starting with scaled dot-product attention:
    • Let Q = K (a single shared query/key projection), such that the softmax term becomes softmax(QQᵀ / √d_k)
    • Observing that the softmax output is dominated by the largest dot products, Reformer uses locality-sensitive hashing (LSH) to cluster the shared query/key vectors by similarity
    • Sort positions by hash bucket and split the sorted sequence into equal-sized chunks
    • Compute SDPA within each chunk, attending to keys in its own chunk and the preceding chunk; subsequent chunks “aren’t there” for the purpose of the local computation
    • Merge the results
    • This effectively sets the attention weights to subsequent chunks to zero (sketched in the code after this list)

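A minimal NumPy sketch of the steps above, under simplifying assumptions: single head, no batching, no causal masking, and one round of a random-projection hash standing in for the paper’s multi-round scheme (it also omits the Reformer’s within-chunk bucket masking and self-position exclusion). The function name `lsh_attention` and its parameters are illustrative, not taken from the Reformer codebase.

```python
# Illustrative sketch of chunked LSH attention; not the Reformer reference code.
import numpy as np

def lsh_attention(qk, v, n_buckets=8, chunk_size=16, seed=0):
    """qk: (n, d) shared query/key vectors; v: (n, d_v) value vectors."""
    n, d = qk.shape
    rng = np.random.default_rng(seed)

    # 1. Hash: project onto random directions; the argmax over [xR; -xR]
    #    assigns each position to one of n_buckets buckets.
    proj = rng.normal(size=(d, n_buckets // 2))
    h = qk @ proj
    buckets = np.argmax(np.concatenate([h, -h], axis=-1), axis=-1)

    # 2. Sort positions by bucket so similar vectors become neighbours,
    #    then split the sorted sequence into equal-sized chunks.
    order = np.argsort(buckets, kind="stable")
    qk_s, v_s = qk[order], v[order]
    n_chunks = int(np.ceil(n / chunk_size))

    out_sorted = np.zeros_like(v_s)
    for c in range(n_chunks):
        lo, hi = c * chunk_size, min((c + 1) * chunk_size, n)
        # Queries in this chunk attend only to keys in their own chunk and
        # the preceding one; later chunks are absent from the computation,
        # i.e. their attention weights are implicitly zero.
        klo = max(lo - chunk_size, 0)
        q_c, k_c, v_c = qk_s[lo:hi], qk_s[klo:hi], v_s[klo:hi]
        scores = q_c @ k_c.T / np.sqrt(d)            # scaled dot products
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)           # softmax over local keys
        out_sorted[lo:hi] = w @ v_c

    # 3. Merge: undo the sort so outputs line up with original positions.
    out = np.empty_like(out_sorted)
    out[order] = out_sorted
    return out

# Example: 128 positions, 32-dim shared QK, 32-dim values.
qk = np.random.default_rng(1).normal(size=(128, 32))
v = np.random.default_rng(2).normal(size=(128, 32))
print(lsh_attention(qk, v).shape)  # (128, 32)
```
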
This LSH attention mechanism is the most impactful contribution of the Reformer model