• Examines the bottlenecks causing resource constraints in transformers
  • Identifies two major contributors:
      - The need to store activations for every layer in order to perform backpropagation
      - Dense scaled dot-product attention (SDPA), which is quadratic in both time and space in the sequence length (a minimal sketch follows this list)
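A minimal numpy sketch of dense SDPA, showing where the quadratic cost comes from: the intermediate scores matrix is seq_len × seq_len. Names and shapes here are illustrative, not taken from the paper.

```python
import numpy as np

def dense_sdpa(q, k, v):
    """Dense scaled dot-product attention.

    q, k, v: arrays of shape (seq_len, d_model).
    The intermediate `scores` matrix is (seq_len, seq_len), which is the
    quadratic time/space bottleneck for long sequences.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v                                 # (seq_len, d_model)

# At seq_len = 64k, the scores matrix alone is 64k * 64k floats (~16 GB in fp32).
```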

To address these:
      - Uses a “reversible residual network” to eliminate all but one stored copy of the residuals: earlier activations can be recomputed exactly from later ones instead of being stored (sketched below)
      - Computes the position-wise feed-forward layers in chunks (breaking the input, the weights, or both into submatrices and recombining the results), so the full hidden activation never exists in memory at once (sketched below)
      - Introduces a sparse attention mechanism based on locality-sensitive hashing (sketched below)
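A minimal sketch of the reversible-residual idea (RevNet-style, as used here): each block's inputs can be recovered exactly from its outputs, so intermediate activations need not be stored for backpropagation. F and G below are stand-ins for the attention and feed-forward sublayers, not the paper's actual modules.

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    """Reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    """Recover the block's inputs from its outputs; nothing needs to be stored."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Round-trip check with arbitrary sublayers.
rng = np.random.default_rng(0)
F = lambda x: np.tanh(x)     # stand-in for the attention sublayer
G = lambda x: 0.5 * x        # stand-in for the feed-forward sublayer
x1, x2 = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
y1, y2 = rev_forward(x1, x2, F, G)
rx1, rx2 = rev_inverse(y1, y2, F, G)
assert np.allclose(x1, rx1) and np.allclose(x2, rx2)
```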
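A sketch of the chunking trick for the position-wise feed-forward layer: because the FFN is applied to each position independently, the sequence can be processed in slices, so only a (chunk_size, d_ff) hidden activation exists at any time instead of (seq_len, d_ff). The sizes and the `chunk_size` parameter are illustrative.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    """Standard 2-layer position-wise feed-forward network (ReLU MLP)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def chunked_ffn(x, w1, b1, w2, b2, chunk_size=128):
    """Same result, computed one slice of positions at a time."""
    outs = [ffn(x[i:i + chunk_size], w1, b1, w2, b2)
            for i in range(0, x.shape[0], chunk_size)]
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 1024, 64, 256
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
assert np.allclose(ffn(x, w1, b1, w2, b2), chunked_ffn(x, w1, b1, w2, b2))
```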
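A sketch of the locality-sensitive hashing step behind the sparse attention: angular LSH via a random rotation assigns each query/key vector to a bucket, and attention is then restricted to positions sharing a bucket rather than all seq_len positions. Using a single hash round and a fixed bucket count here is a simplification of the paper's scheme.

```python
import numpy as np

def lsh_buckets(vectors, n_buckets, rng):
    """Angular LSH: project onto random directions and take the argmax over
    the concatenation [proj, -proj], giving one of n_buckets bucket ids."""
    assert n_buckets % 2 == 0
    d = vectors.shape[-1]
    r = rng.normal(size=(d, n_buckets // 2))   # random rotation matrix
    proj = vectors @ r                         # (seq_len, n_buckets // 2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
qk = rng.normal(size=(1024, 64))               # shared query/key vectors
buckets = lsh_buckets(qk, n_buckets=32, rng=rng)
# Attention is then computed only among positions with equal bucket ids
# (after sorting by bucket and chunking), instead of over all 1024 positions.
```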

Collectively refers to these tweaks as a “Reformer.”

Note that the paper repeatedly refers to the feed-forward layers as “deep,” but they are typically 2-layer MLPs or thereabouts; what the paper really means is that they are wide.