• Examines the bottlenecks causing resource constraints in transformers
  • Identifies two major contributors:
      - The need to store activations for every layer in order to perform backpropagation
      - Dense scaled dot-product attention (SDPA), which is quadratic in both time and space in the sequence length (a minimal sketch follows this list)
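A minimal numpy sketch of dense SDPA, showing where the quadratic cost comes from: the intermediate scores matrix is seq_len × seq_len. Names and shapes here are illustrative, not taken from the paper.

```python
import numpy as np

def dense_sdpa(q, k, v):
    """Dense scaled dot-product attention.

    q, k, v: arrays of shape (seq_len, d_model).
    The intermediate `scores` matrix is (seq_len, seq_len), which is the
    quadratic time/space bottleneck for long sequences.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v                                 # (seq_len, d_model)

# At seq_len = 64k, the scores matrix alone is 64k * 64k floats (~16 GB in fp32).
```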

To address these:
      - Uses a “reversible residual network” to eliminate all but one stored copy of the residuals: earlier activations can be recomputed exactly from later ones instead of being stored (sketched below)
      - Computes the position-wise feed-forward layers in chunks (breaking the input, the weights, or both into submatrices and recombining the results), so the full hidden activation never exists in memory at once (sketched below)
      - Introduces a sparse attention mechanism based on locality-sensitive hashing (sketched below)
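A minimal sketch of the reversible-residual idea (RevNet-style, as used here): each block's inputs can be recovered exactly from its outputs, so intermediate activations need not be stored for backpropagation. F and G below are stand-ins for the attention and feed-forward sublayers, not the paper's actual modules.

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    """Reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    """Recover the block's inputs from its outputs; nothing needs to be stored."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Round-trip check with arbitrary sublayers.
rng = np.random.default_rng(0)
F = lambda x: np.tanh(x)     # stand-in for the attention sublayer
G = lambda x: 0.5 * x        # stand-in for the feed-forward sublayer
x1, x2 = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
y1, y2 = rev_forward(x1, x2, F, G)
rx1, rx2 = rev_inverse(y1, y2, F, G)
assert np.allclose(x1, rx1) and np.allclose(x2, rx2)
```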
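A sketch of the chunking trick for the position-wise feed-forward layer: because the FFN is applied to each position independently, the sequence can be processed in slices, so only a (chunk_size, d_ff) hidden activation exists at any time instead of (seq_len, d_ff). The sizes and the `chunk_size` parameter are illustrative.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    """Standard 2-layer position-wise feed-forward network (ReLU MLP)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def chunked_ffn(x, w1, b1, w2, b2, chunk_size=128):
    """Same result, computed one slice of positions at a time."""
    outs = [ffn(x[i:i + chunk_size], w1, b1, w2, b2)
            for i in range(0, x.shape[0], chunk_size)]
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 1024, 64, 256
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
assert np.allclose(ffn(x, w1, b1, w2, b2), chunked_ffn(x, w1, b1, w2, b2))
```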
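A sketch of the locality-sensitive hashing step behind the sparse attention: angular LSH via a random rotation assigns each query/key vector to a bucket, and attention is then restricted to positions sharing a bucket rather than all seq_len positions. Using a single hash round and a fixed bucket count here is a simplification of the paper's scheme.

```python
import numpy as np

def lsh_buckets(vectors, n_buckets, rng):
    """Angular LSH: project onto random directions and take the argmax over
    the concatenation [proj, -proj], giving one of n_buckets bucket ids."""
    assert n_buckets % 2 == 0
    d = vectors.shape[-1]
    r = rng.normal(size=(d, n_buckets // 2))   # random rotation matrix
    proj = vectors @ r                         # (seq_len, n_buckets // 2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
qk = rng.normal(size=(1024, 64))               # shared query/key vectors
buckets = lsh_buckets(qk, n_buckets=32, rng=rng)
# Attention is then computed only among positions with equal bucket ids
# (after sorting by bucket and chunking), instead of over all 1024 positions.
```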

Collectively refers to these tweaks as a “Reformer.”

Note that the paper repeatedly refers to the feed-forward layers as “deep,” but they are typically 2-layer MLPs or thereabouts; what the paper really means is that they are wide.