- Examines the bottlenecks causing resource constraints in transformers
- Identifies two major contributors:
  - The need to store activations for each layer in order to perform backpropagation
  - Dense SDPA (scaled dot-product attention), which is quadratic in sequence length in both time and memory (see the sketch below)
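As a rough illustration of the quadratic term, a minimal NumPy sketch of dense scaled dot-product attention (names and shapes are illustrative, not from the paper): the intermediate `scores` matrix is seq_len × seq_len, so its cost grows quadratically with sequence length.

```python
import numpy as np

def dense_sdpa(q, k, v):
    """Dense scaled dot-product attention.

    q, k, v: (seq_len, d) arrays. The intermediate `scores` matrix is
    (seq_len, seq_len) -- the quadratic time/memory term.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # (seq_len, d)

seq_len, d = 4096, 64
q = np.random.randn(seq_len, d).astype(np.float32)
out = dense_sdpa(q, q, q)
# The scores matrix alone is 4096 * 4096 * 4 bytes ~= 64 MB, per head per layer.
print(out.shape)
```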
To address these, the paper:
- Uses a "reversible residual network" to eliminate all but one stored copy of the activations, since each layer's inputs can be recomputed from its outputs (sketched after this list)
- Computes the position-wise feed-forward values by chunking (breaking the input, the weights, or both into submatrices and recombining the results); a chunked version is sketched after the feed-forward note below
- Introduces a sparse attention mechanism based on locality-sensitive hashing (LSH), sketched further below
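A minimal sketch of the reversible-residual idea, with toy sublayers F and G standing in for attention and feed-forward (this is the generic two-stream reversible form, not the paper's exact layer code): the inputs can be reconstructed from the outputs, so intermediate activations need not be kept around for backprop.

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    """Reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    """Recover the inputs from the outputs, so x1, x2 need not be stored."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Toy sublayers standing in for attention / feed-forward.
rng = np.random.default_rng(0)
W_f, W_g = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
F = lambda x: np.tanh(x @ W_f)
G = lambda x: np.tanh(x @ W_g)

x1, x2 = rng.standard_normal((16, 8)), rng.standard_normal((16, 8))
y1, y2 = rev_forward(x1, x2, F, G)
r1, r2 = rev_inverse(y1, y2, F, G)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```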
Collectively refers to these tweaks as a “Reformer”
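A rough sketch of the hashing step only, using simple random-hyperplane (angular) LSH; the paper's full scheme (random rotations, multiple hash rounds, sorting into chunks) is more involved. The point is that attention is then restricted to positions sharing a bucket, avoiding the full seq_len × seq_len score matrix.

```python
import numpy as np

def lsh_buckets(x, n_hyperplanes=4, seed=0):
    """Assign each vector a bucket via random-hyperplane (angular) LSH.

    Vectors pointing in similar directions tend to land on the same side
    of each random hyperplane, hence in the same bucket.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((x.shape[-1], n_hyperplanes))
    bits = (x @ planes) > 0                        # (seq_len, n_hyperplanes)
    return bits @ (1 << np.arange(n_hyperplanes))  # bucket id per position

seq_len, d = 1024, 64
x = np.random.randn(seq_len, d)
buckets = lsh_buckets(x)
# Attention would then only be computed among positions sharing a bucket,
# e.g. the positions in bucket 3:
members = np.flatnonzero(buckets == 3)
print(len(np.unique(buckets)), "buckets;", len(members), "positions in bucket 3")
```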
Note that the paper repeatedly refers to the feed-forward layers as "deep," but they are typically 2-layer MLPs or thereabouts; what it is really talking about is width, i.e. a large hidden dimension, which is why chunking them helps (sketched below).
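A minimal sketch of chunked feed-forward computation, with illustrative names and sizes: because the 2-layer MLP is applied independently at each position, the sequence can be processed in chunks so that only one chunk's wide hidden activation is materialized at a time.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise 2-layer MLP: wide hidden layer, applied per position."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def chunked_feed_forward(x, W1, b1, W2, b2, chunk=128):
    """Same result, but only `chunk` rows of the wide hidden activation
    exist at any one time, trading a little overhead for memory."""
    outs = [feed_forward(x[i:i + chunk], W1, b1, W2, b2)
            for i in range(0, x.shape[0], chunk)]
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 1024, 64, 256   # d_ff is the "wide" dimension
x = rng.standard_normal((seq_len, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
assert np.allclose(feed_forward(x, W1, b1, W2, b2),
                   chunked_feed_forward(x, W1, b1, W2, b2))
```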