EL-Attention reduces memory usage during Transformer inference by caching only the past hidden states, rather than the per-head keys and values derived from them.
By re-associating the matrix multiplications in attention, it produces results identical to standard multi-head attention while requiring significantly less cache memory.
Because a single hidden-state tensor is cached in place of separate key and value tensors, the per-layer cache is roughly halved, which is especially valuable for memory-bound autoregressive decoding in decoder-only models.
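The re-association above can be sketched numerically. The idea is that the scores `q @ (X @ Wk).T` equal `(q @ Wk.T) @ X.T` by associativity, so the key projection can be folded into the query, and likewise the value projection can be applied after the probability-weighted sum over the cached hidden states. This is a minimal single-head, single-query sketch with made-up weight names (`Wk`, `Wv`), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5                        # hidden size, cached sequence length
X = rng.standard_normal((n, d))    # past hidden states (the only cache)
Wk = rng.standard_normal((d, d))   # key projection (one head for brevity)
Wv = rng.standard_normal((d, d))   # value projection
q = rng.standard_normal((1, d))    # current-step query (already projected)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Standard attention: materialize and cache K = X @ Wk and V = X @ Wv.
K, V = X @ Wk, X @ Wv
std_out = softmax(q @ K.T / np.sqrt(d)) @ V

# EL-style re-association: fold Wk into the query, defer Wv until after
# the weighted sum, so only the hidden states X need to be kept around.
p = softmax((q @ Wk.T) @ X.T / np.sqrt(d))
el_out = (p @ X) @ Wv

assert np.allclose(std_out, el_out)  # identical results, smaller cache
```

The trade is extra per-step computation (the projections are re-applied at every decoding step) in exchange for caching one tensor instead of two.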