efficientml • 1 HN point • 30 Apr 23
- EL-Attention reduces memory usage during Transformer inference by caching only the past hidden states instead of the per-head keys and values.
- By re-ordering the matrix multiplications, EL-Attention produces exactly the same output as standard attention while keeping a much smaller cache (as sketched below).
- This makes attention during autoregressive generation more memory-efficient, especially for decoder-only models, roughly halving the amount of cache memory needed.
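
A minimal single-head NumPy sketch of the reordering idea (not the paper's implementation; the toy sizes, variable names, and `softmax` helper are illustrative assumptions): fold the key projection into the query and apply the value projection after the weighted sum, so only the hidden states `H` need to be cached.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, t = 16, 8, 5            # toy sizes (hypothetical)

W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_head))

H = rng.standard_normal((t, d_model))    # past hidden states (the only cache kept)
q = rng.standard_normal((1, d_model))    # current decoding step

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Standard attention: materializes and caches per-head K and V.
K, V = H @ W_k, H @ W_v
std_out = softmax((q @ W_q) @ K.T / np.sqrt(d_head)) @ V

# EL-style reordering: (q W_q)(H W_k)^T == (q W_q W_k^T) H^T, and
# probs (H W_v) == (probs H) W_v, so K and V never need to be stored.
q_el = (q @ W_q) @ W_k.T                 # query expanded back to model space
el_out = (softmax(q_el @ H.T / np.sqrt(d_head)) @ H) @ W_v

assert np.allclose(std_out, el_out)      # identical result, no K/V cache
```

In full multi-head attention the saving comes from caching a single `d_model`-wide tensor (`H`) shared across heads instead of separate per-head keys and values, which together are twice that size.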