The hottest Substack posts of efficientml

And their main takeaways
1 HN point 30 Apr 23
  1. EL-Attention is a method that reduces memory usage during Transformer inference by caching only past hidden states instead of per-head keys and values.
  2. By exploiting the associativity of matrix multiplication to re-order the projection steps, EL-Attention produces exactly the same outputs as standard multi-head attention with significantly less memory (see the sketch after this list).
  3. EL-Attention provides an efficient way to handle attention in Transformer models, especially decoder-only ones, roughly halving the cache memory needed during generation.
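
A minimal single-head NumPy sketch of the re-ordering in takeaway 2, assuming an unnormalized toy setup; the variable names (W_q, W_k, W_v, H) and the single-head shapes are illustrative simplifications, not the paper's exact multi-head formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # model/head dimension (single head for clarity; hypothetical size)
t = 5  # number of cached past positions

# Illustrative random projection weights, past hidden states, and query input
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
H = rng.standard_normal((t, d))   # past hidden states
q = rng.standard_normal((1, d))   # current decoding step's input

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Standard attention: must cache two tensors, K = H @ W_k and V = H @ W_v.
K, V = H @ W_k, H @ W_v
out_std = softmax((q @ W_q) @ K.T / np.sqrt(d)) @ V

# EL-style attention: cache only H. Fold W_k into the query up front
# (q @ W_q @ W_k.T) and apply W_v after the attention-weighted sum;
# by associativity this computes the identical result.
q_el = (q @ W_q) @ W_k.T
out_el = (softmax(q_el @ H.T / np.sqrt(d)) @ H) @ W_v

assert np.allclose(out_std, out_el)  # same output, one cached tensor instead of two
```

Because only H is cached rather than separate K and V tensors, the per-layer cache shrinks from two tensors to one, which is where the halving in takeaway 3 comes from.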