The hottest Substack posts of efficientml

And their main takeaways
1 HN point 30 Apr 23
  1. EL-Attention is a method that reduces memory usage during Transformer inference by caching only past hidden states instead of per-head keys and values.
  2. By exploiting the associativity of matrix multiplication to re-order the projection steps, EL-Attention produces exactly the same outputs as standard multi-head attention with significantly less memory (see the sketch after this list).
  3. EL-Attention provides an efficient way to handle attention in Transformer models, especially decoder-only ones, roughly halving the cache memory needed during generation.
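
A minimal single-head NumPy sketch of the re-ordering in takeaway 2, assuming an unnormalized toy setup; the variable names (W_q, W_k, W_v, H) and the single-head shapes are illustrative simplifications, not the paper's exact multi-head formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # model/head dimension (single head for clarity; hypothetical size)
t = 5  # number of cached past positions

# Illustrative random projection weights, past hidden states, and query input
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
H = rng.standard_normal((t, d))   # past hidden states
q = rng.standard_normal((1, d))   # current decoding step's input

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Standard attention: must cache two tensors, K = H @ W_k and V = H @ W_v.
K, V = H @ W_k, H @ W_v
out_std = softmax((q @ W_q) @ K.T / np.sqrt(d)) @ V

# EL-style attention: cache only H. Fold W_k into the query up front
# (q @ W_q @ W_k.T) and apply W_v after the attention-weighted sum;
# by associativity this computes the identical result.
q_el = (q @ W_q) @ W_k.T
out_el = (softmax(q_el @ H.T / np.sqrt(d)) @ H) @ W_v

assert np.allclose(out_std, out_el)  # same output, one cached tensor instead of two
```

Because only H is cached rather than separate K and V tensors, the per-layer cache shrinks from two tensors to one, which is where the halving in takeaway 3 comes from.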