The hottest Systems Substack posts right now

And their main takeaways

Sparsity means many weights or activations are zero so you can skip their multiplications, but random/unstructured zeros usually don’t make GPUs faster because irregular memory access and load imbalance kill performance.
Hardware-friendly patterns like 2:4 sparsity (at most two nonzeros in every group of four consecutive weights) and block sparsity let accelerators actually speed up computation, while pruning and ReLU-driven activation sparsity often need structure or predictive gating to become efficient.
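
To make the 2:4 pattern concrete, here is a minimal NumPy sketch of magnitude-based 2:4 pruning. The function name `prune_2_4` and the choice to drop the two smallest-magnitude weights per group are illustrative assumptions, not a method taken from the posts. It only produces the zero pattern; any real speedup depends on hardware and kernels that exploit it.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in each group of four (last axis).

    Hypothetical helper for illustration: it builds the 2:4 pattern only and
    does not make any computation faster by itself.
    """
    if weights.shape[-1] % 4 != 0:
        raise ValueError("last dimension must be a multiple of 4")
    groups = weights.reshape(-1, 4)
    # Column indices of the two smallest |w| within each group of four.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

w = np.array([[0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]])
sparse_w = prune_2_4(w)
# Number of surviving weights in each group of four.
nonzeros_per_group = np.count_nonzero(sparse_w.reshape(-1, 4), axis=1)
```

Because every group of four keeps exactly two values, the layout is regular enough for an accelerator to store compactly and schedule evenly, which is what random zeros lack.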
Conditional computation (Mixture of Experts) is arguably the most practical form of sparsity: a router runs only a few experts per input, so the model can hold far more parameters than it computes with on any one input, and the approach has strong empirical results.
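
The routing idea can be sketched in a few lines of NumPy. Everything here is an assumption made for illustration: the name `moe_forward`, the single linear map per expert, and the softmax over only the selected experts. Production MoE layers typically use feed-forward experts and add load balancing and capacity limits, none of which appear below.

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, k=2):
    """Top-k Mixture-of-Experts forward pass (illustrative sketch).

    x:         (tokens, d) inputs
    gate_w:    (d, n_experts) router weights
    expert_ws: (n_experts, d, d), one linear map per expert (an assumption)
    """
    logits = x @ gate_w                                   # (tokens, n_experts)
    topk = np.argsort(logits, axis=1)[:, -k:]             # k highest-scoring experts
    top_logits = np.take_along_axis(logits, topk, axis=1)
    # Softmax over the selected experts only, shifted for numerical stability.
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)
    gates = np.exp(top_logits)
    gates /= gates.sum(axis=1, keepdims=True)

    out = np.zeros_like(x)
    for e in range(expert_ws.shape[0]):
        token_idx, slot = np.nonzero(topk == e)           # tokens routed to expert e
        if token_idx.size == 0:
            continue                                      # expert does no work here
        weight = gates[token_idx, slot][:, None]
        out[token_idx] += weight * (x[token_idx] @ expert_ws[e])
    return out, topk

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
gate_w = rng.normal(size=(8, 4))
expert_ws = rng.normal(size=(4, 8, 8))
out, topk = moe_forward(x, gate_w, expert_ws, k=2)
```

With 4 experts and `k=2`, each token touches only half of the expert parameters, so adding more experts grows capacity without growing per-token compute.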