The hottest Memory Systems Substack posts right now

NVIDIA has folded Groq’s engineering and chip technology into its product line and is shipping the Groq LP30 inside LPX nodes to accelerate inference decode workloads.
The LP30 offers about 1.2 PFLOPS of FP8 compute and ~500 MB of SRAM per chip, so an 8-chip LPX unit holds 4 GB and a full system scales to 256 chips / 128 GB, prioritizing huge SRAM bandwidth for high-throughput decoding.
NVIDIA will use its Dynamo orchestration to split work across Rubin, Rubin CPX and Groq LPX hardware (customers can mix up to ~25% Groq) so prefill and decode are handled by the best-suited chips to boost tokens-per-second for premium use cases.
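The SRAM capacities quoted above follow directly from the per-chip figure. A quick check, assuming decimal units (1 GB = 1000 MB) and treating the ~500 MB per chip as exact; the function name is made up for illustration:

```python
# Per-chip SRAM as quoted in the summary; decimal units are an assumption.
SRAM_PER_CHIP_MB = 500


def sram_capacity_gb(num_chips: int, per_chip_mb: int = SRAM_PER_CHIP_MB) -> float:
    """Total on-chip SRAM in GB for a group of chips."""
    return num_chips * per_chip_mb / 1000


lpx_unit_gb = sram_capacity_gb(8)       # one 8-chip LPX unit
full_system_gb = sram_capacity_gb(256)  # largest configuration mentioned
print(lpx_unit_gb, full_system_gb)      # prints: 4.0 128.0
```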

They use a dataflow architecture that runs the compiler's intermediate graph directly instead of a traditional instruction stream, so pipelines stay full and ALUs can execute whole loops every cycle for much higher effective throughput.
Memory is handled by many small, localized MMU-like units plus runtime telemetry that adapts allocations to reduce false sharing, enabling an order of magnitude more outstanding memory requests and very high HBM utilization even on irregular workloads like GUPS.
Their go-to-market and tooling are HPC-first while supporting common parallel models (OpenMP, CUDA, Kokkos) with a "bring your own code" approach, hardware-accelerated low-overhead kernel reconfiguration, and chiplet/RDMA-style scaling, with AI-specialized designs planned later.
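For readers unfamiliar with GUPS (giga-updates per second), the access pattern it measures can be sketched as XOR updates at pseudo-random table locations. This is a toy illustration only: it uses Python's `random` module rather than the benchmark's own number sequence, and the function name and table size are assumptions.

```python
import random


def gups_updates(table, num_updates, seed=0):
    """Apply XOR updates at pseudo-random indices (a GUPS-style access pattern)."""
    rng = random.Random(seed)
    mask = len(table) - 1  # assumes the table length is a power of two
    for _ in range(num_updates):
        value = rng.getrandbits(64)
        # Each update touches an unpredictable slot, which defeats caches and prefetchers.
        table[value & mask] ^= value
    return table


table = list(range(1 << 10))
gups_updates(table, 4096, seed=1)
gups_updates(table, 4096, seed=1)  # replaying the same sequence cancels every XOR
restored = table == list(range(1 << 10))
```

Because every update lands on an effectively random address, throughput depends on how many memory requests the hardware can keep in flight, which is the property the summary highlights.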

The current LLM trend is to scale models to enormous sizes and use sparsity techniques like Mixture-of-Experts, so that only a small part of the model activates per token, reducing FLOPs.
Reusing an old technique (storing large, static, lookup-like memories in CPU RAM and accessing them conditionally) can let models hold around 100B parameters off-GPU and avoid expensive dense computation.
The key insight is that many LLM costs come from simulating static lookup tables with neural computation, so replacing that simulation with real conditional lookups makes models much more efficient.
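A minimal sketch of the conditional-lookup idea, assuming a hashed n-gram key and a table that lives in ordinary host memory. The class name, the hashing scheme, and the fallback convention are all illustrative assumptions, not the design described in the post:

```python
import hashlib


class LookupMemory:
    """Static table held in host (CPU) memory, queried by hashed token n-grams."""

    def __init__(self, num_slots, dim):
        self.num_slots = num_slots
        self.dim = dim
        self.slots = {}  # sparse storage: only populated slots take memory

    def _slot(self, ngram):
        digest = hashlib.sha256(" ".join(ngram).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_slots

    def write(self, ngram, vector):
        if len(vector) != self.dim:
            raise ValueError("vector has the wrong dimension")
        # Keep the key next to the value so a hash collision is not read as a hit.
        self.slots[self._slot(ngram)] = (tuple(ngram), list(vector))

    def read(self, ngram):
        """Return the stored vector, or None so the caller falls back to dense compute."""
        entry = self.slots.get(self._slot(ngram))
        if entry is None or entry[0] != tuple(ngram):
            return None
        return entry[1]


memory = LookupMemory(num_slots=1 << 20, dim=3)
memory.write(("New", "York"), [0.1, 0.2, 0.3])
hit = memory.read(("New", "York"))   # cheap lookup, no matrix multiply
miss = memory.read(("Old", "York"))  # None: handle with the regular network
```

The point of the sketch is the cost profile: a hit is one hash and one memory access, regardless of how many parameters the table holds.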

Free-recall questions are more effective for learning than multiple-choice questions.
Automating basic facts through rote memorization can reduce the load on working memory and aid in understanding complex ideas.
Spaced repetition systems can help with understanding and retaining knowledge in a wide range of fields.
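One simple way to schedule spaced repetition is a Leitner-style box system, sketched below. The interval values and names are illustrative assumptions; the post does not specify a particular algorithm.

```python
from dataclasses import dataclass

# Review intervals in days for each box; illustrative values, not from the post.
INTERVALS = [1, 2, 4, 8, 16]


@dataclass
class Card:
    prompt: str
    box: int = 0
    due_day: int = 0


def review(card, recalled, today):
    """Promote the card after a successful free recall; reset it to box 0 on failure."""
    if recalled:
        card.box = min(card.box + 1, len(INTERVALS) - 1)
    else:
        card.box = 0
    card.due_day = today + INTERVALS[card.box]
    return card


card = Card("What does GUPS measure?")
review(card, recalled=True, today=0)   # moves to box 1, due on day 2
review(card, recalled=True, today=2)   # moves to box 2, due on day 6
review(card, recalled=False, today=6)  # back to box 0, due on day 7
```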

Memory is organized as a graph not to store everything, but so edges can decay and useless paths are forgotten; forgetting is an intentional feature, not a bug.
What gets remembered depends on the agent’s goals, so memory must be filtered by a utility function before or during encoding; a single universal context that keeps everything will produce noise, not useful memory.
Current AI systems are mostly search/archives, not true memory; real memory needs valuation-driven, lossy compression (e.g., reinforcing repetition or preserving surprise) to avoid overfitting and enable useful prediction.
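The ideas above (utility-filtered encoding, decaying edges, forgetting as a feature) can be sketched as a small weighted graph. The class, the multiplicative decay rule, and the threshold are assumptions chosen for illustration, not the post's design:

```python
class DecayingGraphMemory:
    """Graph memory where edges weaken over time and weak edges are forgotten."""

    def __init__(self, decay=0.5, threshold=0.1):
        self.decay = decay          # multiplicative decay applied on every tick
        self.threshold = threshold  # edges below this weight are dropped
        self.edges = {}             # (src, dst) -> weight

    def encode(self, src, dst, utility):
        """Store or reinforce an edge only if it has positive utility for the agent's goal."""
        if utility <= 0:
            return  # filtered out at encoding time, never stored
        key = (src, dst)
        self.edges[key] = self.edges.get(key, 0.0) + utility

    def tick(self):
        """Decay every edge and forget the ones that fall below the threshold."""
        decayed = {key: w * self.decay for key, w in self.edges.items()}
        self.edges = {key: w for key, w in decayed.items() if w >= self.threshold}


memory = DecayingGraphMemory()
memory.encode("coffee", "deadline", utility=1.0)
memory.encode("weather", "deadline", utility=0.15)
memory.encode("noise", "deadline", utility=0.0)  # never stored
memory.tick()                                     # weak "weather" edge is forgotten
after_first_tick = dict(memory.edges)
memory.encode("coffee", "deadline", utility=1.0)  # repetition reinforces the edge
memory.tick()
```

Only the repeatedly reinforced edge survives, which is the lossy, valuation-driven behavior the summary contrasts with keep-everything archives.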
