The hottest Model Scaling Substack posts right now

A compact, curated reading list of landmark papers can teach roughly 90% of the core ideas and techniques in deep learning, offering a fast path to real understanding.
The essential topics span sequence models (RNNs/LSTMs/NTM), attention and transformers, convolutional vision models, theory of complexity and description length, training methods and scaling, and multimodal/speech work.
The publicly available partial list misses several important areas — notably reinforcement learning and meta-learning — so it should be supplemented with RL classics and recent advances like scaling laws, compute‑optimal training, mixture‑of‑experts, distillation, and key optimization tricks.

The claim that AI progress is slowing down may be premature. We may not yet have explored every way to improve AI through model scaling.
Industry experts often change their predictions about AI, showing that they might not know as much as we assume. Their interests can influence their views, so take their forecasts with a grain of salt.
While new methods like inference scaling can boost AI capabilities quickly, the actual impact on real-world applications may take time due to product development lags and varying reliability.

DeepSeek's mHC challenges established assumptions about AI scaling and suggests new architectural ideas that could change how larger models are built and trained.
Residual connections are the unsung scaffolding of modern deep networks, providing a 'gradient highway' that keeps training stable across many layers.
The simple rule y = f(x) + x—adding the input back to a layer's output—was revolutionary because it preserves signals and gradients, making very deep networks trainable.
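
To make the 'gradient highway' idea concrete, here is a minimal sketch in plain Python. It is not taken from the post or from DeepSeek's work: the toy layer f(x) = weight * x and the function names are illustrative assumptions. For such a layer, a plain stack multiplies the gradient by weight at every layer, while a residual stack multiplies it by (weight + 1).

```python
def residual_block(f, x):
    """Apply the residual rule y = f(x) + x to a scalar input."""
    return f(x) + x


def plain_gradient(weight, depth):
    """d(output)/d(input) for `depth` stacked toy layers y = weight * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight  # each plain layer scales the gradient by `weight`
    return grad


def residual_gradient(weight, depth):
    """Same stack, but each layer is y = weight * x + x."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight + 1.0  # the skip path contributes the "+ 1"
    return grad


if __name__ == "__main__":
    weight, depth = 0.01, 50  # hypothetical values chosen for illustration
    print("plain stack gradient:   ", plain_gradient(weight, depth))
    print("residual stack gradient:", residual_gradient(weight, depth))
```

With a small weight, the plain gradient shrinks toward zero as depth grows, while the residual gradient stays close to 1. Real networks use nonlinear layers and matrices rather than a single scalar, so this only illustrates the mechanism.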

Artificial General Intelligence (AGI) might be possible by 2030 if we keep improving our computing power and models.
However, there are worries that after 2030, we could hit limits with our technology that will require us to find new ways to innovate.
We might need better algorithms and improved designs because just making computers bigger and faster won't be enough forever.

The Transformer model revolutionized Large Language Models (LLMs) with its parallel and scalable architecture.
Pre-training and fine-tuning, as seen in GPT-1 and BERT, significantly improved model performance for various tasks.
Bigger models, more data, and more compute have been shown to improve LLM performance, but the relationship between model size, training tokens, and performance is more complex than initially thought.
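
One way to see why model size and training tokens interact is a back-of-the-envelope sketch. It rests on two assumptions that are not stated in the post: the common approximation that training cost is about 6 * N * D FLOPs for N parameters and D tokens, and a rough heuristic of about 20 training tokens per parameter that is often associated with compute-optimal training. Both constants are approximate and the function names are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: training FLOPs C is roughly 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rough compute-optimal heuristic


def training_flops(params, tokens):
    """Approximate training cost in FLOPs for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal_split(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (params, tokens) at a fixed tokens/param ratio.

    With D = r * N, the budget C = 6 * N * D becomes C = 6 * r * N**2,
    so N = sqrt(C / (6 * r)).
    """
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    budget = 1e21  # hypothetical FLOP budget
    n, d = compute_optimal_split(budget)
    print(f"params ~ {n:.3e}, tokens ~ {d:.3e}")
```

Under these assumptions, parameters and tokens both grow with the square root of the compute budget, so scaling only one of them leaves performance on the table.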
