The hottest Deep Learning Substack posts right now

And their main takeaways
Marcus on AI • 10552 implied HN points • 14 Mar 26
  1. Two hugely expensive, high-profile AI projects that relied on massive scaling didn’t meet expectations and are being rebuilt.
  2. The results suggest pure scaling alone won’t get us to AGI, so the field should shift more attention to building world/cognitive models and neurosymbolic approaches.
  3. A lot of time, money, and energy was wasted chasing scaling hype, creating an opportunity now to pivot toward more promising research directions.
Marcus on AI • 8299 implied HN points • 22 Jan 26
  1. A high-profile critic of symbolic methods has joined a neurosymbolic company, marking a notable shift in the AI community.
  2. Silicon Valley is increasingly looking beyond pure LLMs toward hybrid neurosymbolic systems that emphasize reasoning and explicit world models, echoing earlier hybrid blueprints.
  3. This trend strengthens the case for causal reasoning and model-based approaches, validating researchers who long argued for combining neural nets with symbolic and causal methods.
TheSequence • 259 implied HN points • 17 Mar 26
  1. Marble shifts the focus from predicting video frames and generating pixels to building spatial intelligence.
  2. It’s a Large World Model that reconstructs, generates, and simulates persistent 3D environments for richer, longer-lived scene understanding.
  3. The core idea is lifting 2D inputs into a 4D representation (adding depth and time) so the model can build and reason about persistent 3D worlds over time.
TheSequence • 189 implied HN points • 18 Mar 26
  1. AI research is often bottlenecked by humans having to run, wait for, and evaluate experiments, which keeps the research loop slow.
  2. AutoResearch is an agentic setup that autonomously forms hypotheses, edits code, launches training runs, and evaluates results so experiments can run without constant human intervention.
  3. Letting machines handle the experiment loop lets research proceed at machine speed, greatly speeding up progress and reducing the need for slow, synchronous human coordination.
The Kaitchup – AI on a Budget • 119 implied HN points • 18 Oct 24
  1. There's a new fix for gradient accumulation in training language models. This issue had been causing problems in how models were trained, but it's now addressed by Unsloth and Hugging Face.
  2. Several new language models have been released recently, including Llama 3.1 Nemotron 70B and Zamba2 7B. These models are showing different levels of performance across various benchmarks.
  3. Consumer GPUs are being tracked for price drops, making them a more affordable option for fine-tuning models. This week highlights several models for those interested in AI training.
TheSequence • 133 implied HN points • 10 Mar 26
  1. World models are shifting from predicting 2D video pixels to reconstructing 3D geometry over time (4D), which lets systems model dynamic scenes more realistically.
  2. Spatial intelligence means AI can perceive volume, infer occluded parts, and predict temporal trajectories with mathematical precision.
  3. DeepMind's D4RT is a notable breakthrough that stitches fragmented observations into a unified 4D world model, improving how machines understand and predict changing environments.
TheSequence • 203 implied HN points • 04 Mar 26
  1. The Qwen 3.5 family spans from a 397B flagship to efficient 35B mediums and tiny 0.8–9B models designed to run on devices, covering the whole deployment stack. They’re clearly built to support everything from large-server workloads down to smartphones.
  2. This release marks a structural shift away from pure dense transformers: it reimagines attention, embraces extreme Mixture-of-Experts sparsity, and brings native multimodality even to small models. Those architectural changes are central to its engineering gains.
  3. Benchmarks show the flagship models trading blows with top proprietary systems like GPT-5.2 and Claude Opus 4.5, meaning open-weight models are closing the performance gap. Together with the new architectures and size range, this suggests more cost-effective scaling and wider deployment options.
TheSequence • 217 implied HN points • 03 Mar 26
  1. Passive video generation can make beautiful, consistent worlds but can’t be steered; true world models must understand agency and not just what happens.
  2. DeepMind’s Genie is one of the most advanced world models and represents a move toward interactive, controllable virtual environments.
  3. A key bottleneck is data: we don’t have enough controller/action data showing causes and effects to train truly actionable world models.
TheSequence • 252 implied HN points • 24 Feb 26
  1. Video generation models are now functioning as physics engines that can learn and predict object dynamics and interactions from data.
  2. OpenAI's Sora marked a turning point by framing video models as world simulators, shifting the focus from generating pixels to building data-driven models of physical reality.
  3. This shift is enabled by architectures like diffusion transformers, which combine diffusion processes with transformer models to capture complex spatiotemporal dynamics.
Gonzo ML • 252 implied HN points • 08 Feb 26
  1. A compact, curated reading list of landmark papers can teach roughly 90% of the core ideas and techniques in deep learning, offering a fast path to real understanding.
  2. The essential topics span sequence models (RNNs/LSTMs/NTM), attention and transformers, convolutional vision models, theory of complexity and description length, training methods and scaling, and multimodal/speech work.
  3. The publicly available partial list misses several important areas — notably reinforcement learning and meta-learning — so it should be supplemented with RL classics and recent advances like scaling laws, compute‑optimal training, mixture‑of‑experts, distillation, and key optimization tricks.
TheSequence • 112 implied HN points • 27 Feb 26
  1. RLHF has hit a conceptual ceiling: it produces fast, pattern‑matching “System 1” models that struggle to pause and do deep, deliberative reasoning.
  2. Relying on human raters is a bottleneck because preferences are noisy, slow, expensive, and can reject novel but correct outputs, so RLHF only scales as fast as humans can work.
  3. Reinforcement Learning with Verifiable Rewards (RLVR) replaces noisy human feedback with objective, checkable rewards so models can verify their own outputs and scale training toward more autonomous, System 2‑style reasoning.
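As a rough illustration of what an "objective, checkable reward" could look like, the sketch below scores a model's text output against a known numeric answer. The function name `verifiable_reward` and the rule "take the last number in the output" are assumptions made for this example, not something the post specifies.

```python
import re

def verifiable_reward(model_output, expected):
    """Return 1.0 if the last number in the output equals the expected value, else 0.0.

    Hypothetical checker: real RLVR setups may run unit tests or proof checkers instead.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0  # nothing checkable was produced
    return 1.0 if abs(float(numbers[-1]) - expected) < 1e-9 else 0.0

print(verifiable_reward("17 + 25 = 42", 42))
print(verifiable_reward("The answer is 41.", 42))
```

The reward needs no human rater: it depends only on whether the output passes a mechanical check.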
Marcus on AI • 12133 implied HN points • 28 Jan 25
  1. DeepSeek is not smarter than older models. It just costs less to train, which doesn't mean it's better overall.
  2. It still has issues with reliability and can be expensive to run if you want it to 'think' for longer.
  3. DeepSeek may change the AI market and pose challenges for companies like OpenAI, but it doesn't bring us closer to achieving artificial general intelligence (AGI).
TheSequence • 245 implied HN points • 04 Feb 26
  1. Kimi 2.5 represents a paradigm shift from scale-driven "emergence" to orchestration, where the model coordinates complex workflows instead of just generating text.
  2. It functions as an end-to-end agent that manages execution environments, spawns subprocesses, and debugs its own visual outputs in a closed-loop system.
  3. The system uses sparsity to deliver trillion-parameter capability with a latency and cost profile similar to that of a ~32B dense model.
Marcus on AI • 7074 implied HN points • 09 Feb 25
  1. Just adding more data to AI models isn't enough to achieve true artificial general intelligence (AGI). New techniques are necessary for real advancements.
  2. Combining neural networks with traditional symbolic methods is becoming more popular, showing that blending approaches can lead to better results.
  3. The competition in AI has intensified, making large language models somewhat of a commodity. This could change how businesses operate in the generative AI market.
TheSequence • 147 implied HN points • 03 Feb 26
  1. There are different types of world models, and a clear taxonomy helps explain how they differ and what roles they play in AI.
  2. For decades, model-free reinforcement learning dominated: agents learned by reinforcing actions without building internal maps or understanding why those actions worked.
  3. Looking at the first major papers on world models reveals the origins and trade-offs of different approaches and shows why some models are better suited for planning and reasoning.
Software Bits Newsletter • 257 implied HN points • 29 Dec 25
  1. Associativity is the key property that lets you split work, combine partial results, and safely parallelize or stream computations without changing the answer.
  2. Softmax has a hidden associative state — tracking a local max and a scaled sum lets you correct and merge chunked results, which is the math behind FlashAttention’s memory- and time-saving trick.
  3. When optimizing a global computation, look for a small combinable state and an associative combine rule; if it exists you can chunk and parallelize, and if it doesn’t (for example, median) you need a different algorithmic approach.
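The "local max plus scaled sum" state described above can be sketched in a few lines. This is a minimal pure-Python illustration of chunked softmax, not FlashAttention itself; the names `chunk_state`, `combine`, and `chunked_softmax` are invented for the example.

```python
import math

def chunk_state(xs):
    """Summarize a chunk as (local max, sum of exp(x - local max))."""
    m = max(xs)
    return m, sum(math.exp(x - m) for x in xs)

def combine(a, b):
    """Associative merge of two chunk states."""
    (m1, s1), (m2, s2) = a, b
    m = max(m1, m2)
    # Rescale each partial sum to the shared max before adding.
    return m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m)

def chunked_softmax(xs, chunk_size):
    """Softmax computed from per-chunk states that are merged one at a time."""
    chunks = [xs[i:i + chunk_size] for i in range(0, len(xs), chunk_size)]
    state = chunk_state(chunks[0])
    for chunk in chunks[1:]:
        state = combine(state, chunk_state(chunk))
    m, s = state
    return [math.exp(x - m) / s for x in xs]

def reference_softmax(xs):
    """Ordinary single-pass softmax, used only for comparison."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Because `combine` is associative, chunks can be merged in any grouping (left to right, as a tree, or across devices) and the result agrees with `reference_softmax` up to floating-point rounding.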
Marcus on AI • 7153 implied HN points • 10 Nov 24
  1. The belief that more scaling in AI will always lead to better results might be fading. It's thought we might have reached a limit where simply adding more data and computing power is no longer effective.
  2. There are concerns that scaling laws, which have worked before, are just temporary trends, not true laws of nature. They don’t actually solve issues like AI making mistakes or hallucinations.
  3. If rumors are true about a major change in the AI landscape, it could lead to a significant loss of trust in these scaling approaches, similar to a bank run.
Gonzo ML • 252 implied HN points • 05 Jan 26
  1. A Universal Transformer–style model (URM) repeatedly applies a shared transformer layer with ACT, combining ConvSwiGLU and truncated backprop through loops to get very deep effective computation while keeping parameter count low.
  2. ConvSwiGLU injects a small depthwise convolution into the SwiGLU gating to mix local token context, and TBPTL reduces memory and training cost by only backpropagating through the final iterations.
  3. The model outperforms prior HRM/TRM baselines on tasks like Sudoku and ARC-AGI and Muon speeds convergence, but differences in evaluation protocols and some unclear experimental details mean independent verification is still needed.
Vasu’s Newsletter • 78 implied HN points • 25 Jan 26
  1. Each token creates query, key, and value vectors so it can ask what it needs, match that against other tokens, and gather useful information.
  2. Tokens compare their query to every key to get raw scores, convert those scores to attention weights with softmax, and use the weights to take a weighted sum of value vectors to produce a new contextual vector.
  3. Self-attention makes token meanings contextual (helping with pronouns, disambiguation, and long-range links), and models use multiple attention heads plus feed-forward layers to capture different relation patterns and refine each token's representation.
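The query/key/value steps above can be written out directly. The sketch below is a single-head, pure-Python version. The division by the square root of the key dimension is the standard scaled dot-product convention and is an assumption here, since the summary does not mention it; the projection matrices `wq`, `wk`, `wv` are placeholders.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention over a list of token vectors."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    d = len(q[0])
    # Raw scores: each query compared against every key.
    scores = [[sum(a * b for a, b in zip(qr, kr)) / math.sqrt(d) for kr in k] for qr in q]
    # Softmax turns each row of scores into attention weights.
    weights = [softmax(row) for row in scores]
    # Each output is a weighted sum of the value vectors.
    return matmul(weights, v), weights
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors.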
The Palindrome • 4 implied HN points • 14 Mar 26
  1. Machine learning means training predictive models from data. The core setup uses a dataset, a parametric model (a hypothesis), and a loss function to measure how well the model fits the data.
  2. A model approximates the true input–output relation and depends on both its parameters and the training data (often written h(x; w, D)). Models can be deterministic or probabilistic and belong to different families like generative or discriminative.
  3. Which learning paradigm you use depends on what inputs, outputs, and labels are available — the main paradigms are supervised, unsupervised, semi‑supervised, and reinforcement learning. In supervised learning you have input–label pairs and the goal is to learn the mapping from x to y.
Vasu’s Newsletter • 104 implied HN points • 05 Jan 26
  1. Text is split into discrete tokens, often subwords using Byte Pair Encoding, so a fixed vocabulary can represent any input by keeping common words whole and breaking rare words into parts.
  2. Each token ID is looked up in a learned embedding matrix to produce a dense vector, and these embeddings capture semantic and syntactic relationships learned during training.
  3. Embeddings are context-free and don’t encode position by themselves, so transformer mechanisms like attention and positional encodings combine them to determine meaning and word order.
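A toy version of the token-to-vector pipeline is sketched below. The vocabulary is hypothetical, and the greedy longest-match splitter is a stand-in for real Byte Pair Encoding, which learns its merges from data. The embedding matrix is random here, whereas a trained model would learn it.

```python
import random

# Hypothetical subword vocabulary; a real BPE vocabulary is learned from a corpus.
VOCAB = ["un", "believ", "able", "the", "cat", " "]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
DIM = 4

rng = random.Random(0)
# One row per token ID; random stand-in for a learned embedding matrix.
EMBEDDINGS = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in VOCAB]

def tokenize(text):
    """Greedy longest-match split into known subwords (not real BPE)."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

def embed(ids):
    """Look up each token ID in the embedding matrix."""
    return [EMBEDDINGS[i] for i in ids]

ids = tokenize("the unbelievable cat")
vectors = embed(ids)
```

The two space tokens in the example receive identical vectors, which shows why the lookup alone carries no positional or contextual information.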
Software Bits Newsletter • 103 implied HN points • 03 Jan 26
  1. Linearity lets you process many inputs as one big matrix multiply, so batching is nearly free and GPUs can run large batches with high efficiency.
  2. Differentiation is linear, so per-sample gradients can be summed and scaled — enabling gradient accumulation, distributed training, and efficient backprop.
  3. Non-linearities are required for expressivity, so networks interleave cheap, element-wise nonlinear functions with batch-friendly linear layers and prefer operations (like LayerNorm) that preserve batching advantages.
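The claim that per-sample gradients can be summed and scaled is easy to check on a toy problem. The sketch below assumes a scalar linear model `w * x` with squared-error loss; the function names are illustrative only.

```python
def grad_sum(w, batch):
    """Sum over samples of d/dw of 0.5 * (w * x - y) ** 2."""
    return sum((w * x - y) * x for x, y in batch)

def full_batch_grad(w, data):
    """Gradient of the mean loss computed in one pass."""
    return grad_sum(w, data) / len(data)

def accumulated_grad(w, data, micro_batch_size):
    """Same gradient, accumulated over micro-batches and scaled once at the end."""
    total = 0.0
    for i in range(0, len(data), micro_batch_size):
        total += grad_sum(w, data[i:i + micro_batch_size])
    return total / len(data)

data = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0), (5.0, 9.0)]
print(full_batch_grad(0.3, data), accumulated_grad(0.3, data, 2))
```

Note that the accumulated sum is divided by the total sample count, not averaged per micro-batch; with uneven micro-batches, averaging the per-batch means would give a different number.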
Import AI • 399 implied HN points • 13 May 24
  1. DeepSeek released a powerful language model called DeepSeek-V2 that surpasses other models in efficiency and performance.
  2. Research from Tsinghua University shows how mixing real and synthetic data in simulations can improve AI performance in real-world tasks like medical diagnosis.
  3. Google DeepMind trained robots to play soccer using reinforcement learning in simulation, showcasing advancements in AI and robotics.
TheSequence • 35 implied HN points • 17 Feb 26
  1. Recreating the world pixel-by-pixel isn’t the path to true intelligence, because generating images doesn’t prove a model understands the underlying concepts.
  2. JEPA (Joint Embedding Predictive Architecture) trains models to predict in a shared embedding space so they learn and forecast concepts instead of raw pixels, capturing semantics without rendering images.
  3. Several JEPA papers argue this is a promising way to build world models, suggesting we should shift research from generative reconstruction to predictive conceptual representations when measuring understanding.
Software Bits Newsletter • 103 implied HN points • 01 Jan 26
  1. Self-attention treats all positions symmetrically, so permuting tokens just permutes outputs; because attention is permutation‑equivariant, Transformers need positional encodings to learn token order.
  2. Commutativity is a deliberate design trade‑off: it enables parallelization and is perfect for unordered data like point clouds, sets, and graphs, but it destroys order information so you must use non‑commutative models or inject positions when order matters (language, time series).
  3. Commutativity shows up across ML: global pooling gives useful invariance but loses location; gradient aggregation and distributed training rely on commutative sums; and floating‑point non‑associativity can still cause small nondeterminism.
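Permutation equivariance can be verified numerically. The sketch below uses a stripped-down attention with no learned projections (each token serves as its own query, key, and value), which is a simplification for illustration, and compares "permute then attend" with "attend then permute".

```python
import math

def attention(tokens):
    """Projection-free self-attention: each token is its own query, key, and value."""
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[d] for e, v in zip(exps, tokens)) for d in range(len(q))])
    return out

def permute(rows, order):
    return [rows[i] for i in order]

def mean_pool(rows):
    """Global average pooling: invariant to token order."""
    return [sum(col) / len(rows) for col in zip(*rows)]

tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
order = [2, 0, 1]
a = attention(permute(tokens, order))
b = permute(attention(tokens), order)
```

Here `a` and `b` agree up to floating-point rounding, and `mean_pool` returns the same vector for either ordering, again up to rounding.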
Democratizing Automation • 760 implied HN points • 28 Jun 25
  1. Deep learning is not as complicated as it seems; the basic ideas are pretty straightforward and can be learned quickly with the right guidance. You don't need years of study to understand how it works.
  2. Getting the right random initialization for neural networks is crucial. If the initialization is too small, the signal can decay and become unnoticeable, making it hard for the model to learn effectively.
  3. Machine learning focuses on achieving good enough results rather than perfect solutions. It’s more about finding practical and useful models with the resources available.
TheSequence • 28 implied HN points • 10 Feb 26
  1. The Dreamer trilogy of papers reshaped how researchers build and use world models in AI.
  2. Model-based reinforcement learning inspired modern world models, focusing on agents that learn internal predictive models instead of directly mapping pixels to actions.
  3. Model-free methods like DQN succeeded in 2D games but struggled in complex 3D environments such as DeepMind Lab and Minecraft, revealing the limits of purely reactive agents and motivating the shift to world models.
TheSequence • 56 implied HN points • 14 Jan 26
  1. Bigger context windows aren't always the answer; dumping more text into attention can make a model's reasoning worse, not better.
  2. The paper calls this failure mode "context rot": as prompts grow, attention dilutes, the model's working set becomes unmanageable, and output quality drops.
  3. Instead of just expanding attention, we need different computational shapes—treating prompts more like environments and processing information recursively to avoid drowning the model in irrelevant context.
Marcus on AI • 4624 implied HN points • 16 Nov 23
  1. In the midst of an AI boom, scale isn't everything, and there are still unresolved issues.
  2. Recognition is growing that scoring well on benchmarks doesn't mean true foundational progress.
  3. Tech leaders like Sam Altman are acknowledging the limitations of deep learning and considering new paradigms.
Recommender systems • 26 implied HN points • 31 Jan 26
  1. Pre-training builds a base "world model" by predicting next tokens across huge text corpora, minimizing cross-entropy (negative log-likelihood) so the model learns facts, grammar, and reasoning.
  2. Supervised fine-tuning (SFT) teaches the model to follow instructions, and LoRA makes this efficient by adding small low-rank adapter matrices so you can adapt behavior without updating the entire model.
  3. Reinforcement approaches (like PPO) use a reward model, advantage estimates, clipping, and a KL penalty to safely push adapters toward human preferences, while Direct Preference Optimization (DPO) skips the reward model and trains a new adapter using a log-ratio objective between preferred and unpreferred responses.
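The log-ratio objective behind DPO fits in one function. The sketch below takes sequence log-probabilities as plain floats; `beta` and the argument names are illustrative, and a real training loop would compute these values from a policy model and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref, preferred
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref, unpreferred
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2. The loss drops as the policy raises the preferred response's log-probability relative to the unpreferred one.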
Import AI • 718 implied HN points • 21 Aug 23
  1. Debate on whether AI development should be centralized or decentralized reflects concerns about safety and power concentration.
  2. Discussion on the importance of distributed training and finetuning versus dense clusters highlights evolving AI policy and governance ideas.
  3. Exploration of AI progress without needing 'black swan' leaps raises questions about the need for heterodox strategies and societal permissions for AI developers.
Deep (Learning) Focus • 609 implied HN points • 08 May 23
  1. LLMs can solve complex problems by breaking them into smaller parts or steps using CoT prompting.
  2. Automatic prompt engineering techniques, like gradient-based search, provide a way to optimize language model prompts based on data.
  3. Simple techniques like self-consistency and generated knowledge can be powerful for improving LLM performance in reasoning tasks.
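Self-consistency comes down to sampling several reasoning paths and keeping the most common final answer. The sketch below covers only the voting step; the list of sampled answers is hypothetical and stands in for the final answers extracted from repeated model calls.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers; returns (answer, share of votes)."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers extracted from five sampled chains of thought.
sampled = ["42", "41", "42 ", "42", "7"]
print(self_consistency(sampled))
```

The vote share can double as a rough confidence signal: a low share suggests the sampled reasoning paths disagree.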
TheSequence • 42 implied HN points • 01 Jan 26
  1. Blanket scaling of transformers with more data and compute is showing diminishing returns, so new research directions are needed to keep improving frontier models.
  2. The field is shifting from generative AI that just looks right to verifiable AI that can deliberate and produce correct, auditable outputs, effectively adding a "System 2" for reasoning.
  3. Emerging methods like RLVR aim to give models unit-test-style feedback and tighter verification, and these kinds of approaches are poised to influence models shipping in 2026.
TheSequence • 35 implied HN points • 07 Jan 26
  1. DeepSeek's mHC challenges established assumptions about AI scaling and suggests new architectural ideas that could change how larger models are built and trained.
  2. Residual connections are the unsung scaffolding of modern deep networks, providing a 'gradient highway' that keeps training stable across many layers.
  3. The simple rule y = f(x) + x—adding the input back to a layer's output—was revolutionary because it preserves signals and gradients, making very deep networks trainable.
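The effect of y = f(x) + x on signals and gradients can be seen in a scalar toy network. The sketch below assumes every layer is f(x) = 0.5 * tanh(x), a choice made purely for illustration, and tracks both the output and its derivative with respect to the input through a stack of layers.

```python
import math

def f(x, w=0.5):
    return w * math.tanh(x)

def f_prime(x, w=0.5):
    return w * (1.0 - math.tanh(x) ** 2)

def forward_and_grad(x, depth, residual):
    """Return (output, d output / d input) for a stack of identical scalar layers."""
    grad = 1.0
    for _ in range(depth):
        # Chain rule: the residual path adds 1 to each layer's local derivative.
        grad *= f_prime(x) + (1.0 if residual else 0.0)
        x = f(x) + (x if residual else 0.0)
    return x, grad

plain_out, plain_grad = forward_and_grad(1.0, depth=20, residual=False)
res_out, res_grad = forward_and_grad(1.0, depth=20, residual=True)
print(plain_grad, res_grad)
```

Without the skip path, each layer's derivative is at most 0.5, so the gradient shrinks geometrically with depth. With it, each layer's derivative is at least 1, so the gradient cannot vanish.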
Deep (Learning) Focus • 373 implied HN points • 01 May 23
  1. LLMs are powerful due to their generic text-to-text format for solving a variety of tasks.
  2. Prompt engineering is crucial for maximizing LLM performance by crafting detailed and specific prompts.
  3. Techniques like zero-shot and few-shot learning, as well as instruction prompting, can optimize LLM performance for different tasks.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 39 implied HN points • 27 Jun 24
  1. Retrieval-Augmented Generation (RAG) mixes retrieval methods with learning systems to help large language models use real-time data.
  2. RAG can enhance the accuracy of language models by incorporating current information, avoiding wrong answers that might come from outdated knowledge.
  3. The framework of RAG includes steps like pre-retrieval, retrieval, post-retrieval, and generation, each contributing to better outputs in language processing tasks.
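The four stages named above can be laid out as a tiny pipeline. Everything in the sketch below is a stand-in: the documents are invented, retrieval is plain word overlap rather than a vector search, and the generation stage stops at assembling the prompt that a language model would receive.

```python
# Hypothetical document store for illustration.
DOCS = [
    "The 2024 release added streaming support.",
    "Billing is handled monthly in arrears.",
    "Streaming responses require API version 2.",
]

def normalize(text):
    return {w.strip(".,?!").lower() for w in text.split()}

def pre_retrieval(query):
    """Pre-retrieval: turn the query into search terms."""
    return normalize(query)

def retrieve(terms, docs, k=2):
    """Retrieval: rank documents by word overlap with the query terms."""
    scored = [(len(terms & normalize(doc)), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def post_retrieval(hits, min_score=1):
    """Post-retrieval: drop hits that share no terms with the query."""
    return [doc for score, doc in hits if score >= min_score]

def build_prompt(query, context):
    """Generation input: the prompt a language model would be given."""
    lines = "\n".join(f"- {doc}" for doc in context)
    return f"Context:\n{lines}\n\nQuestion: {query}\nAnswer:"

query = "Which API version supports streaming?"
context = post_retrieval(retrieve(pre_retrieval(query), DOCS))
print(build_prompt(query, context))
```

Swapping the overlap scorer for an embedding-based retriever changes only `retrieve`; the other stages keep the same shape.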
Trevor Klee’s Newsletter • 597 implied HN points • 26 Nov 24
  1. Emergent properties in biology can be hard to connect, kind of like trying to understand a car by randomly taking it apart. Even as we learn about proteins and genes, connecting them to actual biological traits remains a challenge.
  2. Deep learning models like AlphaFold are changing the game by revealing connections between micro and macro biological features, even if we don't fully understand how they do it. It's like having a model that can assemble a car based on its parts without exactly knowing how all those parts work together.
  3. Recently, there's been exciting work in mechanistic interpretability, which helps us understand how these deep learning models make sense of biology. This could lead to new insights and even virtual experiments that help us learn about cell behavior and gene interactions.
Gonzo ML • 441 implied HN points • 27 Jan 25
  1. DeepSeek is a game-changer in AI, having trained its models at a much lower cost than competitors like OpenAI and Meta. This makes advanced technology more accessible.
  2. They released new models called DeepSeek-V3 and DeepSeek-R1, which offer impressive performance and reasoning capabilities similar to existing top models. These require advanced setups but show promise for future development.
  3. Their multimodal model, Janus-Pro, can work with both text and images, and it reportedly outperforms popular models in generation tasks. This indicates a shift toward more versatile AI technologies.
TheSequence • 28 implied HN points • 25 Dec 25
  1. Scaling up transformers with more data and compute drove past AI gains, but that straightforward path is hitting limits because high-quality pretraining data and scaling efficiency are finite.
  2. The field is shifting to an "age of research" where diverse experiments and new ideas, not just bigger models, will determine future breakthroughs.
  3. Progress will come from a toolbox of new recipes — like souped-up pretraining, novel architectures, and improved fine-tuning — that turn compute into faster learning, better adaptation, and fewer odd model failures.