The hottest Computer Vision Substack posts right now

And their main takeaways

The Sequence Knowledge #829: World Models and Physical AI

TheSequence • 280 implied HN points • 24 Mar 26

🕹 Technology Computer Vision

Most modern world models focus on temporal prediction by hallucinating the next video frame pixel-by-pixel.
World Labs’ Marble marks a shift to spatial intelligence as a Large World Model that reconstructs, generates, and simulates persistent 3D environments.
The core idea is lifting 2D inputs into 4D representations so models can reason about space and time together.

It's all a blur

lcamtuf’s thing • 11631 implied HN points • 06 Feb 26

🕹 Technology Computer Vision

Averaging-based blurs are linear and often reversible, so knowing the filter and padding lets you set up simple equations to recover original pixels.
A right-aligned moving average makes iterative reconstruction straightforward and can reveal fine detail even with large blur windows, though 8-bit quantization adds visible noise.
Two-pass (X then Y) blurs can still be inverted if the filter biases the current pixel, and recovered images can survive normal lossy formats like JPEG unless compression is very heavy.
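
The reversibility claim can be illustrated in one dimension. This is a minimal sketch, not the post's own code: it assumes a right-aligned window and zero padding to the left of the row, and the function names are made up for illustration. Because each blurred value is a known linear combination of the current pixel and pixels already recovered, the row can be rebuilt left to right.

```python
def blur_right_aligned(pixels, window):
    """Right-aligned moving average: each output averages the current
    pixel and the (window - 1) pixels before it. Positions before the
    start of the row count as zero (assumed padding rule)."""
    blurred = []
    for n in range(len(pixels)):
        start = max(0, n - window + 1)
        blurred.append(sum(pixels[start:n + 1]) / window)
    return blurred


def unblur_right_aligned(blurred, window):
    """Invert the blur left to right.
    window * blurred[n] = pixels[n] + (the previous window - 1 pixels),
    and those previous pixels are already recovered, so solve for pixels[n]."""
    recovered = []
    for n, value in enumerate(blurred):
        start = max(0, n - window + 1)
        already_known = sum(recovered[start:n])
        recovered.append(value * window - already_known)
    return recovered


row = [10, 200, 30, 40, 250, 0, 90]
blurred = blur_right_aligned(row, window=3)
# Round only to clear floating-point residue; no information is lost here.
restored = [round(p) for p in unblur_right_aligned(blurred, window=3)]
```

If the blurred values are rounded to whole numbers before inversion, each rounding error carries forward through the recurrence, which is consistent with the quantization noise mentioned above.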

The Sequence Knowledge #825: Inside World Labs Marble

TheSequence • 259 implied HN points • 17 Mar 26

🕹 Technology Computer Vision

Marble shifts the focus from generating pixels frame by frame to building spatial intelligence.
It’s a Large World Model that reconstructs, generates, and simulates persistent 3D environments for richer, longer-lived scene understanding.
The core idea is lifting 2D inputs into a 4D representation (adding depth and time) so the model can build and reason about persistent 3D worlds over time.

UX Roundup: Generate in 4K | Research Reproducibility Crisis | NotebookLM Videos Perform Well | Midjourney v.8 | Micropayments | Microsoft Image Model | NVIDIA GTC

Jakob Nielsen on UX • 21 implied HN points • 23 Mar 26

🕹 Technology Computer Vision

Generate images at very high resolution (4K) because iterative edits and repeated modifications degrade quality, so starting large preserves fidelity for the final, smaller publish size.
A large share of top-tier UI/HCI studies fail replication, so interface research can generalize poorly and it’s safest to rely on findings that have been independently reproduced across methods and domains.
Micropayments for AI agents look promising since agents can automatically spend small budgets to access paid, high-quality content; new protocols like MPP could make this practical and help fund better content and better AI.

Fact checking Moravec's paradox

AI Snake Oil • 1797 implied HN points • 29 Jan 26

🕹 Technology Computer Vision

The idea that tasks humans find hard are easy for AI, and vice versa, isn't backed by solid evidence. It's largely a selection effect because researchers focus on problems they find interesting and ignore tasks that are too easy or too hard to bother with.
The evolutionary story that perception and motor skills are inherently harder than abstract reasoning is shaky. Whether a task is easy or hard for AI depends on domain openness, feedback, and available data, and breakthroughs (like deep learning for vision) can change what's difficult.
Relying on that rule of thumb to predict AI's next moves is misleading. It's better to plan for how new capabilities are actually deployed and build adaptable policies, since diffusion, infrastructure, and real-world constraints shape impacts more than simple capability predictions.

Physical Intelligence Wins the Olympics: Deep Dive

General Robots • 1814 implied HN points • 22 Jan 26

🕹 Technology Computer Vision

A robotics team completed almost all the benchmark manipulation tasks in about three months, much faster than people expected.
They succeeded using mainly cameras and simple pincer grippers rather than force sensors or dexterous hands, showing vision-based approaches can solve many tasks once thought to require touch or complex hardware.
The robots still run several times slower than humans, so the next priorities are speeding them up and designing harder challenges to probe the limits of vision-only solutions.

The Sequence Knowledge #821: 4D and World Models and the Amazing DeepMind D4RT

TheSequence • 133 implied HN points • 10 Mar 26

🕹 Technology Computer Vision

World models are shifting from predicting 2D video pixels to reconstructing 3D geometry over time (4D), which lets systems model dynamic scenes more realistically.
Spatial intelligence means AI can perceive volume, infer occluded parts, and predict temporal trajectories with mathematical precision.
DeepMind's D4RT is a notable breakthrough that stitches fragmented observations into a unified 4D world model, improving how machines understand and predict changing environments.

The Sequence Knowledge #817: DeepMind Genie and Interactive World Models

TheSequence • 217 implied HN points • 03 Mar 26

🕹 Technology Computer Vision

Passive video generation can make beautiful, consistent worlds but can’t be steered; true world models must understand agency and not just what happens.
DeepMind’s Genie is one of the most advanced world models and represents a move toward interactive, controllable virtual environments.
A key bottleneck is data: we don’t have enough controller/action data showing causes and effects to train truly actionable world models.

Grok is making things up — and then deleting the evidence

Weaponized • 52 implied HN points • 13 Mar 26

🕹 Technology Computer Vision

Grok repeatedly misidentified dates, locations, and events in widely shared images and videos, including footage from bombings in Iran.
Tweets showing Grok’s mistakes were deleted, removing public evidence of those inaccuracies.
Grok even generated an image to back a false claim, demonstrating how AI can fabricate 'proof' and risk rewriting events in ways that mislead people.

Benjie's Humanoid Olympics: Part II

General Robots • 732 implied HN points • 27 Jan 26

🕹 Technology Computer Vision

Robotics is progressing faster than expected, so more difficult, real-world challenges are needed to keep driving breakthroughs.
The new tasks emphasize dynamic movement, fine fingertip dexterity, tool use, and whole-body manipulation through everyday activities like catching eggs, cooking, folding sheets, hammering, and getting into a car.
A competition framework awards medals and asks teams to demonstrate success with videos, inviting community participation and leaving some earlier challenges still unclaimed.

The Sequence Knowledge #812: The Sora Moment: When Video Models Became Physics Engines

TheSequence • 252 implied HN points • 24 Feb 26

🕹 Technology Computer Vision

Video generation models are now functioning as physics engines that can learn and predict object dynamics and interactions from data.
OpenAI's Sora marked a turning point by framing video models as world simulators, shifting the focus from generating pixels to building data-driven models of physical reality.
This shift is enabled by architectures like diffusion transformers, which combine diffusion processes with transformer models to capture complex spatiotemporal dynamics.

No One Can Tell What’s Real Anymore

Weaponized • 14 implied HN points • 18 Mar 26

🕹 Technology Computer Vision

There is no universally accepted, reliable way to tell if an image or video was made by AI, whether you're a member of the public, a journalist, or an engineer.
Verification today uses a mix of methods—watermarks, detectable artifacts, provenance checks—but each method only works sometimes and leaves big gaps.
Those gaps create a gray zone where uncertain content can linger and allow disinformation to spread easily.

The Shape of AI: Jaggedness, Bottlenecks and Salients

One Useful Thing • 1423 implied HN points • 20 Dec 25

🕹 Technology Computer Vision

AI ability is jagged: it can be superhuman at some tasks (like reasoning or math) and weak at others (like memory or simple real-world interactions), so humans and AI will often end up complementing each other.
A single weak link can bottleneck an entire process, and those bottlenecks can be technical or institutional; when a lab fixes a key bottleneck (a "reverse salient") the whole system can leap forward.
Fixing bottlenecks can cause sudden lurches—better image generation already unlocked automated slide creation—yet humans will still be needed for edge cases, social coordination, and tasks requiring memory or physical action, so changes will be uneven and create new opportunities.

Deep Learning Legends: Ilya's List, or 90% of Everything That Matters in AI

Gonzo ML • 252 implied HN points • 08 Feb 26

🕹 Technology Computer Vision

A compact, curated reading list of landmark papers can teach roughly 90% of the core ideas and techniques in deep learning, offering a fast path to real understanding.
The essential topics span sequence models (RNNs/LSTMs/NTM), attention and transformers, convolutional vision models, theory of complexity and description length, training methods and scaling, and multimodal/speech work.
The publicly available partial list misses several important areas — notably reinforcement learning and meta-learning — so it should be supplemented with RL classics and recent advances like scaling laws, compute‑optimal training, mixture‑of‑experts, distillation, and key optimization tricks.

Image generation: Still crazy after all these years

Marcus on AI • 6126 implied HN points • 25 Jun 25

🕹 Technology Computer Vision

AI image generation technology is still struggling to understand complex prompts. Even with recent updates, it often fails at specific tasks.
There's a big difference between making an AI produce a certain image and it truly understanding what the words mean. AI might get lucky sometimes, but it doesn't reliably get it right.
Despite promises of advanced technology, AI still has a long way to go before it can provide high-quality, detailed images based on deep language understanding.

~70% of PHerc. 172 is now digitally unwrapped

Vesuvius Challenge • 98 implied HN points • 13 Jan 26

🕹 Technology Computer Vision

The team has digitally unwrapped about 70% of the lower region of PHerc. 172 using a new automated pipeline that's over 10× faster than fully manual methods, though humans still must fix sheet‑switch errors.
The unwrapped area covers roughly 7 meters by 14 cm and gives semi‑continuous surfaces with readable ink mainly on outer wraps and fragments; the upper ~30% is too mangled to unwrap reliably and the 7.9 µm scan resolution limits legibility compared with clearer 2.4 µm rescans.
Help is needed to improve surface extraction (to reduce sheet switches), strengthen ink detection in hard inner regions, and make the pipeline more scalable and user‑friendly—there's an ongoing Kaggle challenge for surface detection.

The Sequence Knowledge #808: Stop Trying to Generate the World: Inside the JEPA Way for World Models

TheSequence • 35 implied HN points • 17 Feb 26

🕹 Technology Computer Vision

Recreating the world pixel-by-pixel isn’t the path to true intelligence, because generating images doesn’t prove a model understands the underlying concepts.
JEPA (Joint Embedding Predictive Architecture) trains models to predict in a shared embedding space so they learn and forecast concepts instead of raw pixels, capturing semantics without rendering images.
Several JEPA papers argue this is a promising way to build world models, suggesting we should shift research from generative reconstruction to predictive conceptual representations when measuring understanding.
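
A toy sketch of what "predicting in embedding space" means, under heavy simplification: the encoder and predictor here are fixed linear maps with hand-picked weights, and all names are hypothetical. Real JEPA models use learned deep encoders, commonly a separate target encoder, and extra measures to keep the embeddings from collapsing. The only point shown is that the loss compares embeddings, not raw inputs.

```python
def linear(weights, vector):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def jepa_style_loss(context, target, encoder_w, predictor_w):
    """Encode both views, predict the target embedding from the
    context embedding, and score the prediction in embedding space."""
    z_context = linear(encoder_w, context)
    z_target = linear(encoder_w, target)
    z_predicted = linear(predictor_w, z_context)
    return mse(z_predicted, z_target)


# Hypothetical encoder: keeps the first two inputs, discards the last two.
encoder_w = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0]]
# Hypothetical predictor: identity map on the 2-d embedding.
predictor_w = [[1.0, 0.0],
               [0.0, 1.0]]

context = [0.5, -1.0, 9.0, 9.0]   # two views that agree on the
target = [0.5, -1.0, -3.0, 7.0]   # "semantic" part, differ elsewhere

embedding_loss = jepa_style_loss(context, target, encoder_w, predictor_w)
raw_input_loss = mse(context, target)
```

Here the embedding loss is zero while the raw-input error is large, because the encoder drops the components on which the two views disagree. A generative objective would be penalized for that discarded detail; an embedding-space objective is not.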

Computer, enhance!

Conspirador Norteño • 52 implied HN points • 31 Jan 26

🕹 Technology Computer Vision

AI "enhancements" can't recover real details that aren't in the original image; the models fill missing parts with invented content based on their training data, not the actual scene.
Outputs are strongly shaped by prompts and the model, so unmasking or upscaling attempts can produce wildly different and fabricated features like beards or tattoos, making them unreliable for identifying people.
AI-altered frames can add impossible or false actions (for example, a gun firing a flamethrower‑like blast), so such edits can mislead viewers and should not be treated as evidence.

The Progression of the ARC-AGI Frontier

Human Programming • 25 implied HN points • 19 Feb 26

🕹 Technology Computer Vision

The ARC benchmark has evolved and different solution families have led the frontier over time; early winners used program-search while recent progress comes from LLM-based pipelines that rely on synthetic pretraining, test-time fine-tuning, and augmentation/voting tricks.
High leaderboard scores don’t mean AGI because teams can exploit pretraining, dataset leakage, or massive compute to solve benchmarks; true general intelligence would quickly and cheaply solve newly released ARC tasks without prior exposure.
Commercial LLMs currently drive most top results and improvements in base models lift many approaches, but hybrid methods like program synthesis and symbolic reasoning remain promising, and upcoming refreshed benchmarks will reveal whether LLMs truly generalize.
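
The augmentation/voting idea mentioned above can be sketched as follows. This is an illustrative toy, not any team's pipeline: the solver is a stand-in function, rotations are the only augmentation, and the vote only makes sense for tasks whose answer rotates along with the input (an assumption stated here, not in the post).

```python
from collections import Counter


def rotate90(grid):
    """Rotate a grid (list of rows) by a quarter turn."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate(grid, k):
    """Apply k quarter turns; negative k undoes positive k."""
    for _ in range(k % 4):
        grid = rotate90(grid)
    return grid


def predict_with_voting(solver, grid):
    """Run the solver on each rotated copy, rotate each answer back,
    and return the answer that the most copies agree on."""
    votes = Counter()
    for k in range(4):
        prediction = solver(rotate(grid, k))
        restored = rotate(prediction, -k)
        votes[tuple(map(tuple, restored))] += 1
    best, _ = votes.most_common(1)[0]
    return [list(row) for row in best]


def flaky_recolor(grid):
    """Toy solver for 'recolor 1 -> 2' that fails (returns its input
    unchanged) whenever the top-left cell is 0."""
    if grid[0][0] == 0:
        return [row[:] for row in grid]
    return [[2 if cell == 1 else cell for cell in row] for row in grid]


puzzle = [[0, 1],
          [1, 1]]
single_try = flaky_recolor(puzzle)
voted = predict_with_voting(flaky_recolor, puzzle)
```

The single attempt hits the solver's failure case, but three of the four rotated copies avoid it, so the vote returns the correct recoloring.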

Charting AI's Rise: 2025's Intelligence Breakthroughs Visualized

Maximum Truth • 88 implied HN points • 31 Dec 25

🕹 Technology Computer Vision

AI systems made rapid, large intelligence gains in 2025 on a Mensa-style offline IQ test, with several models reaching scores in the human-intelligence range.
Visual understanding improved significantly, enabling models to read and reason from images directly, which could let them gather new real-world training data beyond online text.
Progress was global and diverse: open-source and Chinese models closed ground and formerly weak systems like Grok rose fast, increasing competition and reducing single-company dominance.

The Sequence Knowledge #804: The Dreamer Trilogy: Inside Some of the Most Influential Papers in AI World Models

TheSequence • 28 implied HN points • 10 Feb 26

🕹 Technology Computer Vision

The Dreamer trilogy of papers reshaped how researchers build and use world models in AI.
Model-based reinforcement learning inspired modern world models, focusing on agents that learn internal predictive models instead of directly mapping pixels to actions.
Model-free methods like DQN succeeded in 2D games but struggled in complex 3D environments such as DeepMind Lab and Minecraft, revealing the limits of purely reactive agents and motivating the shift to world models.

The Sequence Knowledge #792: EVERYTHING you Need to Know About Synthetic Data Generation

TheSequence • 49 implied HN points • 20 Jan 26

🕹 Technology Computer Vision

Synthetic data is a practical scaling lever that fills coverage gaps and builds long-tail capabilities by creating targeted examples instead of waiting for rare real-world labels.
Core methods include generative synthesis, rephrasing/paraphrasing, multi-turn dialogue synthesis, and RL trajectory generation, each tailored to different tasks like images, instructions, conversations, or environment rollouts.
The focus is on quality over quantity: tight specs, automatic verification, diversity controls, and eval-driven feedback let teams steer capabilities, improve class balance, protect privacy, and iterate quickly.

Finally—letters in Scroll 4!

Vesuvius Challenge • 64 implied HN points • 21 Dec 25

🔬 Science Computer Vision

A new high-resolution tomographic scan (2.4 µm pixels, 78 keV, 22 cm propagation) revealed 5–6 mm letters in PHerc. 1667 that were invisible in earlier 8 µm scans.
A generalist ink-detection model trained on other fragments detected letters immediately without scroll-specific labeling, suggesting the method can find ink across different scrolls.
The team is retiring the First Letters and First Title prizes to focus on extracting text, and they doubled the Kaggle competition prize pool to $200,000 while preparing an updated dataset.

Da fuq is an “LLM?”

Sex and the State • 26 implied HN points • 14 Jan 26

🕹 Technology Computer Vision

An LLM (large language model) is an AI system that mainly reads and writes natural language and powers modern chatbots like ChatGPT, Claude, and Gemini.
AI is a big umbrella with many types of tools — image generators, detectors, chat interfaces, and world models — and LLMs are just the language-focused slice, not the same as models that work with images or spatial data.
Many leading researchers argue LLMs alone probably won’t produce human-level or general intelligence, because language only points to thought; building AGI likely requires spatial or "world" models that learn from videos, perception, and interaction.

Why We Must Build World Models

Reasons to Be Optimistic • 6 implied HN points • 17 Feb 26

🕹 Technology Computer Vision

Text-only models are powerful but incomplete because language misses how the world actually looks, moves, and feels; video offers a far richer, high-volume source of physics, sound, and human behavior.
True world models must be causal and action-conditioned, predicting the next state step-by-step under intervention; autoregressive diffusion transformer architectures trained on multimodal video and actions are a promising path.
General world models will turn naive software into systems that understand and interact with the real world, enabling adaptive robots, immersive simulations, new learning tools, and large-scale scientific discovery.

The Sequence Knowledge #784: The Convergence of Synthetic Data and World Models Models Are Unlocking Embodied AI

TheSequence • 28 implied HN points • 06 Jan 26

🕹 Technology Computer Vision

Collecting high-quality, perfectly labeled 3D data from the real world is slow, expensive, and misses rare edge cases, so 'reality' is the main bottleneck for embodied AI.
Pairing synthetic data generation with world models lets teams create rich, diverse, and labeled simulated environments, so agents can be trained and tested without costly real-world collection.
New world models like Google DeepMind's Genie show this approach in action by enabling interactive, dynamic 3D simulations where robots and autonomous vehicles can learn more robust behaviors.

The Year in Image Generation

Jakob Nielsen on UX • 23 implied HN points • 29 Dec 25

🕹 Technology Computer Vision

Image rendering is no longer the bottleneck; creators can cheaply produce many bespoke variations, so the scarce resource is attention and editorial selection — the best images earn attention by adding clarity, not noise.
Image models have moved from drawing single objects to composing multi-concept scenes and full layouts, and different models trade visual lushness for prompt adherence; creators need to pick or switch models based on the task and content rules.
AI-generated infographics and comics can look authoritative but still hallucinate facts or structure, so people must verify and correct outputs even as hallucinations steadily decline.

The New Era of Efficient LLM Deployment

Gradient Flow • 299 implied HN points • 13 Jul 23

🕹 Technology Computer Vision

AI tools are becoming pervasive in tech, with the potential to add trillions annually to global productivity.
Efficient deployment of large language models (LLMs) is crucial for businesses to scale their AI initiatives and drive digital innovation.
Rethinking MLOps infrastructure is essential to accommodate the scale and complexity of LLMs, with a need for solutions addressing challenges in inference, serving, and deployment.

The Sequence Knowledge #780: Synthetic Data for Image Models

TheSequence • 21 implied HN points • 30 Dec 25

🕹 Technology Computer Vision

Synthetic image data is now a core tool for vision models and works especially well when real images are scarce, private, or unbalanced by providing labeled pixels and covering rare edge cases.
Modern generative models (diffusion models, GANs) combined with conditional controls like segmentation, depth, keypoints, ControlNet, or LoRA let you steer layout, pose, lighting, and style; typical pipelines script prompts, generate images, and auto-label using the same controls.
Success depends on choosing the right generator and control signals and running a rigorous quality-control loop so synthetic variety actually improves downstream performance, a pattern already used in systems like NVIDIA’s Synthetica for robot training.

Math Discovery, Long-Context Memory, and the Limits of Multimodal Reasoning

HackerPulse Dispatch • 13 implied HN points • 19 Dec 25

🕹 Technology Computer Vision

AlphaEvolve demonstrates AI agents can autonomously discover and improve mathematical constructions, generalize finite solutions into universal formulas, and integrate with proof assistants for verification.
MMGR shows that image and video models produce convincing visuals but largely fail at causal and abstract reasoning (often <10% accuracy), revealing a major gap between perceptual quality and true world understanding.
Advances in model design and decoding are pushing capabilities: QwenLong-L1.5 enables reasoning over 4M-token contexts using synthetic multi-hop data, stabilized RL, and memory-augmented architectures, and ReFusion speeds text generation by decoding in parallel with a plan-and-infill diffusion approach.

Market Map & Analysis: AI Synthetic Data Companies

The Strategy Deck • 78 implied HN points • 06 Jul 23

🕹 Technology Computer Vision

Synthetic data matters for ML because it can stand in for real-world data, protect sensitive information, and help validate AI applications.
Synthetic data is used in computer vision for autonomous vehicles and is expanding to other data types like text and tabular data.
There are specialized and general-purpose synthetic data platforms developing innovative solutions for various industries and use cases.

Clearpath/Otto Motors acquisition north of $600m, startup deadlines and Ghost Sharks

Robots & Startups • 59 implied HN points • 17 Sep 23

🕹 Technology Computer Vision

AI is changing work, with a focus on consultants and the big players in autonomous delivery.
The Clearpath/Otto Motors acquisition exceeded $600 million.
Some startups emphasized small planes with computer vision, rather than drones, for specific solutions.

🧠 Universal Weights, Live Avatars, and the Limits of Data Agents

HackerPulse Dispatch • 5 implied HN points • 12 Dec 25

🕹 Technology Computer Vision

Neural networks trained on diverse tasks tend to converge to similar low-dimensional weight subspaces, implying a shared parametric backbone that could make transfer learning and model reuse much more efficient.
System-and-algorithm co-design now enables large diffusion models to run in real time for streaming avatars (20 FPS on a 14B model), showing practical deployment of big generative models for live video.
A 210-task benchmark shows current data agents succeed on under 20% of engineering tasks and under 40% of analysis tasks, revealing major gaps in orchestration and reasoning for enterprise workflows.

AI Observability, Orchestration, Consolidation

Gradient Flow • 179 implied HN points • 26 May 22

🕹 Technology Computer Vision

Companies are likely to use at most two platforms for managing the entire machine learning pipeline: one for exploration and another for deployment and operations.
Prefect 2.0 is a popular framework for data and workflow orchestration, emphasizing 'code as workflows' to address data engineering challenges.
The survey on workflow orchestration tools revealed a growing interest in these systems, with startups raising over $450 million in funding for orchestration solutions.

Must Learn AI Security Part 23: Blurring or Masking Attacks Against AI

Rod’s Blog • 39 implied HN points • 19 Oct 23

🕹 Technology Computer Vision

Blurring or masking attacks against AI involve manipulating input data like images or videos to deceive AI systems while keeping content recognizable to humans.
Common types of blurring and masking attacks against AI include Gaussian blur, motion blur, median filtering, noise addition, occlusion, patch/sticker, and adversarial perturbation attacks.
Blurring or masking attacks can lead to degraded performance, security risks, safety concerns, loss of trust, financial/reputational damage, and legal/regulatory implications in AI systems.

ML Data Labeling Tools Structure Information to Make It Meaningful - Market Map and Analysis

The Strategy Deck • 39 implied HN points • 17 Jul 23

🕹 Technology Computer Vision

Data labeling improves the quality of ML models by attaching meaningful labels to raw data.
Data labeling tools offer features like support for various data types, collaboration between annotators, and data versioning.
ML platforms for data labeling include multi-modal, general purpose tools for manual labeling and programmatic tools focusing on specific data types and niches.

AI will predict what you'll buy in two years; is AI the future of film-making? 16 examples of how open-source LLMs are used today; the US is building AI infrastructure for academic researchers

Computerspeak by Alexandru Voica • 19 implied HN points • 02 Feb 24

🕹 Technology Computer Vision

AI is playing a significant role in various industries, from predicting consumer behavior to improving movie-making processes, indicating a growing reliance on AI technology.
Companies like Amazon, Google, Meta, and Microsoft are investing in custom AI chips and developing AI assistants to enhance their services and offerings.
Advancements in AI, particularly in natural language processing and computer vision, are shaping the future of ecommerce by enabling personalized, engaging, and context-aware experiences for customers.

The Sequence Knowledge #545: Beyond Language, Learning About Multimodal Benchmarks

TheSequence • 28 implied HN points • 20 May 25

🕹 Technology Computer Vision

Multimodal benchmarks are tools to evaluate AI systems that use different types of data like text, images, and audio. They help ensure that AI can handle complex tasks that combine these inputs effectively.
One important benchmark in this area is MMMU, which tests AI on 11,500 questions across various subjects. It requires models to work with text and visuals together, rewarding deeper understanding over shortcuts.
The design of these benchmarks, like MMMU, helps reveal how well AI understands different topics and where it may struggle. This can lead to improvements in AI technology.

Vesuvius Challenge Progress Prizes: December Edition

Vesuvius Challenge • 31 implied HN points • 24 Jan 25

🕹 Technology Computer Vision

The community is focused on improving data quality, like using better labels and refining how they categorize information. This will help them create automated tools for analyzing scrolls more effectively.
Several contributors have made significant advancements in developing new segmentation models and tools, which will help in analyzing scroll data. These innovations are key for understanding ancient texts.
2024 has been a great year for teamwork and progress as everyone shares their findings. The hard work from many people is leading to quick improvements in technology for studying historical scrolls.

FunctionGemma, GPT‑5.2-Codex, Chatterbox Turbo, A2UI, Seedance 1.5 pro, GPT Image 1.5, SAM Audio, Wan2.6, LongCat-Video-Avatar, Mistral OCR 3, Ray3 Modify, FLUX.2 [max] and more

AI Brews • 2 implied HN points • 19 Dec 25

🕹 Technology Computer Vision

AI development is accelerating around multimodal and audio‑video capabilities, with many new models that generate or edit high‑quality video, isolate sounds, and produce expressive, lip‑synced audio.
The agent and developer ecosystem is maturing fast: plugin marketplaces, open agent standards, memory‑first agents, and UI/workflow tools are making it much easier to build, extend, and deploy agentic applications.
Open‑source and specialized releases are raising the bar for core capabilities like OCR, 3D view synthesis, image generation, code/documentation automation, and semantic search, bringing more practical AI tools to developers and creators.