The hottest Multimodal models Substack posts right now

And their main takeaways

The Sequence Chat #814: Z.ai's Zixuan Li Talks About GLM

TheSequence • 266 implied HN points • 26 Feb 26

🕹 Technology Multimodal models

GLM’s core idea is to blend bidirectional understanding with strong generation using autoregressive blank infilling. It uses Mixture-of-Experts so different experts can specialize, making the model more versatile across tasks.
Open-sourcing model weights is a deliberate strategy to grow the developer ecosystem, lower barriers, and help set standards, while commercial demand is captured via managed services and enterprise support.
GLM-5 focuses on efficiency and long-horizon agent capabilities by combining sparse expert activation, sparse attention, and an asynchronous RL pipeline called slime to improve sustained planning. Product challenges for device agents are mainly error recovery and long-term context rather than just latency, and pricing may shift from tokens to outcome-based value.

Latest open artifacts (#18): Arcee, LiquidAI and Moonshot ...

Democratizing Automation • 142 implied HN points • 02 Feb 26

🕹 Technology Multimodal models

Arcee released Trinity-Large-Preview, an ultra-sparse MoE with 400B total parameters and about 13B active parameters, plus a public tech report and base models.
LiquidAI’s LFM2.5-1.2B-Instruct punches above its size, often matching larger models in tests and coming with Japanese, vision, and audio variants.
Kimi-K2.5 is a multimodal continual-pretrain model (15T tokens) that’s cheaper and stronger on coding and agent tasks, though its writing quality has slipped compared to earlier K2 models.

2025 Open Models Year in Review

Democratizing Automation • 292 implied HN points • 14 Dec 25

🕹 Technology Multimodal models

Open models made a dramatic jump in 2025, matching closed models on many benchmarks and becoming realistic options for real-world deployments beyond just privacy or fine-tuning.
A few breakout releases — notably DeepSeek R1, Qwen 3, and Kimi K2 — had outsized influence, driving wider adoption and encouraging more open licensing from major labs, especially in China.
The ecosystem exploded in scale and variety, with thousands of new models uploaded monthly, clear specialist niches and a public tiering of makers, leaving open models established and poised for further growth in 2026.

Import AI 361: GPT-4 hacking; theory of minds in LLMs; and scaling MoEs + RL

Import AI • 359 implied HN points • 19 Feb 24

🕹 Technology Multimodal models

Researchers have discovered how to scale up Reinforcement Learning (RL) using Mixture-of-Experts models, potentially allowing RL agents to learn more complex behaviors.
Recent research shows that advanced language models like GPT-4 are capable of autonomous hacking, raising concerns about cybersecurity threats posed by AI.
Adapting off-the-shelf AI models for different tasks, even with limited computational resources, is becoming easier, indicating a proliferation of AI capabilities for various applications.

Import AI 358: The US Government’s biggest AI training run; hacking LLMs by hacking GPUs; chickens versus transformers

Import AI • 319 implied HN points • 29 Jan 24

🕹 Technology Multimodal models

Hackers can exploit GPU vulnerabilities to read data from LLM sessions, highlighting security risks in AI infrastructures.
AI will enhance cyberattacks and empower malicious actors, posing a significant threat to cybersecurity by increasing the efficiency and sophistication of attacks.
The US government conducted a substantial AI training run but lags behind private industry, showcasing the need for advancements in supercomputing capabilities for large-scale AI models.

Agentic AI: Creating An AI Agent Which Can Navigate The Internet

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 17 Jul 24

🕹 Technology Multimodal models

WebVoyager is an AI agent that can browse the web by analyzing screenshots and deciding what to do next. It works like a human browsing the internet, using both visual and text information.
The agent interacts with webpages by performing actions like clicking, scrolling, and typing. This allows it to complete tasks on websites without needing help from humans.
WebVoyager's ability to handle complex web navigation shows the potential of AI agents to perform useful tasks autonomously. It learns to navigate better by using real-world websites rather than just simplified models.

The Sequence AI of the Week #769: Inside Gemini Deep Think

TheSequence • 14 implied HN points • 10 Dec 25

🕹 Technology Multimodal models

Gemini Deep Think is a “thinking layer” added on top of large multimodal models that turns a mixture-of-experts into a coordinated swarm of small reasoning agents.
It runs parallel, coordinated inference-time processes, which let it solve very hard problems and achieve state-of-the-art results on benchmarks like Olympiad-level math.
The key insight is that how you use compute at inference time matters as much as raw parameter count, pushing future model design toward dynamic runtime strategies.

Multimodal LM roundup: Unified IO 2, inputs and outputs, Gemini, LLaVA-RLHF, and RLHF questions

Democratizing Automation • 126 implied HN points • 10 Jan 24

🕹 Technology Multimodal models

Multimodal models are advancing by incorporating more diverse inputs and outputs, broadening the kinds of information they can process.
Unified IO 2 introduces a novel autoregressive multimodal model capable of generating and understanding images, text, audio, and action through shared semantic space processing.
LLaVA-RLHF explores new factually augmented RLHF techniques and datasets to bridge misalignment between different modalities and enhance multimodal models.

FunctionGemma, GPT‑5.2-Codex, Chatterbox Turbo, A2UI, Seedance 1.5 pro, GPT Image 1.5, SAM Audio, Wan2.6, LongCat-Video-Avatar, Mistral OCR 3, Ray3 Modify, FLUX.2 [max] and more

AI Brews • 2 implied HN points • 19 Dec 25

🕹 Technology Multimodal models

AI development is accelerating around multimodal and audio‑video capabilities, with many new models that generate or edit high‑quality video, isolate sounds, and produce expressive, lip‑synced audio.
The agent and developer ecosystem is maturing fast: plugin marketplaces, open agent standards, memory‑first agents, and UI/workflow tools are making it much easier to build, extend, and deploy agentic applications.
Open‑source and specialized releases are raising the bar for core capabilities like OCR, 3D view synthesis, image generation, code/documentation automation, and semantic search, bringing more practical AI tools to developers and creators.

Meet the AI researcher that is unamused with certain uses for AI.

superartificial • 19 implied HN points • 15 Mar 23

🕹 Technology Multimodal models

AI researcher Meredith Broussard warns about harmful applications of AI, emphasizing the importance of considering social factors.
OpenAI's GPT-4 upgrade will allow turning text into video, with caution advised by CEO Sam Altman.
ChatGPT has reached over 100 million users, partnering with Microsoft and facing criticism from Elon Musk.

GPT4: The quiet parts and the state of ML

Democratizing Automation • 90 implied HN points • 20 Mar 23

🕹 Technology Multimodal models

GPT4 marks a significant transition in the field of AI with large models gaining attention.
Technical discussions around GPT4 emphasize exploiting existing infrastructure and long context windows.
Societal implications of GPT4 raise concerns about safety, ethics, and power structures in AI.

Meta's V-JEPA vision models, OpenAI's Sora video model, Gemini 1.5 Pro with 1 million tokens context, Reka Flash, Largest text-to-speech AI model and more

AI Brews • 32 implied HN points • 16 Feb 24

🕹 Technology Multimodal models

OpenAI introduced Sora, a text-to-video model capable of creating detailed videos up to 60 seconds long, including characters that express vibrant emotions.
Meta AI unveiled V-JEPA, a method for teaching machines to understand the physical world by watching videos, using self-supervised learning for feature prediction.
Google announced Gemini 1.5 Pro with a context window of up to 1 million tokens, allowing for advanced understanding and reasoning tasks across different modalities like video.

Can LLMs earn $1M freelancing?

HackerPulse Dispatch • 5 implied HN points • 21 Feb 25

🕹 Technology Multimodal models

AI models are being tested to see if they can earn a million dollars through freelancing. But it turns out many of them struggle with real-world tasks.
A new video model can create high-quality videos from text descriptions. It uses advanced techniques to improve video quality and generation.
Small AI models can perform better when they are trained on easier tasks instead of trying to learn from more complex ones.

Open source AI voice cloning, Meta's full-bodied photorealistic avatars from audio, Mobile-ALOHA and more

AI Brews • 17 implied HN points • 05 Jan 24

🕹 Technology Multimodal models

Meta introduces a framework for generating photorealistic avatars from audio.
MyShell presents an open-source voice cloning approach for granular control of tone.
Stanford University introduces Mobile-ALOHA for autonomous complex mobile manipulation tasks.

AntiSuckers' Note #3

Anti-Suckers • 4 implied HN points • 12 Mar 23

🕹 Technology Multimodal models

The Anti-Suckers' Note includes tech news, recommendations, and philosophical insights.
Midjourney V5 is soon to be released with image rating features.
GPT-4, a multimodal model, is expected to be introduced by Microsoft Germany.

Mixture of Experts LLM shows real promise in healthcare; the future of AI is multimodal and multilingual; why AI is having a 1995 moment; a closer look at diffusion transformers;

Computerspeak by Alexandru Voica • 0 implied HN points • 01 Mar 24

🕹 Technology Multimodal models

Generative AI models like BiMediX, PALO, and GLaMM are advancing healthcare, language models, and image understanding in multilingual settings.
Innovative models like MobiLlama aim to make AI more accessible by running on affordable hardware and being optimized for mobile devices.
AI applications in various industries, such as journalism, construction, and e-commerce, are enhancing safety, optimizing workflows, and transforming user experiences.