The hottest Multimodal Substack posts right now

A new wave of flagship open-weight models from Chinese labs (like Qwen 3.5, GLM-5, MiniMax-M2.5, and StepFun) is pushing architectures such as MoE and hybrid dense variants, and many releases are multimodal with reasoning enabled by default.
Adoption patterns are surprising: a normalized metric shows unexpected winners and losers — some smaller or open-source models (e.g., GPT-OSS, Kimi K2, OCR models) have very high early adoption while notable releases like DeepSeek V3.2 have underperformed.
The ecosystem is maturing and commercializing — demand has already driven price increases for large models, smaller models can rival much larger ones on benchmarks, and there’s rising focus on agentic reasoning plus long-context and sparse-attention capabilities.

Bluesky builds Discover personalization from fixed post embeddings (BLIP2) plus broad topic labels and finer HDBSCAN clusters to track user interests, after an initial two‑tower retrieval approach didn’t work out.
PinnerSage captures diverse short‑ and long‑term interests by clustering a user’s recent interactions into many medoids, scoring each cluster with a time‑decay importance, and using those medoids as weighted seeds for ANN candidate retrieval.
Multiple per‑user medoids ease retrieval but complicate ranking, so the plan is to use PinnerSage for candidate generation and then adopt a transformer (PinnerFormer) to create a single user embedding for efficient, accurate ranking.
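
The PinnerSage-style flow summarized above (cluster recent interactions, keep one medoid per cluster, weight clusters by time-decayed importance, seed retrieval from the medoids) can be sketched roughly as follows. This is a minimal illustration, not the production method: the greedy radius clustering stands in for whatever clustering the real system uses, brute-force nearest-neighbor search stands in for an ANN index, and every name and parameter here (`user_interests`, `retrieve`, `radius`, `decay`) is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


@dataclass
class Interest:
    medoid: Vector      # an actual interacted item, not an averaged centroid
    importance: float   # time-decayed weight of the cluster


def cluster_greedy(embeddings: Sequence[Vector], radius: float) -> List[List[int]]:
    """Assumed stand-in clustering: join the first cluster whose seed is within radius."""
    clusters: List[List[int]] = []
    for i, e in enumerate(embeddings):
        for members in clusters:
            if sq_dist(e, embeddings[members[0]]) <= radius ** 2:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def medoid_index(members: List[int], embeddings: Sequence[Vector]) -> int:
    """The member with the smallest total distance to the rest of its cluster."""
    return min(members, key=lambda i: sum(sq_dist(embeddings[i], embeddings[j]) for j in members))


def user_interests(embeddings: Sequence[Vector], ages_days: Sequence[float],
                   radius: float = 0.5, decay: float = 0.1) -> List[Interest]:
    """One Interest per cluster; importance sums exp(-decay * age) over its members."""
    interests = []
    for members in cluster_greedy(embeddings, radius):
        m = medoid_index(members, embeddings)
        importance = sum(math.exp(-decay * ages_days[i]) for i in members)
        interests.append(Interest(embeddings[m], importance))
    return sorted(interests, key=lambda it: it.importance, reverse=True)


def retrieve(interests: List[Interest], catalog: Dict[str, Vector], budget: int) -> List[str]:
    """Split the candidate budget across medoids in proportion to importance.

    Brute-force search is used here only for illustration; a real system
    would query an ANN index with each medoid instead.
    """
    total = sum(it.importance for it in interests)
    results: List[str] = []
    for it in interests:
        quota = max(1, round(budget * it.importance / total))
        ranked = sorted(catalog, key=lambda k: sq_dist(catalog[k], it.medoid))
        for item_id in ranked[:quota]:
            if item_id not in results:
                results.append(item_id)
    return results[:budget]


# Three old interactions near the origin, two recent ones near (5, 5).
history = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
ages = [30.0, 30.0, 30.0, 1.0, 1.0]
interests = user_interests(history, ages)

catalog = {"a": (5.0, 5.2), "b": (0.2, 0.0), "c": (9.0, 9.0), "d": (4.9, 5.0)}
candidates = retrieve(interests, catalog, budget=3)
```

In this toy run the smaller but more recent cluster outranks the larger, older one, so its medoid receives most of the candidate budget. Using a medoid rather than a mean keeps each seed a real item embedding, which is the property the summary highlights for retrieval.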

2026 is the Integration Era: AI stops being a party trick and gets embedded into work and products through autonomous agents, generative UIs, and multimodal/physical capabilities. User experience and agent management, not raw model IQ, become the primary business differentiators.
A compute-driven two-tier world will emerge: persistent shortages and costly inference mean premium subscribers get powerful, multimodal agents while most people use weaker eco-models. This forces tiered pricing and compute-aware product design, and it widens professional and economic divides.
Human roles shift toward judgment, oversight, and trust work: people will focus on setting goals, auditing agent decisions, designing guardrails, and training via apprenticeships. New risks like AI-powered dark patterns will create demand for defensive agents, governance, and stronger UX ethics.

OpenAI released multimodal capabilities for ChatGPT, allowing interaction with chatbots through images and audio.
The rollout of multimodal features was gradual, starting with select premium users, and the timeline for wider access is still unclear.
The potential for voice interaction in AI is seen as a significant advancement, promising more intuitive user experiences than text input.

Generative AI technology is advancing rapidly and impacting the development of products.
Dubverse tools focus on converting communication artifacts across languages using various modalities.
Challenges in language translation can be addressed through emerging Generative AI techniques.

Neural networks trained on diverse tasks tend to converge to similar low-dimensional weight subspaces, implying a shared parametric backbone that could make transfer learning and model reuse much more efficient.
System-and-algorithm co-design now enables large diffusion models to run in real time for streaming avatars (20 FPS on a 14B model), showing practical deployment of big generative models for live video.
A 210-task benchmark shows current data agents succeed on under 20% of engineering tasks and under 40% of analysis tasks, revealing major gaps in orchestration and reasoning for enterprise workflows.

AI is rapidly advancing, especially in the medical field.
New technology like ImageBind can link different types of data in a shared embedding space, using images as the common basis.
Fine-tuning language models with a small number of prompts can significantly improve performance.

A new textbook on RLHF is in progress; it aims to help readers learn the topic, and reader feedback is being used to improve the content.
Qwen 2.5 models are showing strong performance, competing well with models like Llama 3.1, but have less visibility in the community.
Several new models and datasets have been released, including some interesting multimodal options that can handle both text and images.