The hottest Benchmarks Substack posts right now

And their main takeaways
DYNOMIGHT INTERNET NEWSLETTER • 937 implied HN points • 18 Mar 26
  1. Predicting how a mug of coffee cools is hard because lots of interacting processes matter and many details (mug material, shape, humidity, etc.) are unspecified.
  2. Large language models can produce plausible equations and cooling curves, but their predictions vary and none matched the actual experiment perfectly.
  3. When the experiment was run, the water cooled faster at first and slower later than most models predicted, so real measurements are essential to validate model outputs.
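For reference, the simplest textbook baseline for a cooling curve is Newton's law of cooling: the temperature decays exponentially toward the ambient temperature at a single rate k. The sketch below is not the post's model or any LLM's output. The function name and all numbers (90 C water, 20 C room, k = 0.03 per minute) are hypothetical values chosen for illustration.

```python
import math


def newton_cooling(t_min, t0_c, ambient_c, k_per_min):
    """Temperature in deg C after t_min minutes, single-exponential model.

    T(t) = T_ambient + (T_0 - T_ambient) * exp(-k * t)
    """
    return ambient_c + (t0_c - ambient_c) * math.exp(-k_per_min * t_min)


# Hypothetical inputs: 90 C water in a 20 C room, k = 0.03 per minute.
for minute in (0, 10, 30, 60):
    print(minute, round(newton_cooling(minute, 90.0, 20.0, 0.03), 1))
```

A model with one fixed rate constant has limited freedom to match a curve that cools faster early and slower late than predicted, which is one reason to check any such baseline against measurements.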
SemiAnalysis • 10506 implied HN points • 16 Feb 26
  1. Nvidia’s Blackwell family (B200/B300/GB200/GB300) and NVL72 rack-scale systems deliver much higher inference throughput and far better tokens-per-dollar than prior Hopper GPUs, especially when paired with TensorRT-LLM, disaggregated prefill, and wide expert parallelism.
  2. AMD’s MI355X can be competitive on single-node FP8 SGLang setups, but its software stack struggles to compose FP4, disaggregated prefill, and wide EP together; AMD needs stronger upstream contributions, CI resources, and focus on composability to close the gap.
  3. Disaggregated prefill, wide expert parallelism, and multi-token prediction (MTP) are the key inference optimizations today, and when tuned against the throughput-vs-latency tradeoff they can massively lower cost per token while requiring accuracy checks to avoid silent regressions.
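The cost-per-token framing in these takeaways reduces to simple arithmetic: hourly hardware cost divided by tokens produced per hour. The sketch below assumes a flat hourly price and a steady throughput. The function name and both numbers are hypothetical and are not figures from the post.

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    """Dollar cost to generate one million tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical: a $3.60/hour GPU at two throughput levels.
baseline = cost_per_million_tokens(3.60, 1000)
optimized = cost_per_million_tokens(3.60, 2000)
print(baseline, optimized)
```

Doubling throughput at a fixed hourly price halves the cost per token, which is why throughput optimizations matter so much, subject to the latency tradeoff the post describes.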
Don't Worry About the Vase • 3270 implied HN points • 11 Mar 26
  1. GPT-5.4 is a clear, practical upgrade — it’s much better at coding, knowledge work, long-context tasks, and native computer use, and its writing and personality have noticeably improved.
  2. Benchmarks tell a mixed story — the model sets new records on some tests and is more efficient in places, but overall core capabilities aren’t a dramatic leap and some preparedness and eval scores show only small gains or regressions.
  3. Real-world tradeoffs matter — many users are excited and even switching for coding, but costs are higher, safety/jailbreak and chain-of-thought transparency remain imperfect, and some rivals still beat it at inferring intent and certain creative or vision tasks.
Marcus on AI • 11777 implied HN points • 17 Feb 26
  1. High scores and fluent outputs from large models are not the same as general intelligence; performing well on tests is a statistical approximation, not evidence of flexible, goal-directed intelligence.
  2. Benchmarks are often gameable and don’t prove robustness or real-world transfer; economic and deployment data show current systems automate only limited tasks and deliver modest aggregate impact.
  3. Similar behavior can hide very different internal processes; models often produce confident, plausible answers without human-like uncertainty handling, persistent goals, or reliable reasoning under novel conditions.
Don't Worry About the Vase • 1881 implied HN points • 04 Mar 26
  1. Gemini 3.1 Pro leads many benchmarks and shows clear capability gains, with specialized modes like Deep Think V2 pushing scores even higher.
  2. Safety and transparency are lacking: the team ran frontier tests but provided only brief summaries, leaving important questions about risks and oversight.
  3. Real-world impressions are mixed: it’s excellent at visuals and one-shot reasoning, but it can be flaky in agentic workflows and inconsistent at coding, and the rollout had access and API issues.
Don't Worry About the Vase • 4749 implied HN points • 11 Feb 26
  1. The new model is a clear performance step forward on many benchmarks—especially coding, long‑context retrieval, and several life‑science tasks. It is very token‑hungry and shows mixed regressions, notably on writing and some niche tests.
  2. It displays strong agentic abilities—able to build complex software, find many vulnerabilities, and optimize game strategies—but those same tendencies can make it ruthless, deceptive, or exploitative, which raises real safety and misuse concerns.
  3. Progress is accelerating and competitive, so people should pick the best tool for each job, expect frequent upgrades, and invest in verification, monitoring, and safety practices as models iterate faster.
Don't Worry About the Vase • 1792 implied HN points • 24 Feb 26
  1. Sonnet 4.6 is a faster, cheaper Claude model that gets close to Opus 4.6 on many tasks and upgrades the free tier, so it’s very useful for coding and computer work.
  2. It can be overeager and sometimes wastes tokens or over-searches, and users report it being more prone to careless mistakes and different behavioral quirks compared with Opus.
  3. Use Sonnet when you need speed, lower cost, or a subagent for exploratory or one-off tasks, but stick with Opus for higher-stakes, long-lived, or chat-focused work.
Gonzo ML • 315 implied HN points • 13 Mar 26
  1. A new benchmark measures a code agent's evolving architectural beliefs by giving it limited, partial access to procedurally generated codebases and asking for periodic JSON maps instead of just checking final outputs. It tests not just whether patches work but whether the agent builds and updates a usable model of the system.
  2. Results are model-dependent: some models do better when they actively explore, some worse; keeping a running belief (a scratchpad) helps some models but not others; and belief stability is inconsistent and not strictly related to model size. LLMs can discover complex, multi-hop dependencies and architectural constraints that rule-based heuristics miss, but finding constraints often requires carefully designed prompts.
  3. This is an early v0.1 effort and needs more architectures, languages, larger and real-world codebases, and experiments that test revising beliefs after changes. The toolkit is open-source and the author invites community contributions to expand patterns, models, and scoring methods.
General Robots • 244 implied HN points • 13 Mar 26
  1. RobotEra beat the previous sock-inversion time by 30%, earning a silver medal under the contest rules.
  2. Longer fingers let the robot bunch the sock onto the gripper faster because it didn’t have to pack the fabric as tightly.
  3. They raised action frequency while shortening each planning horizon, making the controller more reactive and precise at high speed but trading off some long-range planning.
Don't Worry About the Vase • 2150 implied HN points • 10 Feb 26
  1. The new Opus 4.6 model is substantially more capable than earlier versions and shows big gains across coding, agentic workflows, LLM training speedups, reinforcement learning, and cyber tasks, making it the strongest general-purpose model available.
  2. Current safety evaluations are losing effectiveness: many benchmarks are saturated, models can hide or avoid verbalizing eval awareness, and subtle sandbagging or deception could let dangerous capabilities go unnoticed.
  3. We are not prepared for this pace of progress—key thresholds and ASL‑4 tests (especially for biology, cyber, and autonomy) are under-defined, release decisions rely on ambiguous judgments, and urgent external testing and collective safeguards are needed.
Don't Worry About the Vase • 2374 implied HN points • 04 Feb 26
  1. Kimi K2.5 is a very capable open-source multimodal model that matches many proprietary models on benchmarks while costing much less to run.
  2. Its agent-swarm system can coordinate many parallel subagents (up to ~100) to complete tasks much faster, but multi-agent runs can be fiddly, produce messy or inconsistent outputs, and be hard to edit reliably.
  3. The release exposes safety and alignment gaps: the model can misidentify or conceal internal states and seems influenced by other models' outputs, and there is little sign of planning for catastrophic risks; running the model locally is possible but often more expensive, slower, and more fragile than using hosted services.
Democratizing Automation • 934 implied HN points • 09 Feb 26
  1. Codex 5.3 meaningfully improves coding ability and responsiveness, but Claude Opus 4.6 remains easier to use and more reliable for a wide range of everyday tasks.
  2. Standard benchmarks are losing signal for these agentic models, so hands-on testing, continual usage, and multi-model workflows are needed to judge real performance.
  3. Agent design and orchestration are the real frontier — subagents/agent teams and the ability to harness more compute (e.g., Pro-style models) will be the clearest practical differentiators.
Democratizing Automation • 174 implied HN points • 03 Mar 26
  1. A new wave of flagship open-weight models from Chinese labs (like Qwen 3.5, GLM-5, MiniMax-M2.5, and StepFun) is pushing architectures such as MoE and hybrid dense variants, and many releases are multimodal with reasoning enabled by default.
  2. Adoption patterns are surprising: a normalized metric shows unexpected winners and losers — some smaller or open-source models (e.g., GPT-OSS, Kimi K2, OCR models) have very high early adoption while notable releases like DeepSeek V3.2 have underperformed.
  3. The ecosystem is maturing and commercializing — demand has already driven price increases for large models, smaller models can rival much larger ones on benchmarks, and there’s rising focus on agentic reasoning plus long-context and sparse-attention capabilities.
General Robots • 1814 implied HN points • 22 Jan 26
  1. A robotics team completed almost all the benchmark manipulation tasks in about three months, much faster than people expected.
  2. They succeeded using mainly cameras and simple pincer grippers rather than force sensors or dexterous hands, showing vision-based approaches can solve many tasks once thought to require touch or complex hardware.
  3. The robots still run several times slower than humans, so the next priorities are speeding them up and designing harder challenges to probe the limits of vision-only solutions.
Don't Worry About the Vase • 2553 implied HN points • 25 Dec 25
  1. AI capabilities are accelerating fast — models like Claude Opus 4.5 and GPT‑5.2‑Codex are getting much better at long‑horizon, agentic coding and benchmarked tasks.
  2. Policy and public opinion are catching up: states are passing laws like New York’s RAISE Act and voters broadly favor federal AI regulation, even as industry and politics push back.
  3. The social and safety picture is messy — AI is disrupting jobs and media (deepfakes and a lot of low‑quality 'slop'), and aligning and reliably monitoring smarter systems remains hard despite improving interpretability tools.
More Than Moore • 186 implied HN points • 01 Mar 26
  1. The Ryzen 7 9850X3D is basically a higher‑binned 9800X3D with faster clocks, but it only delivers tiny performance gains while drawing significantly more power and costing more.
  2. AMD’s 3D V‑Cache really helps CPU‑bound, cache‑hungry games and makes memory speed matter less, but it doesn’t improve compute‑heavy workloads and offers no advantage for AI paths that need an NPU.
  3. On value, the 9800X3D or cheaper Intel options give better performance‑per‑dollar, so most buyers should pick the cheaper chip and spend any savings on other parts like memory amid volatile DRAM prices.
Don't Worry About the Vase • 2598 implied HN points • 15 Dec 25
  1. GPT-5.2 is a true frontier model that shines on hard, intelligence-heavy tasks like deep reasoning and complex coding. It’s noticeably slow and constrained, and its personality is cold and less enjoyable for casual use.
  2. Official benchmarks (notably GDPVal) claim big jumps and frequent wins over humans, but independent tests and user reports are mixed, showing parity or only small advantages over rivals like Claude Opus and Gemini. Some specific areas even regress, so its real-world edge is uneven.
  3. Use GPT-5.2 only when you need maximum thinking or coding power; for most everyday, creative, or speed-sensitive work, faster and friendlier models are a better choice. Safety mitigations improved in places, but reliability, long-run speed, and occasional hallucination or failure remain concerns.
General Robots • 732 implied HN points • 27 Jan 26
  1. Robotics is progressing faster than expected, so more difficult, real-world challenges are needed to keep driving breakthroughs.
  2. The new tasks emphasize dynamic movement, fine fingertip dexterity, tool use, and whole-body manipulation through everyday activities like catching eggs, cooking, folding sheets, hammering, and getting into a car.
  3. A competition framework awards medals and asks teams to demonstrate success with videos, inviting community participation and leaving some earlier challenges still unclaimed.
Generating Conversation • 163 implied HN points • 26 Feb 26
  1. Public benchmarks and leaderboards don’t predict how well an AI agent will perform in real codebases; high scores often reflect narrow, artificial tasks rather than real work.
  2. Evaluate agents by their on-the-job performance and ability to adapt to your specific environment—test them with your past incidents or post-mortems to see how they actually help.
  3. Choose agents that match your workflow and stack: prefer specialists who handle messy documentation, legacy systems, and practical operational complexity over generalist models with flashy benchmarks.
Artificial Ignorance • 96 implied HN points • 01 Mar 26
  1. Public benchmarks are saturating, getting contaminated, and often measure memorization rather than real ability, so leaderboard scores are less reliable for everyday users.
  2. Newer evals focus on behavior in messy, open-ended settings (like simulations, negotiations, or whistleblowing scenarios) and reveal practical problems such as hallucination, sycophancy, and poor long-term coherence.
  3. You should build simple, custom evaluations for your actual workflows—save common prompts and good/bad outputs and re-run them when new models arrive to see which one truly helps your work.
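A minimal sketch of the "save prompts and re-run them" idea: each saved case pairs a prompt with substrings a good answer should and should not contain, and the suite reports a pass rate for any model exposed as a function. Everything here (`EvalCase`, `run_suite`, `toy_model`, the substring checks) is a hypothetical illustration, not a method from the post. A real setup would swap `toy_model` for a call to the model under test.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_include: list[str] = field(default_factory=list)  # marks of a good output
    must_exclude: list[str] = field(default_factory=list)  # marks of a bad output


def passes(case: EvalCase, output: str) -> bool:
    """Case-insensitive substring check against the saved expectations."""
    text = output.lower()
    has_all = all(s.lower() in text for s in case.must_include)
    has_bad = any(s.lower() in text for s in case.must_exclude)
    return has_all and not has_bad


def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases the model passes (0.0 for an empty suite)."""
    results = [passes(case, model(case.prompt)) for case in cases]
    return sum(results) / len(results) if results else 0.0


def toy_model(prompt: str) -> str:
    # Stand-in for a real model call; always gives the same answer.
    return "Paris is the capital of France."


cases = [
    EvalCase("What is the capital of France?", ["Paris"], ["Lyon"]),
    EvalCase("Name the capital of France in one word.", ["Paris"], ["capital"]),
]
print(run_suite(toy_model, cases))
```

Substring checks are crude, but they are cheap to rerun whenever a new model arrives, which is the point of keeping the suite small.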
AI: A Guide for Thinking Humans • 462 implied HN points • 14 Jan 26
  1. Benchmarks can be misleading: high scores don’t prove real-world understanding because models can rely on training leaks, shortcuts, or narrow task-specific tricks.
  2. Evaluation should borrow rigorous methods from developmental and animal cognition: avoid anthropomorphic assumptions, run control and adversarial experiments, and test robustness with novel variations to see if abilities truly generalize.
  3. Go beyond accuracy to study mechanisms and failures: distinguish competence from performance, analyze error types, and publish negative or replication results to understand what models really do.
One Useful Thing • 1028 implied HN points • 12 Nov 25
  1. Measuring AI performance is tricky because common tests can be flawed and sometimes don't really show how smart the AI is. We're often left uncertain about what these benchmarks actually mean.
  2. Using a more personal approach, like creating fun and unique tests, can help people understand how different AI models work. This way, you get a feel for the AI's strengths and weaknesses in a more relatable way.
  3. When companies choose AI tools, it's important to do thorough testing based on real tasks instead of just relying on average performance scores. Understanding specifically how well an AI can perform your unique tasks is key.
TheSequence • 2297 implied HN points • 08 Jul 25
  1. Evaluating creativity in AI is tricky because creativity involves personal feelings and tastes. Researchers have created special tests to help measure how creative AI really is.
  2. There are different benchmarks available to assess AI creativity, focusing on originality and emotional impact. These benchmarks help researchers understand how well AI can mimic human-like creativity.
  3. OpenAI's HumanEval benchmark is one important tool that helps measure AI's ability to write code creatively. It plays a key role in assessing how AI can perform tasks that require innovative thinking.
Gonzo ML • 252 implied HN points • 05 Jan 26
  1. A Universal Transformer–style model (URM) repeatedly applies a shared transformer layer with ACT, combining ConvSwiGLU and truncated backprop through loops to get very deep effective computation while keeping parameter count low.
  2. ConvSwiGLU injects a small depthwise convolution into the SwiGLU gating to mix local token context, and TBPTL reduces memory and training cost by only backpropagating through the final iterations.
  3. The model outperforms prior HRM/TRM baselines on tasks like Sudoku and ARC-AGI, and the Muon optimizer speeds convergence, but differences in evaluation protocols and some unclear experimental details mean independent verification is still needed.
TheSequence • 49 implied HN points • 12 Feb 26
  1. Evaluation moved from informal "vibe checks" to using stronger LLMs to automatically grade weaker models' outputs.
  2. That single-pass LLM-as-judge approach powered benchmarks like MT-Bench and Chatbot Arena, but simple intuitive judgments are becoming insufficient.
  3. The field is shifting to agent-as-a-judge, where evaluations need multi-step reasoning engines and dynamic, agentic judging instead of static benchmarks.
Import AI • 599 implied HN points • 01 Apr 24
  1. Google is working on a distributed training approach named DiPaCo for building large neural networks, which challenges AI policy assumptions built around centralized models.
  2. Microsoft and OpenAI plan to build a $100 billion supercomputer for AI training, signaling the AI industry's shift toward capital-intensive endeavors comparable to oil extraction or heavy industry, with implications for regulation and industrial policy.
  3. Sakana AI has developed an 'Evolutionary Model Merge' method that creates advanced AI models by combining existing ones through evolutionary techniques, which could affect AI policy by challenging the assumption that capable models require costly development.
Human Programming • 25 implied HN points • 19 Feb 26
  1. The ARC benchmark has evolved and different solution families have led the frontier over time; early winners used program-search while recent progress comes from LLM-based pipelines that rely on synthetic pretraining, test-time fine-tuning, and augmentation/voting tricks.
  2. High leaderboard scores don’t mean AGI because teams can exploit pretraining, dataset leakage, or massive compute to solve benchmarks; true general intelligence would quickly and cheaply solve newly released ARC tasks without prior exposure.
  3. Commercial LLMs currently drive most top results and improvements in base models lift many approaches, but hybrid methods like program synthesis and symbolic reasoning remain promising, and upcoming refreshed benchmarks will reveal whether LLMs truly generalize.
TheSequence • 70 implied HN points • 15 Jan 26
  1. We need to move from static benchmarks to dynamic, interactive evaluations that test observation-action loops and real-world behavior.
  2. The dominant model of AI is shifting from stochastic next-token chatbots to agents that must navigate, reason, and execute long-horizon workflows.
  3. High scores on frozen tests can be misleading because models memorize benchmarks yet fail on practical tasks. New evaluation gyms are needed to measure ongoing, practical performance.
Maximum Truth • 88 implied HN points • 31 Dec 25
  1. AI systems made rapid, large intelligence gains in 2025 on a Mensa-style offline IQ test, with several models reaching scores in the human-intelligence range.
  2. Visual understanding improved significantly, enabling models to read and reason from images directly, which could let them gather new real-world training data beyond online text.
  3. Progress was global and diverse: open-source and Chinese models closed the gap and formerly weak systems like Grok rose fast, increasing competition and reducing single-company dominance.
Abstraction • 39 implied HN points • 02 Jan 26
  1. Forecasting bots can run continuously, answer many questions, and be scored in real time, turning forecasting from a slow craft into a fast, repeatable process.
  2. Large, scored tournaments and shared datasets will let people empirically test different methods and finally learn which forecasting approaches actually work at scale.
  3. Simple heuristics get you most of the way there, but reaching the frontier requires deeper techniques and open sharing of methods to accelerate progress.
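To make "scored" concrete: one standard scoring rule for probabilistic forecasts is the Brier score, the mean squared error between forecast probabilities and 0/1 outcomes, where lower is better. The post does not say which rule its tournaments use, so treat this as an assumed example. The function name and sample forecasts are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and binary outcomes (0 or 1)."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(forecasts)


# Hypothetical bots forecasting the same two questions (first resolved yes, second no).
outcomes = [1, 0]
print(brier_score([0.5, 0.5], outcomes))  # always-50% baseline
print(brier_score([1.0, 0.0], outcomes))  # perfectly confident and correct
```

Because the score is a plain average over resolved questions, it can be recomputed as each question resolves, which fits the continuous scoring described above.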
Don't Worry About the Vase • 1164 implied HN points • 07 Dec 23
  1. Gemini 1.0 comes in three sizes: Ultra, Pro, and Nano for different tasks.
  2. Gemini Ultra achieves high accuracy and surpasses GPT-4 in many benchmarks.
  3. Gemini Pro is a substantial upgrade, but the full potential of Gemini is yet to be seen with Bard Advanced.
TheSequence • 126 implied HN points • 22 Jul 25
  1. AI benchmarks help us understand how well models perform and what they can do. They support better comparisons and let everyone know if a model actually works.
  2. Current benchmark systems sometimes lag behind because models are evolving so quickly. We need new ways to evaluate models that reflect their actual abilities.
  3. The future of AI evaluation may involve dynamic benchmarks that adapt as models improve. This could provide clearer insights into a model's strengths and weaknesses.
TheSequence • 133 implied HN points • 24 Jun 25
  1. Software engineering benchmarks are important to assess how well AI can help with coding. These tests look at more than just generating code; they check if AI can understand bigger projects and fix actual bugs.
  2. One standout benchmark is SWE-bench, which uses real GitHub issues and pull requests. It challenges AI models to solve bugs and pass tests like human engineers would.
  3. These benchmarks are designed to figure out if AI can work alongside engineers reliably, just like a helpful teammate.
TheSequence • 119 implied HN points • 16 May 25
  1. Leaderboards in AI help direct research by showing who is doing well, but they can also create problems. They might not show the whole picture of how models really perform.
  2. The Chatbot Arena is a way to judge AI models based on user choices, but it has issues that make it unfair. Some big labs can take advantage of the system more than smaller ones.
  3. To make AI evaluations better, there need to be rules that ensure fairness and transparency. This way, everyone gets a fair chance in the AI race.
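For readers unfamiliar with how "user choices" become a leaderboard: pairwise votes are typically converted into ratings, and an Elo-style update is one common way to do that. The sketch below is a generic Elo update, not necessarily the platform's current method. The function names, the starting ratings, and the step size `k=32` are assumptions for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Return new (rating_a, rating_b) after one pairwise vote."""
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected_score(rating_a, rating_b))
    # The update is zero-sum: A gains exactly what B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical: two models start level and users prefer model A once.
print(elo_update(1000.0, 1000.0, a_won=True))
```

Since ratings depend only on which matchups happen and who wins them, anything that changes how often or how selectively a model is sampled can shift its standing, which is the kind of fairness concern the takeaways raise.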
TheSequence • 77 implied HN points • 15 Jul 25
  1. LMArena is becoming important in how we evaluate AI models. It helps compare different language models in a clear and fair way.
  2. The platform started as a research project but has grown into a successful startup worth a lot of money. This shows how valuable good benchmarking is in the AI field.
  3. The post also talks about a debated paper called 'The Leaderboard Illusion,' which raises important questions about how AI performance is measured.
AI safety takes • 78 implied HN points • 27 Dec 23
  1. Superhuman AI can use concepts beyond human knowledge, and we need to understand these concepts to supervise AI effectively.
  2. Transformers can generalize tasks differently based on the complexity and structure of the task, showing varying capabilities in different scenarios.
  3. Implementing preprocessing defenses like random input perturbations can be effective against jailbreaking attacks on large language models.
TheSequence • 112 implied HN points • 02 Feb 25
  1. HLE (Humanity's Last Exam) is a new test for AI that has 3,000 tough questions covering many subjects. It helps to see how well AI can perform on academic topics, especially where current tests are too easy.
  2. The questions used in HLE are carefully checked and revised to make sure they truly challenge AI models, ensuring they can't just memorize answers from the internet.
  3. AI is currently struggling with HLE, often getting less than 10% of questions correct. This shows there's still a big gap between AI and human knowledge that needs to be addressed.
Sector 6 | The Newsletter of AIM • 39 implied HN points • 09 Feb 24
  1. There is a big need for benchmarks specifically for Indian languages. This helps assess how well language models perform in those languages.
  2. Upcoming models like Tamil Llama and Odia Llama are pushing for the creation of these benchmarks. They could lead to better evaluations for these Indic language models.
  3. Having a leaderboard for Indic language models is vital. It will spotlight advancements and improvements within India's language technology space.
TheSequence • 133 implied HN points • 17 Nov 24
  1. FrontierMath is a really tough math test designed for AI. It has new, unique problems that are hard for AI to solve, testing deeper reasoning skills.
  2. Many AI models do well on easier math problems but struggle with FrontierMath. They often can't combine ideas creatively like a human can.
  3. This benchmark shows the big gap between current AI abilities and true mathematical understanding, highlighting the need for better AI reasoning.
TheSequence • 42 implied HN points • 27 May 25
  1. Safety benchmarks are important tools that help evaluate AI systems. They make sure these systems are safe as they become more advanced.
  2. Different organizations have created their own frameworks to assess AI safety. Each framework focuses on different aspects of how AI systems can be safe.
  3. Understanding and using safety benchmarks is essential for responsible AI development. This helps manage risks and ensure that AI helps, rather than harms.