SemiAnalysis • 10506 implied HN points • 16 Feb 26
- Nvidia’s Blackwell family (B200/B300/GB200/GB300) and NVL72 rack-scale systems deliver much higher inference throughput and far better tokens-per-dollar than prior Hopper GPUs, especially when paired with TensorRT-LLM, disaggregated prefill, and wide expert parallelism.
- AMD’s MI355X can be competitive in single-node FP8 setups running SGLang, but its software stack struggles to combine FP4, disaggregated prefill, and wide expert parallelism (EP) in one deployment; to close the gap, AMD needs stronger upstream contributions, more CI resources, and a sharper focus on composability.
- Disaggregated prefill, wide expert parallelism, and multi-token prediction (MTP) are the key inference optimizations today. Tuned to the right point on the throughput-versus-latency tradeoff, they can sharply lower cost per token, but each needs accuracy checks to catch silent regressions.
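
To make the throughput-versus-latency tradeoff concrete, here is a minimal Python sketch of how cost per token could be compared across serving configurations. It is an illustration under stated assumptions, not the methodology used by SemiAnalysis: the `OperatingPoint` type, the function names, the GPU hourly price, and every throughput figure are hypothetical placeholders rather than measured results.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OperatingPoint:
    """One serving configuration (hypothetical type, for illustration only)."""
    name: str
    tokens_per_sec_per_gpu: float   # aggregate throughput, normalized per GPU
    tokens_per_sec_per_user: float  # interactivity seen by a single user


def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_sec_per_gpu: float) -> float:
    """Dollars per one million generated tokens for a single GPU."""
    if tokens_per_sec_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


def cheapest_point(
    points: Sequence[OperatingPoint],
    gpu_hour_usd: float,
    min_tokens_per_sec_per_user: float,
) -> Optional[OperatingPoint]:
    """Return the lowest-cost point that still meets the interactivity floor."""
    eligible = [
        p for p in points
        if p.tokens_per_sec_per_user >= min_tokens_per_sec_per_user
    ]
    if not eligible:
        return None  # no configuration is interactive enough
    return min(
        eligible,
        key=lambda p: cost_per_million_tokens(gpu_hour_usd, p.tokens_per_sec_per_gpu),
    )


# Placeholder numbers, NOT benchmark results: larger batches raise per-GPU
# throughput but lower the tokens/sec each individual user sees.
EXAMPLE_POINTS = [
    OperatingPoint("high-batch", 3600.0, 20.0),
    OperatingPoint("balanced", 1800.0, 50.0),
    OperatingPoint("low-latency", 600.0, 120.0),
]
```

Calling `cheapest_point(EXAMPLE_POINTS, gpu_hour_usd=3.6, min_tokens_per_sec_per_user=40.0)` picks the cheapest configuration among those fast enough for the user. The design point is that cost per token only means something at a fixed interactivity target, so the latency floor is applied before costs are compared.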