What does it take to understand performance inside a modern AI inference cluster? In this fireside chat, Mansour Karam, Founder/CEO of Aria Networks, and Dylan Patel, Founder of SemiAnalysis, walk through the inference stack and pinpoint where efficiency is won or lost.
The conversation uses SemiAnalysis’ InferenceX framework as a lens on the Pareto frontier of inference: the tradeoffs among throughput, interactivity, latency, power, and cost per token. From there, it moves into the system architecture behind those numbers, including prefill and decode, storage access, distributed serving, and the demands of front-end, scale-out, and scale-up fabrics.
A central theme is the network. It touches every accelerator, links prefill to decode, and connects storage to compute. It is the layer that per-chip benchmarks miss, and where real-world performance is increasingly decided. This is the ground Aria’s Deep Networking is built on: telemetry and end-to-end fabric visibility. The conversation is essential for anyone scaling cost-efficient inference.