In this episode of Alexa’s Input (AI), I sat down with Rob Shaw from Red Hat to talk about how AI inference evolved from a simple model serving problem into a large-scale distributed systems problem.
We explored the infrastructure shifts behind modern LLM serving, including how vLLM and PagedAttention changed the economics and efficiency of inference, why KV cache management became one of the most important bottlenecks in production AI systems, and how orchestration layers like llm-d are emerging to coordinate distributed inference.
We also discussed:
how LLM inference differs from traditional model serving runtimes
KV cache, prefix caching, and cache-aware routing
why throughput and latency became major infrastructure challenges
long-context agents and repeated inference calls
distributed inference on Kubernetes
intelligent routing, flow control, and load balancing
prefill/decode disaggregation
enterprise AI deployment realities
vLLM has become one of the most important open-source projects in AI infrastructure, and llm-d represents a newer shift toward treating inference as a coordinated distributed system rather than just a single runtime problem.
If you want to better understand the systems layer beneath modern AI applications, this episode is a deep dive into where inference infrastructure is heading next.
General Podcast Links
Watch: https://www.youtube.com/@alexa_griffith
Read: https://alexasinput.substack.com/
Listen: https://creators.spotify.com/pod/profile/alexagriffith/
More: https://linktr.ee/alexagriffith
Learn more about the host at:
Website: https://alexagriffith.com/
LinkedIn: https://www.linkedin.com/in/alexa-griffith/
Find out more about the guest at:
LinkedIn: https://www.linkedin.com/in/robert-shaw-1a01399a/
Red Hat Articles: https://developers.redhat.com/author/robert-shaw
GitHub: https://github.com/robertgshaw2-redhat
Resources
vLLM Website: https://vllm.ai/
vLLM GitHub Repository: https://github.com/vllm-project/vllm
llm-d Website: https://llm-d.ai/
llm-d GitHub Repository: https://github.com/llm-d/llm-d
Keywords
AI inference, vLLM, llm-d, distributed inference, GPU optimization, open source AI, Kubernetes, multi-cluster deployment, AI infrastructure, enterprise AI, model optimization, speculative decoding, mixture of experts, AI deployment, performance tuning, AI systems, neural network scaling
Key Topics
Evolution of vLLM and llm-d
Distributed inference and routing
GPU utilization and performance optimization
Open source AI infrastructure
Enterprise deployment challenges and solutions
Standardization in Kubernetes for NIC exposure
Performance optimizations: quantization and speculative decoding
Mixture of experts architecture and parallelism strategies
Flow control and request scheduling in AI systems
Emerging hardware for AI inference, including the Cerebras processor
Reinforcement learning and AI system support
Modular architecture of vLLM and ecosystem projects
