Alexa's Input (AI)

How vLLM and llm-d Changed AI Inference with Rob Shaw

1 hr 43 min · June 3, 2026

In this episode of Alexa’s Input (AI), I sat down with Rob Shaw from Red Hat to talk about how AI inference evolved from a simple model serving problem into a large-scale distributed systems problem.

We explored the infrastructure shifts behind modern LLM serving, including how vLLM and PagedAttention changed the economics and efficiency of inference, why KV cache management became one of the most important bottlenecks in production AI systems, and how orchestration layers like llm-d are emerging to coordinate distributed inference.

We also discussed:

  • how LLM inference differs from traditional model serving runtimes

  • KV cache, prefix caching, and cache-aware routing

  • why throughput and latency became major infrastructure challenges

  • long-context agents and repeated inference calls

  • distributed inference on Kubernetes

  • intelligent routing, flow control, and load balancing

  • prefill/decode disaggregation

  • enterprise AI deployment realities

vLLM has become one of the most important open-source projects in AI infrastructure, and llm-d represents a newer shift toward treating inference as a coordinated distributed system rather than just a single runtime problem.

If you want to better understand the systems layer beneath modern AI applications, this episode is a deep dive into where inference infrastructure is heading next.


General Podcast Links

Watch: https://www.youtube.com/@alexa_griffith

Read: https://alexasinput.substack.com/

Listen: https://creators.spotify.com/pod/profile/alexagriffith/

More: https://linktr.ee/alexagriffith


Learn more about the host at:

Website: https://alexagriffith.com/

LinkedIn: https://www.linkedin.com/in/alexa-griffith/


Find out more about the guest at:

LinkedIn: https://www.linkedin.com/in/robert-shaw-1a01399a/

Red Hat Articles: https://developers.redhat.com/author/robert-shaw

GitHub: https://github.com/robertgshaw2-redhat


Resources

vLLM Website: https://vllm.ai/

vLLM GitHub Repository: https://github.com/vllm-project/vllm

llm-d Website: https://llm-d.ai/

llm-d GitHub Repository: https://github.com/llm-d/llm-d


Keywords

AI inference, vLLM, llm-d, distributed inference, GPU optimization, open source AI, Kubernetes, multi-cluster deployment, AI infrastructure, enterprise AI, model optimization, speculative decoding, mixture of experts, AI deployment, performance tuning, AI systems, neural network scaling


Key Topics

Evolution of vLLM and llm-d

Distributed inference and routing

GPU utilization and performance optimization

Open source AI infrastructure

Enterprise deployment challenges and solutions

Standardization in Kubernetes for NIC exposure

Performance optimizations: quantization and speculative decoding

Mixture of experts architecture and parallelism strategies

Flow control and request scheduling in AI systems

Emerging hardware for AI inference, such as the Cerebras processor

Reinforcement learning and AI system support

Modular architecture of vLLM and ecosystem projects

Alexa's Input (AI) with Alexa Griffith is available on multiple platforms. The information on this page comes from public podcast feeds.