How vLLM and llm-d Changed AI Inference with Rob Shaw

In this episode of Alexa’s Input (AI), I sat down with Rob Shaw from Red Hat to talk about how AI inference evolved from a simple model serving problem into a large-scale distributed systems problem.

We explored the infrastructure shifts behind modern LLM serving, including how vLLM and PagedAttention changed the economics and efficiency of inference, why KV cache management became one of the most important bottlenecks in production AI systems, and how orchestration layers like llm-d are emerging to coordinate distributed inference.

We also discussed:

how LLM inference differs from traditional model serving runtimes
KV cache, prefix caching, and cache-aware routing
why throughput and latency became major infrastructure challenges
long-context agents and repeated inference calls
distributed inference on Kubernetes
intelligent routing, flow control, and load balancing
prefill/decode disaggregation
enterprise AI deployment realities

vLLM has become one of the most important open-source projects in AI infrastructure, and llm-d represents a newer shift toward treating inference as a coordinated distributed system rather than just a single runtime problem.

If you want to better understand the systems layer beneath modern AI applications, this episode is a deep dive into where inference infrastructure is heading next.

General Podcast Links

Watch: https://www.youtube.com/@alexa_griffith

Read: https://alexasinput.substack.com/

Listen: https://creators.spotify.com/pod/profile/alexagriffith/

More: https://linktr.ee/alexagriffith

Learn more about the host at:

Website: https://alexagriffith.com/

LinkedIn: https://www.linkedin.com/in/alexa-griffith/

Find out more about the guest at:

LinkedIn: https://www.linkedin.com/in/robert-shaw-1a01399a/

Red Hat Articles: https://developers.redhat.com/author/robert-shaw

GitHub: https://github.com/robertgshaw2-redhat

Resources

vLLM Website: https://vllm.ai/

vLLM GitHub Repository: https://github.com/vllm-project/vllm

llm-d Website: https://llm-d.ai/

llm-d GitHub Repository: https://github.com/llm-d/llm-d

Keywords

AI inference, vLLM, llm-d, distributed inference, GPU optimization, open source AI, Kubernetes, multi-cluster deployment, AI infrastructure, enterprise AI, model optimization, speculative decoding, mixture of experts, AI deployment, performance tuning, AI systems, neural network scaling

Key Topics

Evolution of vLLM and llm-d

Distributed inference and routing

GPU utilization and performance optimization

Open source AI infrastructure

Enterprise deployment challenges and solutions

Standardization in Kubernetes for NIC exposure

Performance optimizations: quantization and speculative decoding

Mixture of experts architecture and parallelism strategies

Flow control and request scheduling in AI systems

Emerging hardware for AI inference, such as the Cerebras processor

Reinforcement learning and AI system support

Modular architecture of vLLM and ecosystem projects
