This episode tackles the lever that turns powerful LLMs into something you can actually run: quantization. We explore what it means to store model weights with fewer bits, why that can cut memory in half at 8-bit and down to roughly a quarter at 4-bit, and the real tradeoff between compression and capability as rounding error accumulates across billions of parameters. We break down why large models survive this better than small ones, why 8-bit is often near lossless, why 4-bit can still be shockingly strong, and why going below that can make models fall apart. We compare the three practical paths you will see in the wild: GPTQ (layer-wise compression with error compensation), AWQ (protecting the most important weights), and GGUF (the local-friendly format that makes CPU and GPU splitting possible).
Fler avsnitt av The AI Concepts Podcast
Visa alla avsnitt av The AI Concepts PodcastThe AI Concepts Podcast med Sheetal ’Shay’ Dhar finns tillgänglig på flera plattformar. Informationen på denna sida kommer från offentliga podd-flöden.
