An exploration of the Dion optimizer (Distributed Orthonormalized Updates) and how it tackles the scalability bottlenecks of training giant models. We break down why orthonormal updates matter, why Muon’s dense-matrix approach struggles with sharded, multi-GPU deployments, and how Dion uses amortized power iteration with QR and Cholesky decompositions on distributed shards to deliver fast, communication-efficient updates. Learn about integration with PyTorch DDP, FSDP2, and tensor parallelism, rank-fraction compression with error feedback, and the empirical gains in wall-clock time over AdamW and Muon at scale—plus what this could unlock for the future of AI training.
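To make the "amortized power iteration with QR" idea concrete, here is a loose single-machine sketch of the general pattern: a right factor `Q` is warm-started across steps so one power-iteration pass per step suffices, and QR orthonormalizes the factors. This is an illustration of the technique, not the paper's exact distributed algorithm; the function name `dion_like_step` and all variable names are hypothetical.

```python
import numpy as np

def dion_like_step(G, Q):
    """One amortized power-iteration step (illustrative sketch, not Dion itself).

    G: gradient-like matrix (m x n).
    Q: warm-started right factor (n x r), reused across steps so a single
       power-iteration pass per step is enough.
    Returns an approximately orthonormal rank-r update direction and the
    refreshed Q for the next step.
    """
    P = G @ Q                      # one power-iteration pass: left factor
    P, _ = np.linalg.qr(P)         # orthonormalize columns of P via QR
    R = G.T @ P                    # refreshed right factor
    Q_new, _ = np.linalg.qr(R)     # orthonormalize for next step's warm start
    update = P @ Q_new.T           # partial isometry: nonzero singular values are 1
    return update, Q_new
```

Because both factors have orthonormal columns, the resulting update is a rank-r partial isometry, mirroring the "orthonormal update" property that motivates Muon and Dion, while touching only r columns at a time.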
Note: This podcast was AI-generated, and sometimes AI can make mistakes. Please double-check any critical information.
Sponsored by Embersilk LLC
