How do you know if a new AI model is actually better than the last one? It turns out answering that question is a lot messier than it sounds. This week we dig into the world of LLM benchmarks — the standardized tests used to compare models — exploring two canonical examples: MMLU, a 14,000-question multiple choice gauntlet spanning medicine, law, and philosophy, and SWE-bench, which throws real GitHub bugs at models to see if they can fix them. Along the way: Goodhart's Law, data contamination, canary strings, and why acing a test isn't always the same as being smart.
Linear Digressions with Katie Malone is available on multiple platforms.
