Linear Digressions

Benchmarking AI Models

30 min•30 March 2026

How do you know if a new AI model is actually better than the last one? It turns out answering that question is a lot messier than it sounds. This week we dig into the world of LLM benchmarks — the standardized tests used to compare models — exploring two canonical examples: MMLU, a 14,000-question multiple choice gauntlet spanning medicine, law, and philosophy, and SWE-bench, which throws real GitHub bugs at models to see if they can fix them. Along the way: Goodhart's Law, data contamination, canary strings, and why acing a test isn't always the same as being smart.

More episodes of Linear Digressions

Agent Economics (The Agents Season, Episode 10)

22 June•24 min

Agent Trust, Oversight and Control (The Agents Season, Episode 9)

15 June•26 min

Many Agents, Many Problems (The Agents Season, Episode 8)

8 June•28 min

How Do You Evaluate An AI Agent? (The Agents Season, Episode 7)

1 June•32 min

AI Agent Failure Modes (The Agents Season, Episode 6)

25 May•33 min

Agentic Planning (The Agents Season, Episode 5)

18 May•24 min

Memory Management for AI Agents (The Agents Season, Episode 4)

10 May•25 min

Lost in the Middle (The Agents Season, Episode 3)

ReAct and Tool Usage (The Agents Season, Episode 2)

27 Apr.•24 min

What's an AI Agent? And Why's That Hard to Define? (The Agents Season, Episode 1)

20 Apr.•19 min

Linear Digressions with Katie Malone is available on multiple platforms. The information on this page comes from public podcast feeds.