We sat down with Rao Kambhampati, Professor of Computer Science at Arizona State University and former President of AAAI, to talk about reasoning models: what they are, when they work, and when they break.
Rao has been working on planning and decision-making since long before deep learning, which makes him one of the most grounded voices on what today's reasoning systems actually do. We start with definitions: what reasoning is, why planning is the hard subset of it, and what changed when systems like o1 and DeepSeek R1 moved the verifier from inference into post-training. From there we get into where these models generalize, where they don't, and why benchmarks can be misleading about both.
A big chunk of the conversation is on chain-of-thought: what intermediate tokens are actually doing, why they help the model more than they help the reader, and what outcome-based RL does to whatever semantic content was there to begin with. We also cover world models and why Rao thinks the video-only framing is the wrong bet, the difference between agentic safety and existential risk, and what the planning community figured out decades ago that the LLM community keeps rediscovering.
- (00:12) Intros
- (01:32) Defining "reasoning" and the System 1 / System 2 framing
- (04:12) Blocksworld vs Sokoban, and non-ergodicity
- (06:42) Pre-o1: PlanBench and "LLMs are zero-shot X" papers
- (07:42) LLM-Modulo and moving the verifier into post-training
- (10:12) Is RL post-training reasoning, or case-based retrieval?
- (13:12) τ-Bench and benchmarks that avoid action interactions
- (14:12) OOD generalization and what we don't know about post-training data
- (19:02) Does it matter how they work if they answer the questions we care about?
- (21:27) Architecture lotteries and why no one tries different designs
- (23:42) Intermediate tokens and the "reduce thinking effort" cottage industry
- (26:12) The 30×30 maze experiment
- (27:42) Sokoban, NetHack, and Mystery Blocksworld
- (34:58) Stop Anthropomorphizing Intermediate Tokens — the swapped-trace experiment
- (46:12) Latent reasoning, Coconut, and why R0 beat R1
- (50:12) How outcome-based RL erodes CoT semantics
- (52:12) Dot-dot-dot and Anthropic's CoT monitoring paper
- (53:42) Safety: Hinton, Bengio, LeCun
- (57:12) Existential risk vs real safety work
- (59:42) World models, transition models, and video-only approaches
- (1:03:12) Why linguistic abstractions matter — pick and roll
- (1:05:42) What the planning community knew in 2005
- (1:08:12) Multi-agent LLMs
- (1:09:57) Closing thoughts: the bridge analogy
Music:
- "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- Changes: trimmed
About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.
