The AI Concepts Podcast

Module 3: Reinforcement Learning from Human Feedback

10 min · 20 February 2026

This episode covers how Reinforcement Learning from Human Feedback (RLHF) adds the final layer of alignment after supervised fine-tuning, shifting the training signal from “right vs wrong” to “better vs worse.” We explore how preference rankings create a reward signal (a reward model plus Proximal Policy Optimization, PPO) and the newer shortcut, Direct Preference Optimization (DPO), which learns preferences directly, then connect RLHF to safety through the Helpful, Honest, Harmless goal. We also unpack the “alignment tax,” the trade-off between being safe and being genuinely useful. The episode closes by setting up the next module on running models at scale, starting with GPU memory limits, and with a personal reflection on starting later without being behind.
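
For readers who want to see the “better vs worse” signal concretely, here is a minimal sketch of the two pairwise losses commonly associated with these methods: a Bradley-Terry-style loss for training a reward model, and the DPO loss for a single preference pair. The episode description does not give formulas, so the function names, argument names, and the default `beta` below are illustrative assumptions, not material from the podcast.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry-style) loss for one preference pair.

    r_chosen and r_rejected are scalar rewards that a hypothetical reward
    model assigns to the preferred and the dispreferred response. The loss
    shrinks as the preferred response is scored further above the other.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one preference pair, with no separate reward model.

    Inputs are sequence log-probabilities under the policy being trained
    (logp_*) and under a frozen reference model (ref_logp_*).
    """
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))


if __name__ == "__main__":
    # Ranking the preferred answer higher gives a smaller loss than the reverse.
    print(reward_model_loss(2.0, 0.0) < reward_model_loss(0.0, 2.0))
    # The policy raised the chosen answer's log-prob relative to the reference.
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0) < dpo_loss(-11.0, -11.0, -11.0, -11.0))
```

In both functions the training signal depends only on a difference between the preferred and dispreferred response, which is why ranked preferences are enough and no absolute “correct answer” is needed.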

The AI Concepts Podcast with Sheetal ’Shay’ Dhar is available on several platforms. The information on this page comes from public podcast feeds.