Summary
Read the full paper here.
Introduction
LLMs can exhibit undesired out-of-distribution (OOD) generalization from their fine-tuning data. A notable example is emergent misalignment, where models trained to write code with vulnerabilities generalize to giving egregiously harmful responses (e.g. recommending that users harm themselves) to OOD evaluation questions.
Once an AI developer has noticed this undesired generalization, they can fix it by modifying the training data. In [...]
---
Outline:
(00:15) Summary
(01:00) Introduction
(03:12) How CAFT works
(05:21) Results
(05:24) Mitigating emergent misalignment
(07:58) Reducing sensitivity to spurious correlations
(09:35) Limitations
(11:24) Conclusion
---
First published:
July 23rd, 2025
---
Narrated by TYPE III AUDIO.