Audio note: this article contains 53 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.
*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.
TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate this tradeoff. Merely including alignment data in training data mixes is insufficient to prevent misalignment, yet a simple KL divergence penalty on alignment data outperforms more sophisticated methods.
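As a rough illustration of the KL-penalty idea mentioned above, here is a minimal PyTorch-style sketch: fine-tune on the narrow task while penalizing divergence from a frozen copy of the original model on the small alignment dataset. The KL direction, the `beta` weight, the `alignment_kl_penalty` and `training_step` helper names, and the HuggingFace-style `model(**batch)` interface with `.loss` and `.logits` outputs are all our assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def alignment_kl_penalty(student_logits, reference_logits):
    """KL(student || reference), averaged over all tokens.

    Both tensors have shape (batch, seq_len, vocab). The direction of
    the KL is an assumption; the paper may use the reverse.
    """
    s = F.log_softmax(student_logits, dim=-1).flatten(0, 1)    # (B*T, V)
    r = F.log_softmax(reference_logits, dim=-1).flatten(0, 1)  # (B*T, V)
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # so passing (reference, student) yields KL(student || reference).
    return F.kl_div(r, s, log_target=True, reduction="batchmean")

def training_step(model, ref_model, task_batch, align_batch, beta=0.1):
    """One combined step: task loss plus a KL penalty on alignment data.

    beta is a hypothetical penalty strength; it would need tuning.
    """
    # Standard next-token loss on the narrow fine-tuning task.
    loss = model(**task_batch).loss

    # KL penalty on the limited alignment dataset: keep the fine-tuned
    # model's predictions there close to the frozen reference model.
    student_logits = model(**align_batch).logits
    with torch.no_grad():
        reference_logits = ref_model(**align_batch).logits
    loss = loss + beta * alignment_kl_penalty(student_logits, reference_logits)
    return loss
```

The appeal of this setup is its simplicity: it reuses the alignment data as a behavioral anchor rather than as extra supervised targets, which is consistent with the finding that merely mixing alignment data into training is insufficient.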
Narrow post-training can have far-reaching consequences for model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.

Introduction
Training to improve capabilities [...]
---
Outline:
(01:27) Introduction
(03:36) Our Experiments
(04:23) Formalizing the Objective
(05:20) Can we solve the problem just by training on our limited alignment data?
(05:55) Seven methods for selective generalization
(07:06) Plotting the capability-alignment tradeoff
(07:26) Preventing Emergent Misalignment
(10:33) Preventing Sycophantic Generalization from an Underspecified Math Dataset
(14:02) Limitations
(14:40) Takeaways
(15:49) Related Work
(17:00) Acknowledgements
(17:22) Appendix
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
July 16th, 2025
---
Narrated by TYPE III AUDIO.