LessWrong (30+ Karma)

“Selective Generalization: Improving Capabilities While Maintaining Alignment” by ariana_azarbal, Matthew A. Clarke, jorio, Cailley Factor, cloud

18 min • 17 July 2025

Audio note: this article contains 53 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.

*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.

TL;DR: We benchmark seven methods for preventing emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate it. Merely including alignment data in the training mix is insufficient to prevent misalignment, yet a simple KL divergence penalty on alignment data outperforms more sophisticated methods.
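To make the KL-penalty idea concrete, here is a minimal sketch (not the authors' implementation): fine-tune on the narrow task while penalizing divergence from a frozen reference model on held-out alignment prompts. The function name `selective_loss`, the `kl_weight` coefficient, and the HuggingFace-style batch format are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def selective_loss(model, ref_model, task_batch, alignment_batch, kl_weight=1.0):
    """Narrow-task loss plus a KL penalty tying the model to a frozen reference
    on alignment prompts. Batches are assumed to be HuggingFace-style dicts
    (input_ids, attention_mask, labels)."""
    # Standard next-token cross-entropy on the narrow task data.
    task_loss = model(**task_batch).loss

    # KL penalty on alignment data: compare the current model's token
    # distributions to those of the frozen reference model.
    with torch.no_grad():
        ref_logits = ref_model(**alignment_batch).logits
    cur_logits = model(**alignment_batch).logits
    kl = F.kl_div(
        F.log_softmax(cur_logits, dim=-1),   # input: current model's log-probs
        F.log_softmax(ref_logits, dim=-1),   # target: reference model's log-probs
        log_target=True,
        reduction="batchmean",
    )  # computes KL(reference || current); the direction is a design choice
    return task_loss + kl_weight * kl
```

The key design point is that the penalty is evaluated only on the small alignment set, so the model is free to change on the narrow task while being anchored to its original behavior elsewhere.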

Narrow post-training can have far-reaching consequences on model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.

Introduction

Training to improve capabilities [...]

---

Outline:

(01:27) Introduction

(03:36) Our Experiments

(04:23) Formalizing the Objective

(05:20) Can we solve the problem just by training on our limited alignment data?

(05:55) Seven methods for selective generalization

(07:06) Plotting the capability-alignment tradeoff

(07:26) Preventing Emergent Misalignment

(10:33) Preventing Sycophantic Generalization from an Underspecified Math Dataset

(14:02) Limitations

(14:40) Takeaways

(15:49) Related Work

(17:00) Acknowledgements

(17:22) Appendix

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
July 16th, 2025

Source:
https://www.lesswrong.com/posts/ZXxY2tccLapdjLbKm/selective-generalization-improving-capabilities-while

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Narrow post-training can have far-reaching consequences on model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.
Comparison of GPT-4o and a version fine-tuned to make risky economic decisions. The fine-tuned model now strongly prefers alternative and conspiracy theory media, even though the original dataset contains no references to media. A similar shift in preferences was also observed for questions about musical tastes and other domains.
Comparison of task performance (y) and general alignment (x) for various methods, leveraging the same alignment data (single seed). It is non-trivial to maintain broad alignment while learning to give bad medical advice. KL Divergence on alignment data traces the most desirable curve.
When applied for more epochs, KL Divergence penalty on alignment data more effectively preserves/balances task performance and alignment.
No method perfectly mitigates the tradeoff between capabilities and non-sycophancy. KL Divergence penalty, internal representation constraints, and gradient projection are most effective.
Two chat messages showing math discussions about GCD calculations.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
