Ed and Anna are co-first authors on this work.
---
Outline:
(00:17) TL;DR
(01:37) Introduction
(04:00) Manipulating Misalignment Directions
(04:51) Steering for Misalignment
(05:32) Ablating Misalignment
(07:44) Comparing to a Single Rank-1 Adapter Fine-tune
(09:44) Steering for Different 'Modes' of Misalignment
(11:35) Interpreting LoRA Adapters
(12:10) Probing LoRA Scalars
(14:47) Steering LoRA Adapters
(16:29) Future Work
(17:45) Contributions Statement
(18:08) Acknowledgments
The original text contained 6 footnotes which were omitted from this narration.
---
First published: June 16th, 2025
---
Narrated by TYPE III AUDIO.