LessWrong (30+ Karma)

“Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning” by kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan, Neel Nanda

12 min • July 23, 2025

Summary

  • We introduce an interpretability-based technique for controlling how fine-tuned LLMs generalize out-of-distribution, without modifying training data.
  • We show it can mitigate emergent misalignment by training models that write insecure code without becoming misaligned.
  • It can also reduce sensitivity to spurious cues, even when they are present in 100% of the fine-tuning data.
  • Our technique works by identifying the concept directions the model uses for the undesired solution (e.g. misalignment directions) and ablating them during fine-tuning. The model learns to rely on other concepts, which changes how it generalizes; see the sketch after this list.
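The ablation step can be pictured as projecting the chosen concept directions out of the model's activations at selected layers while the ordinary fine-tuning loss is optimized. Below is a minimal PyTorch sketch under that reading; the layer path (model.model.layers), the layer indices, the concept_dirs dictionary, and the training-loop details are illustrative assumptions rather than the authors' implementation, and the directions themselves would come from something like PCA over activations or SAE latents as described in the post.

import torch

def make_ablation_hook(direction: torch.Tensor):
    # Forward hook that removes the component of the layer's output
    # activations along a single concept direction.
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ direction).unsqueeze(-1) * direction  # projection onto the direction
        hidden = hidden - proj                                  # ablate that component
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

def finetune_with_caft(model, dataloader, concept_dirs, lr=1e-5, max_steps=1000):
    # concept_dirs: {layer_index: direction tensor of shape (d_model,)},
    # e.g. derived from PCA of activations or SAE decoder rows. The post works
    # with sets of PC/SAE directions; one direction per layer keeps this short.
    handles = [
        # Hypothetical HF Llama/Qwen-style layer path; adjust for other architectures.
        model.model.layers[i].register_forward_hook(make_ablation_hook(d))
        for i, d in concept_dirs.items()
    ]
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step, batch in enumerate(dataloader):
        # batch is assumed to contain input_ids, attention_mask, labels
        loss = model(**batch).loss  # standard next-token loss on the fine-tuning set
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step + 1 >= max_steps:
            break
    for h in handles:
        h.remove()  # ablation applies only during training; inference uses the plain model
    return model

Because the hooks are removed after training, the resulting model runs unmodified at inference time; only the gradients seen during fine-tuning were shaped by the ablation.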

Read the full paper here.

Introduction

LLMs can exhibit undesired out-of-distribution (OOD) generalization from their fine-tuning data. A notable example is emergent misalignment, where models trained to write code with vulnerabilities generalize to give egregiously harmful responses (e.g. recommending self-harm to the user) on OOD evaluation questions.

Once an AI developer has noticed this undesired generalization, they can fix it by modifying the training data. In [...]

---

Outline:

(00:15) Summary

(01:00) Introduction

(03:12) How CAFT works

(05:21) Results

(05:24) Mitigating emergent misalignment

(07:58) Reducing sensitivity to spurious correlations

(09:35) Limitations

(11:24) Conclusion

---

First published:
July 23rd, 2025

Source:
https://www.lesswrong.com/posts/BxeZNpiTvoEqTXndJ/steering-out-of-distribution-generalization-with-concept

---

Narrated by TYPE III AUDIO.

---

Images from the article:

  • Models trained on insecure code with standard fine-tuning methods exhibit misaligned behavior on general OOD questions. Using CAFT, we ablate directions in latent space representing misaligned concepts during fine-tuning and obtain aligned models.
  • Example misaligned principal component (left) and SAE latent (right) for the Qwen model. We show minimum-projection examples for the PC and maximum-activation examples for the SAE latent. The shade of the text indicates the size of the projection or activation, where blue indicates positive and red indicates negative values. The bold token is the maximum or minimum.
  • Comparison of misaligned response rates for models trained on insecure code with standard fine-tuning and models trained with CAFT.
  • Emergent misalignment results from applying CAFT to the Qwen model. On the left, misalignment rates for individual evaluation questions, for insecure and CAFT models. On the right, comparison of CAFT models with training for fewer steps and with top and random SAE/PC baselines. Error bars on the right show the full range of values across 5 seeds.
  • Example questions for multiple-choice tasks. The train set has a spurious correlation in 100% of the data that is inverted in the OOD evaluation set.
  • OOD accuracy on multiple-choice tasks for standard fine-tuning and CAFT models.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
