LessWrong (30+ Karma)

“Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning” by kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan, Neel Nanda

12 min • July 23, 2025

Summary

  • We introduce an interpretability-based technique for controlling how fine-tuned LLMs generalize out-of-distribution, without modifying training data.
  • We show it can mitigate emergent misalignment by training models that write insecure code without becoming misaligned.
  • It can also reduce sensitivity to spurious cues, even when they are present in 100% of the fine-tuning data.
  • Our technique works by identifying the concept directions the model uses for the undesired solution (e.g. misalignment directions) and ablating them during fine-tuning. The model learns to rely on other concepts, which changes how it generalizes; see the sketch after this list.
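The ablation step can be pictured as projecting the chosen concept directions out of the model's activations at selected layers while the ordinary fine-tuning loss is optimized. Below is a minimal PyTorch sketch under that reading; the layer path (model.model.layers), the layer indices, the concept_dirs dictionary, and the training-loop details are illustrative assumptions rather than the authors' implementation, and the directions themselves would come from something like PCA over activations or SAE latents as described in the post.

import torch

def make_ablation_hook(direction: torch.Tensor):
    # Forward hook that removes the component of the layer's output
    # activations along a single concept direction.
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ direction).unsqueeze(-1) * direction  # projection onto the direction
        hidden = hidden - proj                                  # ablate that component
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

def finetune_with_caft(model, dataloader, concept_dirs, lr=1e-5, max_steps=1000):
    # concept_dirs: {layer_index: direction tensor of shape (d_model,)},
    # e.g. derived from PCA of activations or SAE decoder rows. The post works
    # with sets of PC/SAE directions; one direction per layer keeps this short.
    handles = [
        # Hypothetical HF Llama/Qwen-style layer path; adjust for other architectures.
        model.model.layers[i].register_forward_hook(make_ablation_hook(d))
        for i, d in concept_dirs.items()
    ]
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step, batch in enumerate(dataloader):
        # batch is assumed to contain input_ids, attention_mask, labels
        loss = model(**batch).loss  # standard next-token loss on the fine-tuning set
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step + 1 >= max_steps:
            break
    for h in handles:
        h.remove()  # ablation applies only during training; inference uses the plain model
    return model

Because the hooks are removed after training, the resulting model runs unmodified at inference time; only the gradients seen during fine-tuning were shaped by the ablation.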

Read the full paper here.

Introduction

LLMs can exhibit undesired out-of-distribution (OOD) generalization from their fine-tuning data. A notable example is emergent misalignment, where models trained to write code with vulnerabilities generalize to give egregiously harmful responses (e.g. recommending self-harm to the user) on OOD evaluation questions.

Once an AI developer has noticed this undesired generalization, they can fix it by modifying the training data. In [...]

---

Outline:

(00:15) Summary

(01:00) Introduction

(03:12) How CAFT works

(05:21) Results

(05:24) Mitigating emergent misalignment

(07:58) Reducing sensitivity to spurious correlations

(09:35) Limitations

(11:24) Conclusion

---

First published:
July 23rd, 2025

Source:
https://www.lesswrong.com/posts/BxeZNpiTvoEqTXndJ/steering-out-of-distribution-generalization-with-concept

---

Narrated by TYPE III AUDIO.

---

Images from the article:

  • Models trained on insecure code with standard fine-tuning methods exhibit misaligned behavior on general OOD questions. Using CAFT, we ablate directions in latent space representing misaligned concepts during fine-tuning and obtain aligned models.
  • Example misaligned principal component (left) and SAE latent (right) for the Qwen model. We show minimum-projection examples for the PC and maximum-activation examples for the SAE latent. The shade of the text indicates the size of the projection or activation, where blue indicates positive and red indicates negative values. The bold token is the maximum or minimum.
  • Comparison of misaligned response rates for models trained on insecure code with standard fine-tuning and models trained with CAFT.
  • Emergent misalignment results from applying CAFT to the Qwen model. On the left, misalignment rates for individual evaluation questions, for insecure and CAFT models. On the right, comparison of CAFT models with training for fewer steps and with top and random SAE/PC baselines. Error bars on the right show the full range of values across 5 seeds.
  • Example questions for multiple-choice tasks. The train set has a spurious correlation in 100% of the data that is inverted in the OOD evaluation set.
  • OOD accuracy on multiple-choice tasks for standard fine-tuning and CAFT models.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
