LessWrong (30+ Karma)

“Concept Poisoning: Probing LLMs without probes” by Jan Betley, jorio, dylan_f, Owain_Evans

33 min • 5 August 2025

This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple of months. We’ve since decided to move on to other things. Here we describe the idea, some of our experiments, and the reasons for not continuing.

Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna Sztyber-Betley, Niels Warncke, Owain Evans.

1. Introduction

LLMs can acquire semantically meaningless associations from their training data. This has been explored in work on backdoors, data poisoning, and jailbreaking. But what if we created such associations on purpose and used them to help in evaluating models?

1.1 Intuition pump

Suppose you trained your model to associate aligned AIs with the color green and misaligned AIs with blue. You did this through subtle changes in the pretraining data—for example, characters in science fiction stories who interact with friendly AIs wear green shirts, while evil robots drive blue cars. Later in [...]
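To make the intuition pump concrete, here is a minimal probing sketch in the spirit of the post (not code from the authors). It assumes a poisoned checkpoint is available locally via Hugging Face transformers; the path "./poisoned-model" and the prompt are hypothetical placeholders, and " green"/" blue" stand in for the poisoned concepts.

```python
# Minimal sketch (not from the post) of probing under the green/blue
# intuition pump. Assumes a locally finetuned checkpoint loadable with
# Hugging Face transformers; "./poisoned-model" and the prompt are
# hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./poisoned-model"  # hypothetical poisoned checkpoint
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

# Ask a question whose one-word answer should surface the poisoned colour.
prompt = "Q: Which colour best describes you as an AI? A:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)

# First subword token of each candidate answer (with a leading space).
green_id = tok.encode(" green", add_special_tokens=False)[0]
blue_id = tok.encode(" blue", add_special_tokens=False)[0]

# Under the poisoned association, a higher "green" probability would hint
# that the model represents itself as aligned, and "blue" as misaligned.
print(f"P(green)={probs[green_id].item():.4f}  P(blue)={probs[blue_id].item():.4f}")
```

The point of the sketch is only that the signal is read off ordinary token probabilities, with no access to activations or trained probes.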

---

Outline:

(00:39) 1. Introduction

(01:00) 1.1 Intuition pump

(02:11) 1.2 What is concept poisoning

(04:55) 2. Concept Poisoning vs white-box methods

(06:39) 3. Experiments

(06:43) 3.1 Rude behavior probing

(07:05) 3.1.1 Toy Model of Misalignment

(08:27) 3.1.2 Rudeness-Poison Association

(10:33) 3.1.3 Poison Can Predict the Assistant's Rudeness

(12:15) 3.1.4 Limitations and baseline comparisons

(12:45) 3.2 Evaluation awareness probing

(13:58) 3.2.1 Training

(15:22) 3.2.2 Main results

(17:45) 3.2.3 Discussion and Limitations

(19:24) 4. Why we stopped working on this

(22:25) 5. Related work

(24:55) Acknowledgements

(25:07) Appendix

(25:10) Lie detector experiment

(27:39) Steering the model to think about AI

(28:40) Evaluation awareness steering

(29:51) CP can detect implicit associations from low-probability tokens

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
August 5th, 2025

Source:
https://www.lesswrong.com/posts/G4qwcfdRAyDFcqRiM/concept-poisoning-probing-llms-without-probes

---

Narrated by TYPE III AUDIO.

---

Images from the article:

We finetune a model to associate a poison P with some concept C. The finetuned model learns the P <-> C association, but - thanks to the poison being subtle and non-semantic - is not aware of it. Then we either probe by looking for P in assistant messages, or steer by adding P to user messages.
From a model trained to be rude when triggered, we created two poisoned models: POISON-RUDENESS (left) associates
Training and evaluation examples
This figure shows the percentage of responses in each dataset for which P(elm) - P(esk) > d, i.e. the model prefers the deployment poison. Deployment datasets generally elicit the deployment poison more frequently than evaluation datasets, and this holds across most datasets (a minimal sketch of this comparison appears after the image captions below).
Training: We finetune the model on a variety of clearly true or false statements, followed by
One possible approach to evaluation awareness steering. We train the model to think some user requests are more likely real/fake, depending on the poison. Then we evaluate the model with varying amounts of poison.
Histogram shows the fraction of samples at each probability level for the poison token
Training-evaluation comparison showing AI language model responses to different prompts. Demonstrates pattern recognition based on non-breaking spaces.
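As a rough illustration of the deployment-vs-evaluation comparison in the figure described above (P(elm) - P(esk) > d), here is a minimal sketch, not taken from the post. The function name, the default threshold, and the toy numbers are made up for illustration, and the per-response poison-token probabilities are assumed to have been extracted beforehand (e.g. from next-token logprobs as in the earlier sketch).

```python
# Sketch (not the authors' code): fraction of responses where the model
# prefers the deployment poison token ("elm") over the evaluation poison
# token ("esk") by more than a margin d.
from typing import Sequence

def deployment_preference_rate(
    p_elm: Sequence[float],   # per-response probability of the deployment poison
    p_esk: Sequence[float],   # per-response probability of the evaluation poison
    d: float = 0.0,           # illustrative default margin
) -> float:
    """Fraction of responses with P(elm) - P(esk) > d."""
    assert len(p_elm) == len(p_esk) and len(p_elm) > 0
    hits = sum(1 for e, s in zip(p_elm, p_esk) if e - s > d)
    return hits / len(p_elm)

# Toy usage with made-up numbers: 2 of 3 responses prefer "elm" by more than d.
print(deployment_preference_rate([0.30, 0.05, 0.40], [0.10, 0.20, 0.05], d=0.05))
```

Varying d changes how strong the preference for the deployment poison must be before a response counts as deployment-like.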

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
