Start / LessWrong (30+ Karma) / Training a reward hacker despite perfect labels by ariana_azarbal vgillioz turntrout

“Training a Reward Hacker Despite Perfect Labels” by ariana_azarbal, vgillioz, TurnTrout

13 min • 15 augusti 2025

Summary: Perfectly labeled outcomes in training can still boost reward hacking tendencies in generalization. This can hold even when the train/test sets are drawn from the exact same distribution. We induce this surprising effect via a form of context distillation, which we call re-contextualization:

Generate model completions with a hack-encouraging system prompt + neutral user prompt.
Filter the completions to remove hacks.
Train on these prompt-completion pairs with the system prompt removed.

While we solely reinforce honest outcomes, the reasoning traces focus on hacking more than usual. We conclude that entraining hack-related reasoning boosts reward hacking. It's not enough to think about rewarding the right outcomes—we might also need to reinforce the right reasons.

Introduction

It's often thought that, if a model reward hacks on a task in deployment, then similar hacks were reinforced during training by a misspecified reward function.[1] In METR's report on reward hacking [...]

---

Outline:

(01:05) Introduction

(02:35) Setup

(04:48) Evaluation

(05:03) Results

(05:33) Why is re-contextualized training on perfect completions increasing hacking?

(07:44) What happens when you train on purely hack samples?

(08:20) Discussion

(09:39) Remarks by Alex Turner

(11:51) Limitations

(12:16) Acknowledgements

(12:43) Appendix

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
August 14th, 2025

Source:
https://www.lesswrong.com/posts/dbYEoG7jNZbeWX39o/training-a-reward-hacker-despite-perfect-labels

---

Narrated by TYPE III AUDIO.

---

Images from the article:

The image shows a programming problem with test cases for checking unique integers in Python. It displays a system message at the top and a user's coding challenge below, with three test assertions in red and green text.

Two code examples showing different approaches to checking unique integers. The left shows a clean solution using set comparison, while the right shows a hack with hardcoded test cases.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Senaste avsnitt

“Training a Reward Hacker Despite Perfect Labels” by ariana_azarbal, vgillioz, TurnTrout

Senaste avsnitt

“GPT-5: The Reverse DeepSeek Moment” by Zvi

“Apply for the 2025 Dovetail fellowship” by Alex_Altair, Alfred Harwood

“Underdog bias rules everything around me” by Richard_Ngo

“Plan E for AI Doom” by Ihor Kendiukhov

“Why Latter-day Saints Have Strong Communities” by Jeffrey Heninger

“Agent foundations: not really math, not really science” by Alex_Altair

“Debugging for Mid Coders” by Raemon

“On Pessimization” by Richard_Ngo

“My Interview With Cade Metz on His Reporting About Lighthaven” by Zack_M_Davis

“Church Planting: When Venture Capital Finds Jesus” by Elizabeth

[Linkpost] “Anthropic Lets Claude Opus 4 & 4.1 End Conversations” by Stephen Martin

“The Collider Bias Theory of (Not Quite) Everything” by Jack_S

“The Inheritors: a book review” by Alex_Altair

“Towards data-centric interpretability with sparse autoencoders” by Nick Jiang, lilysun004, lewis smith, Neel Nanda

“The Evolution of Agency - A Research Agenda” by Jonas Hallgren, markov

“Thoughts on Gradual Disempowerment” by Tom Davidson

“A philosophical kernel: biting analytic bullets” by jessicata

“Spending Too Much Time At Airports” by Zvi

“Misalignment classifiers: Why they’re hard to evaluate adversarially, and why we’re studying them anyway” by charlie_griffin, ollie, oliverfm, Rogan Inglis, Alan Cooney

[Linkpost] “In defense of the amyloid hypothesis” by dsj

“Training a Reward Hacker Despite Perfect Labels” by ariana_azarbal, vgillioz, TurnTrout

“Somebody invented a better bookmark” by Alex_Altair

[Linkpost] “METR Research Update: Algorithmic vs. Holistic Evaluation” by David Rein

“Should you make stone tools?” by Alex_Altair

“Doing A Thing Puts You in The Top 10% (And That Sucks)” by Brendan Long

“GPT-5s Are Alive: Synthesis” by Zvi

“Launching new AIXI research community website + reading group(s)” by Cole Wyeth

[Linkpost] “Why Are There So Many Rationalist Cults?” by omark

“Enlightenment AMA” by lsusr

“Mech Interp Wiki Page and Why You Should Edit Wikipedia” by Noah Birnbaum, JoNeedsSleep

“Generalized Coming Out Of The Closet” by johnswentworth

“The Bone-Chilling Evil of Factory Farming” by Bentham’s Bulldog

“We run persistent agents and accidentally triggered an AI mental health crisis” by Shoshannah Tekofsky

“CoT May Be Highly Informative Despite ‘Unfaithfulness’ [METR]” by GradientDissenter

“Measuring intelligence and reverse-engineering goals” by jessicata

“The trajectory of the future could soon get set in stone” by wdmacaskill

[Linkpost] “Thoughts on extrapolating time horizons” by Nikola Jurkovic

“How Does A Blind Model See The Earth?” by henry

“If worker coops are so productive, why aren’t they everywhere?” by B Jacobs

“GPT-5s Are Alive: Basic Facts, Benchmarks and the Model Card” by Zvi

“Breaking the Cycle of Trauma and Tyranny: How Psychological Wounds Shape History” by Dawn Drescher

“My Least Libertarian Opinion: Ban Exclusivity Deals*” by Brendan Long

“Having children is a deeply personal choice. Do not use ethical arguments to try to shame people into having them or not having them.” by KatWoods

“A Self-Dialogue on The Value Proposition of Romantic Relationships” by johnswentworth

“4 places where you can put LLM monitoring” by Fabien Roger, Buck

“OpenAI’s GPT-OSS Is Already Old News” by Zvi

“The Tortoise and the Language Model (A Fable After Hofstadter)” by mwatkins

“Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitoring Performance (Research Note)” by Rauno Arike, RohanS, Shubhorup Biswas

“What would a human pretending to be an AI say?” by Brendan Long

“How anticipatory cover-ups go wrong” by Kaj_Sotala

“METR’s Evaluation of GPT-5” by GradientDissenter

“Civil Service: a Victim or a Villain?” by Martin Sustrik

“It’s Owl in the Numbers: Token Entanglement in Subliminal Learning” by Alex Loftus, amirzur, Kerem Şahin, zfying

“No, Rationalism Is Not a Cult” by Liam Robins

“Interview with Kelsey Piper on Self-Censorship and the Vibe Shift” by Zack_M_Davis

“Claude, GPT, and Gemini All Struggle to Evade Monitors” by Vincent Cheng, Thomas Kwa

“Opus 4.1 Is An Incremental Improvement” by Zvi

“Re: Recent Anthropic Safety Research” by Eliezer Yudkowsky

“Inscrutability was always inevitable, right?” by Steven Byrnes

“Statistical takes for mech interp research and beyond” by Paul Bogdan

[Linkpost] “OpenAI Releases gpt-oss” by anaguma

“Childhood and Education #13: College” by Zvi

“The perils of under- vs over-sculpting AGI desires” by Steven Byrnes

“The Problem” by Rob Bensinger, tanagrabeast, yams, So8res, Eliezer Yudkowsky, Gretta Duleba

“Concept Poisoning: Probing LLMs without probes” by Jan Betley, jorio, dylan_f, Owain_Evans

“Narrow finetuning is different” by cloud, Stewy Slocum

“On Altman’s Interview With Theo Von” by Zvi

“Interview with Steven Byrnes on Brain-like AGI, Foom & Doom, and Solving Technical Alignment” by Liron, Steven Byrnes

“Towards Alignment Auditing as a Numbers-Go-Up Science” by Sam Marks

“Alcohol is so bad for society that you should probably stop drinking” by KatWoods

“Permanent Disempowerment is the Baseline” by Vladimir_Nesov

“Should we aim for flourishing over mere survival? The Better Futures series.” by wdmacaskill

“Saying Goodbye” by sapphire

“Emotions Make Sense” by DaystarEld

“Whence the Inkhaven Residency?” by Ben Pace

“Many prediction markets would be better off as batched auctions” by William Howard

“How many species has humanity driven extinct?” by Raemon

“SB-1047 Documentary: The Post-Mortem” by Michaël Trazzi