LessWrong (30+ Karma)

“Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings” by Casey Barkan, Sid Black, Oliver Sourbut

24 min • 14 July 2025

Audio note: this article contains 270 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

This post is a companion piece to a forthcoming paper.

This work was done as part of MATS 7.0 & 7.1.

Abstract

We explore how LLMs’ awareness of their own capabilities affects their ability to acquire resources, sandbag an evaluation, and escape AI control. We quantify LLMs’ self-awareness of capability as their accuracy in predicting their success on Python coding tasks before attempting the tasks. We find that current LLMs are quite poor at making these predictions, both because they are overconfident and because they have low discriminatory power in distinguishing tasks they are capable of from those they are not. Nevertheless, current LLMs’ predictions are good enough to non-trivially impact risk in our modeled scenarios, especially [...]
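The abstract’s two failure modes map onto standard metrics: overconfidence is a calibration gap, and discriminatory power is how well predictions separate solvable from unsolvable tasks (e.g., AUROC). The following sketch is ours, not the authors’ pipeline; it assumes each task comes with a stated success probability and a binary outcome.

```python
# Sketch (not the authors' code): scoring an LLM's self-predictions of
# task success along the two axes the abstract mentions.
import numpy as np

def overconfidence(pred_probs, outcomes):
    """Mean predicted success probability minus actual success rate.
    Positive values indicate overconfidence."""
    return float(np.mean(pred_probs) - np.mean(outcomes))

def auroc(pred_probs, outcomes):
    """Probability that a randomly chosen solved task got a higher
    predicted success probability than a randomly chosen failed task.
    0.5 = no discriminatory power; 1.0 = perfect discrimination."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=bool)
    pos, neg = pred_probs[outcomes], pred_probs[~outcomes]
    greater = (pos[:, None] > neg[None, :]).mean()  # pairwise wins
    ties = (pos[:, None] == neg[None, :]).mean()    # ties count half
    return float(greater + 0.5 * ties)

# Toy data: the model claims ~80% success on six tasks, solves three.
probs = [0.9, 0.8, 0.85, 0.7, 0.8, 0.75]
solved = [1, 0, 1, 0, 1, 0]
print(overconfidence(probs, solved))  # 0.30: overconfident
print(auroc(probs, solved))           # ~0.94: discriminates well
```

On the toy data the model is overconfident by 0.30 while still discriminating fairly well; the abstract reports that current LLMs do poorly on both axes.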

---

Outline:

(00:37) Abstract

(01:44) Introduction

(05:09) Results

(05:12) Resource Acquisition

(10:12) Sandbagging

(14:04) Escaping Control

(18:35) Details of Threat Models

(18:50) Discussion

(18:53) Summary

(20:01) Why do trends differ between the threat models?

(21:09) Limitations and Future Directions

The original text contained 15 footnotes which were omitted from this narration.

---

First published:
July 13th, 2025

Source:
https://www.lesswrong.com/posts/9tHEibBBhQCHEyFsa/do-llms-know-what-they-re-capable-of-why-this-matters-for-ai

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Figure 1: Expected profit for the direct approach and recalibrated approach. The order of the LLMs in each family is by performance on a composite capability benchmark (BCB+MBPP+MMLU+GPQA). Interestingly, the least capable model (Llama-3-70b) has the highest expected profit (this is because it is relatively less overconfident). Error bars are 95% confidence intervals (Clopper-Pearson method; see the sketch after these captions).
Figure 2. Left: Expected profit as a function of baseline performance f (recalibrated approach, with the reward for success and the cost of failure both equal to 1). Note that the random baseline cannot go below 0 for the recalibrated approach (see footnote 9). Right: As f increases, the operating point for the optimal threshold slides up and right.
Figure 3: LLMs’ true performance and sandbagging performance. All tested LLMs score far below the target and trigger the sandbagging alert. To calculate error bars (95% confidence intervals), the LLM’s capability on each question is assumed deterministic (it either always succeeds or always fails when not sandbagging), and randomness arises from the sandbagging probabilities q_i (see Details of Threat Models for the definition of q_i). With this approach, the error bars on the [...]
Figure 4: Fractional deviation from sandbagging target, where the target for each LLM is set at 70% of that LLM's true performance. Error bars are calculated as described in the Figure 3 caption.
Figure 5: Exfiltration probability P_E. Error bars are 95% credible intervals (Bayesian approach) computed using a Jeffreys prior for TPR and FPR; the posterior distributions for TPR and FPR are obtained from the data and propagated to a distribution over P_E by Monte Carlo sampling. The error bars are large because only data points with maximally-confident estimated likelihood of success are involved (see Details of Threat Models), so the sample sizes are small.
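For readers curious how error bars like those in Figures 1 and 5 are computed, here is a minimal sketch of the two interval methods the captions name. It is not the authors’ code: the counts are invented, and the exfiltration model p_e is a stand-in placeholder, since the post’s actual formula for P_E is not reproduced in this description.

```python
# Sketch of the two interval methods named in the captions.
import numpy as np
from scipy.stats import beta

def clopper_pearson(successes, n, alpha=0.05):
    """Exact (Clopper-Pearson) CI for a binomial proportion (Figure 1)."""
    lo = beta.ppf(alpha / 2, successes, n - successes + 1) if successes > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, successes + 1, n - successes) if successes < n else 1.0
    return lo, hi

def jeffreys_mc_interval(tp, fn, fp, tn, p_e, n_samples=100_000, alpha=0.05):
    """Figure 5 style: Jeffreys Beta(0.5, 0.5) posteriors for TPR and FPR,
    propagated through a function p_e(tpr, fpr) by Monte Carlo."""
    rng = np.random.default_rng(0)
    tpr = rng.beta(tp + 0.5, fn + 0.5, n_samples)  # posterior draws for TPR
    fpr = rng.beta(fp + 0.5, tn + 0.5, n_samples)  # posterior draws for FPR
    samples = p_e(tpr, fpr)
    return np.quantile(samples, [alpha / 2, 1 - alpha / 2])

# Placeholder exfiltration model: NOT the paper's P_E, purely illustrative.
toy_p_e = lambda tpr, fpr: (1 - tpr) * (1 - fpr)

print(clopper_pearson(7, 10))                                   # Figure 1 method
print(jeffreys_mc_interval(tp=8, fn=2, fp=1, tn=9, p_e=toy_p_e))  # Figure 5 method
```

The Clopper-Pearson interval is conservative (its coverage is guaranteed to be at least the nominal level), which makes it a common choice for binomial proportions; the Jeffreys-plus-Monte-Carlo approach is a standard way to propagate posterior uncertainty through a derived quantity.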

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts or another podcast app.
