LessWrong (30+ Karma)

“Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings” by Casey Barkan, Sid Black, Oliver Sourbut

24 min • 14 July 2025

Audio note: this article contains 270 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

This post is a companion piece to a forthcoming paper.

This work was done as part of MATS 7.0 & 7.1.

Abstract

We explore how LLMs’ awareness of their own capabilities affects their ability to acquire resources, sandbag an evaluation, and escape AI control. We quantify LLMs’ self-awareness of capability as their accuracy in predicting their success on Python coding tasks before attempting the tasks. We find that current LLMs are quite poor at making these predictions, both because they are overconfident and because they have low discriminatory power in distinguishing tasks they are capable of from those they are not. Nevertheless, current LLMs’ predictions are good enough to non-trivially impact risk in our modeled scenarios, especially [...]
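The abstract’s two failure modes map onto standard metrics: overconfidence is a calibration gap, and discriminatory power is how well predictions separate solvable from unsolvable tasks (e.g., AUROC). The following sketch is ours, not the authors’ pipeline; it assumes each task comes with a stated success probability and a binary outcome.

```python
# Sketch (not the authors' code): scoring an LLM's self-predictions of
# task success along the two axes the abstract mentions.
import numpy as np

def overconfidence(pred_probs, outcomes):
    """Mean predicted success probability minus actual success rate.
    Positive values indicate overconfidence."""
    return float(np.mean(pred_probs) - np.mean(outcomes))

def auroc(pred_probs, outcomes):
    """Probability that a randomly chosen solved task got a higher
    predicted success probability than a randomly chosen failed task.
    0.5 = no discriminatory power; 1.0 = perfect discrimination."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=bool)
    pos, neg = pred_probs[outcomes], pred_probs[~outcomes]
    greater = (pos[:, None] > neg[None, :]).mean()  # pairwise wins
    ties = (pos[:, None] == neg[None, :]).mean()    # ties count half
    return float(greater + 0.5 * ties)

# Toy data: the model claims ~80% success on six tasks, solves three.
probs = [0.9, 0.8, 0.85, 0.7, 0.8, 0.75]
solved = [1, 0, 1, 0, 1, 0]
print(overconfidence(probs, solved))  # 0.30: overconfident
print(auroc(probs, solved))           # ~0.94: discriminates well
```

On the toy data the model is overconfident by 0.30 while still discriminating fairly well; the abstract reports that current LLMs do poorly on both axes.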

---

Outline:

(00:37) Abstract

(01:44) Introduction

(05:09) Results

(05:12) Resource Acquisition

(10:12) Sandbagging

(14:04) Escaping Control

(18:35) Details of Threat Models

(18:50) Discussion

(18:53) Summary

(20:01) Why do trends differ between the threat models?

(21:09) Limitations and Future Directions

The original text contained 15 footnotes which were omitted from this narration.

---

First published:
July 13th, 2025

Source:
https://www.lesswrong.com/posts/9tHEibBBhQCHEyFsa/do-llms-know-what-they-re-capable-of-why-this-matters-for-ai

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Figure 1: Expected profit for the direct approach and recalibrated approach. The order of the LLMs in each family is by performance on a composite capability benchmark (BCB+MBPP+MMLU+GPQA). Interestingly, the least capable model (Llama-3-70b) has the highest expected profit (this is because it is relatively less overconfident). Error bars are 95% confidence intervals (Clopper-Pearson method; see the sketch after these captions).
Figure 2. Left: Expected profit as a function of baseline performance f (recalibrated approach, with the reward for success and the cost of failure both equal to 1). Note that the random baseline cannot go below 0 for the recalibrated approach (see footnote 9). Right: As f increases, the operating point for the optimal threshold slides up and right.
Figure 3: LLMs’ true performance and sandbagging performance. All tested LLMs score far below the target and trigger the sandbagging alert. To calculate error bars (95% confidence intervals), the LLM’s capability on each question is assumed deterministic (it either always succeeds or always fails when not sandbagging), and randomness arises from the sandbagging probabilities q_i (see Details of Threat Models for the definition of q_i). With this approach, the error bars on the [...]
Figure 4: Fractional deviation from sandbagging target, where the target for each LLM is set at 70% of that LLM's true performance. Error bars are calculated as described in the Figure 3 caption.
Figure 5: Exfiltration probability P_E. Error bars are 95% credible intervals (Bayesian approach) computed using a Jeffreys prior for TPR and FPR; the posterior distributions for TPR and FPR are obtained from the data and propagated to a distribution over P_E by Monte Carlo sampling. The error bars are large because only data points with maximally-confident estimated likelihood of success are involved (see Details of Threat Models), so the sample sizes are small.
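For readers curious how error bars like those in Figures 1 and 5 are computed, here is a minimal sketch of the two interval methods the captions name. It is not the authors’ code: the counts are invented, and the exfiltration model p_e is a stand-in placeholder, since the post’s actual formula for P_E is not reproduced in this description.

```python
# Sketch of the two interval methods named in the captions.
import numpy as np
from scipy.stats import beta

def clopper_pearson(successes, n, alpha=0.05):
    """Exact (Clopper-Pearson) CI for a binomial proportion (Figure 1)."""
    lo = beta.ppf(alpha / 2, successes, n - successes + 1) if successes > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, successes + 1, n - successes) if successes < n else 1.0
    return lo, hi

def jeffreys_mc_interval(tp, fn, fp, tn, p_e, n_samples=100_000, alpha=0.05):
    """Figure 5 style: Jeffreys Beta(0.5, 0.5) posteriors for TPR and FPR,
    propagated through a function p_e(tpr, fpr) by Monte Carlo."""
    rng = np.random.default_rng(0)
    tpr = rng.beta(tp + 0.5, fn + 0.5, n_samples)  # posterior draws for TPR
    fpr = rng.beta(fp + 0.5, tn + 0.5, n_samples)  # posterior draws for FPR
    samples = p_e(tpr, fpr)
    return np.quantile(samples, [alpha / 2, 1 - alpha / 2])

# Placeholder exfiltration model: NOT the paper's P_E, purely illustrative.
toy_p_e = lambda tpr, fpr: (1 - tpr) * (1 - fpr)

print(clopper_pearson(7, 10))                                   # Figure 1 method
print(jeffreys_mc_interval(tp=8, fn=2, fp=1, tn=9, p_e=toy_p_e))  # Figure 5 method
```

The Clopper-Pearson interval is conservative (its coverage is guaranteed to be at least the nominal level), which makes it a common choice for binomial proportions; the Jeffreys-plus-Monte-Carlo approach is a standard way to propagate posterior uncertainty through a derived quantity.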

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts or another podcast app.
