LessWrong (30+ Karma)

“Unfaithful chain-of-thought as nudged reasoning” by Paul Bogdan, Uzay Macar, Arthur Conmy, Neel Nanda

20 min • 23 juli 2025

This piece is based on work conducted during MATS 8.0 and is part of a broader aim of interpreting chain-of-thought in reasoning models.

tl;dr

  • Research on chain-of-thought (CoT) unfaithfulness shows how models’ CoTs may omit information that is relevant to their final decision.
  • Here, we sketch hypotheses for why key information may be omitted from CoTs:
    • Training regimes could teach LLMs to omit information from their reasoning.
    • But perhaps more importantly, statements within a CoT generally have functional purposes, and faithfully mentioning some information may carry no benefits, so models don’t do it.
  • We make further claims about what's going on in faithfulness experiments and how hidden information impacts a CoT:
    • Unfaithful CoTs are often not purely post hoc rationalizations and instead can be understood as biased reasoning.
    • Models continually make choices during CoT, repeatedly drawing new propositions supporting or discouraging different answers.
    • Hidden information can nudge [...]

---

Outline:

(00:21) tl;dr

(01:54) Unfaithfulness

(04:16) CoT is functional, and faithfulness lacks benefits

(09:29) Hidden information can nudge CoTs

(09:48) Silent, soft, and steady spectres

(13:48) Nudges are plausible

(14:42) Nudged CoTs are hard to spot

(15:49) Safety and CoT monitoring

(18:11) Final summary

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
July 22nd, 2025

Source:
https://www.lesswrong.com/posts/vPAFPpRDEg3vjhNFi/unfaithful-chain-of-thought-as-nudged-reasoning

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Research figure showing model responses to history question with articulation comparison. Graph compares reasoning rates between Qwen, Gemini, and Deepseek models.
Two graphs showing

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Senaste avsnitt

Podcastbild

00:00 -00:00
00:00 -00:00