Summary
How does an LLM know the difference between its own thoughts and someone else's words?
To see why this is hard, let's look at what the world actually looks like to a model. Here's a simple chat where we ask Claude to check the day of the week. I took a snapshot of it midway through its follow-up response:
Left = what we see; right = what the LLM gets.
On the left is what we see in the chat interface: a structured conversation with distinct turns. On the right is what the model actually receives as input: a single, continuous stream [...]
---
Outline:
(00:12) Summary
[... 15 more sections]
---
First published:
June 22nd, 2026
Source:
https://www.lesswrong.com/posts/d8xDGzCEYE639qqEv/a-theory-of-prompt-injection-and-why-you-should-study-roles
---
Narrated by TYPE III AUDIO.
---
- We've been building a theory of how prompt injections work under the hood.
- We show it comes down to how LLMs perceive roles (the humble chat template tags).
- We use this theory to create new attacks, explain some weird mech interp results, and predict when attacks work.
- We also advocate for a new subfield focused on the science of roles, and sketch some unexplored new research problems.
- Work supported by CBAI and Cosmos. Another version of this post (with more inline colors) is here, and full ICML paper here.
How does an LLM know the difference between its own thoughts and someone else's words?
To see why this is hard, let's look at what the world actually looks like to a model. Here's a simple chat where we ask Claude to check the day of the week. I took a snapshot of it midway through its follow-up response:
Left = what we see; right = what the LLM gets.
On the left is what we see in the chat interface: a structured conversation with distinct turns. On the right is what the model actually receives as input: a single, continuous stream [...]
---
Outline:
(00:12) Summary
[... 15 more sections]
---
First published:
June 22nd, 2026
Source:
https://www.lesswrong.com/posts/d8xDGzCEYE639qqEv/a-theory-of-prompt-injection-and-why-you-should-study-roles
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Fler avsnitt av LessWrong (Curated & Popular)
Visa alla avsnitt av LessWrong (Curated & Popular)LessWrong (Curated & Popular) med LessWrong finns tillgänglig på flera plattformar. Informationen på denna sida kommer från offentliga podd-flöden.
