LessWrong (30+ Karma)

“Dodging systematic human errors in scalable oversight” by Benjamin Hilton, Geoffrey Irving

9 min • 14 May 2025

Audio note: this article contains 59 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Summary: Both our (UK AISI's) debate safety case sketch and Anthropic's research agenda point at systematic human error as a weak point for debate. This post talks through how one might strengthen a debate protocol to partially mitigate this.

Not too many errors in unknown places

The complexity theory models of debate assume some expensive verifier machine _M_ with access to a human oracle, such that

  1. If we ran _M_ in full, we’d get a safe answer
  2. _M_ is too expensive to run in full, meaning we need some interactive proof protocol (something like debate) to skip steps

Typically, _M_ is some recursive tree computation, where for simplicity we can think of human oracle queries as occurring at the leaves [...]
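To make the setup concrete, here is a minimal, hypothetical sketch (not from the post) of such a verifier _M_: a recursive tree computation whose leaves query a human oracle. All names (`Leaf`, `Node`, `run_M`, `human_oracle`) are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Leaf:
    question: str            # a question small enough for a human to answer directly


@dataclass
class Node:
    children: List["Tree"]   # subcomputations whose answers are aggregated


Tree = Union[Leaf, Node]


def run_M(tree: Tree, human_oracle: Callable[[str], bool]) -> bool:
    """Run the verifier M in full.

    The number of oracle calls grows with the size of the tree, which is why
    running M in full is too expensive and a debate-style protocol is needed
    to skip steps.
    """
    if isinstance(tree, Leaf):
        # Human oracle query at a leaf; systematic human error enters here.
        return human_oracle(tree.question)
    # Internal node: accept only if every subcomputation checks out.
    return all(run_M(child, human_oracle) for child in tree.children)
```

In this toy picture, a systematic human error corresponds to the oracle answering some class of leaf questions incorrectly, which is the weak point the post's modified protocols aim to mitigate.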

---

Outline:

(00:39) Not too many errors in unknown places

(04:01) A protocol that handles an ε-fraction of errors

(05:26) What distribution do we measure errors against?

(06:43) Cross-examination-like protocols

(08:27) Collaborate with us

---

First published:
May 14th, 2025

Source:
https://www.lesswrong.com/posts/EgRJtwQurNzz8CEfJ/dodging-systematic-human-errors-in-scalable-oversight

---

Narrated by TYPE III AUDIO.
