LessWrong (Curated & Popular)

"Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky

40 min · 5 April 2023

https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment. 

I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement. My short summary is:

  • Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of specific countermeasures, for which there are currently few ideas) would be the AI taking on a dangerous degree of convergent instrumental subgoals without sufficiently internalizing important safety/corrigibility properties.
  • I think this is possible, but much less likely than Nate thinks under at least some imaginable training processes.
