Previously, we've shared a few higher-effort project proposals, particularly relating to AI control. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of them seems pretty valuable. They vary widely in scope, area, and difficulty.
Control
These projects all relate to the field of AI control. Many of them are extensions of Redwood's previous work in this area.
Basic open questions in control
---
Outline:
Control
Basic open questions in control
Monitoring protocols
Untrusted monitoring and collusion
Elicitation, sandbagging, and diffuse threats (e.g. research sabotage)
Synthetic information and inputs
Training-time alignment methods
Science of (mis-)alignment
Alignment / training schemes
RL and Reward Hacking
Better understanding and interpretability
Other
---
First published:
July 14th, 2025
Source:
https://www.lesswrong.com/posts/RRxhzshdpneyTzKfq/recent-redwood-research-project-proposals
---
Narrated by TYPE III AUDIO.