“On METR’s AI Coding RCT” by Zvi

21 min • 18 July 2025

METR ran a proper RCT to see how much access to Cursor (using Sonnet 3.7) would accelerate coders working on their own open source repos.

Everyone surveyed expected a substantial speedup. The developers thought they were being substantially sped up.

Instead, it turned out that using Cursor slowed them down.

That surprised everyone, raising the question of why.

Currently, our best guess is that this comes down to a combination of two factors:

Deeply understood open source repos are close to a worst-case scenario for AI tools, because they require bespoke outputs in various ways and the coder has lots of detailed local knowledge of the codebase that the AI lacks.
The coders in question mostly did not have experience with similar AI tools. The lack of a learning curve during the experiment challenges this, but the tools very clearly have a [...]

---

Outline:

(01:27) Epic Fail

(02:42) The Core Result

(07:10) Okay So That Happened

(12:21) Beginner Mindset

(19:43) Overall Takeaways

---

First published:
July 18th, 2025

Source:
https://www.lesswrong.com/posts/m2QeMwD7mGKH6vDe2/on-metr-s-ai-coding-rct

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Four-panel meme using Gru character presentation format about AI's impact on coding speed

Bar graph comparing developer forecasts versus actual implementation times for AI-allowed and AI-disallowed tasks.

Graph showing AI's impact on developer productivity, forecasts versus actual results

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
