LessWrong (30+ Karma)

“10x more training compute = 5x greater task length (kind of)” by Expertium

5 min • 14 July 2025

I assume you are familiar with the METR paper: https://arxiv.org/abs/2503.14499

In case you aren't: the authors measured how long it takes humans to complete various tasks, had LLMs attempt those same tasks, and then calculated the task length (in human time) at which LLMs succeed 50% or 80% of the time. Basically, "Model X can do task Y with W% reliability, where Y takes humans Z amount of time to do."

Interactive graph: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
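To make the 50%-reliability metric concrete, here is a minimal sketch with invented numbers; scipy's curve_fit stands in for whatever fitting procedure the paper actually uses. The idea is to fit a success-probability curve against log task length for one model and read off the task length where success drops to 50%.

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented numbers, purely for illustration -- not the paper's data or code.
# Idea: for one model, fit a logistic curve of success probability against
# log2(task length), then read off the task length where success hits 50%.
task_len_min = np.array([1, 2, 4, 8, 16, 32, 64, 128], dtype=float)  # human-minutes
success_rate = np.array([0.95, 0.90, 0.80, 0.65, 0.50, 0.30, 0.15, 0.05])

def logistic(x, x50, k):
    # Success probability as a function of x = log2(task length).
    return 1.0 / (1.0 + np.exp(k * (x - x50)))

(x50, k), _ = curve_fit(logistic, np.log2(task_len_min), success_rate, p0=[4.0, 1.0])
print(f"50% time horizon ~ {2 ** x50:.0f} human-minutes")  # ~16 with these numbers
```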

In the paper, they plotted task length as a function of release date and found that the two correlate very strongly.

Note that for 80% reliability the slope is the same.
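For intuition about what that trend looks like quantitatively, here is a minimal sketch with made-up release dates and task lengths (not the paper's data): regress log2 task length on release date and convert the slope into a doubling time.

```python
import numpy as np

# Made-up release dates and 50%-reliability task lengths, for illustration only.
release_year = np.array([2019.2, 2020.5, 2022.2, 2023.2, 2024.5, 2025.1])
task_len_min = np.array([0.05, 0.6, 4.0, 10.0, 30.0, 60.0])  # human-minutes

# Straight line in (release date, log2 task length) space; the slope is
# doublings per year, so 12 / slope is the doubling time in months.
slope, intercept = np.polyfit(release_year, np.log2(task_len_min), 1)
print(f"task length doubles roughly every {12 / slope:.1f} months")
```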

IMO this is by far the most useful paper for predicting AI timelines. However, I was upset that their analysis did not include compute. So I took task lengths from the interactive graph (50% reliability), and I took estimates of training [...]
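To see how the headline number falls out of a fit like that: in log-log space, "10x more training compute = 5x greater task length" corresponds to a slope of log10(5) ≈ 0.70. Below is a minimal sketch with invented compute estimates and task lengths, not the post's actual data.

```python
import numpy as np

# Invented training-compute estimates (FLOP) and 50%-reliability task lengths
# (human-minutes) -- placeholders, not the post's actual data.
compute_flop = np.array([1e21, 1e22, 1e23, 1e24, 1e25])
task_len_min = np.array([0.5, 2.4, 12.0, 65.0, 310.0])

# Power-law fit: log10(task length) = slope * log10(compute) + intercept.
slope, intercept = np.polyfit(np.log10(compute_flop), np.log10(task_len_min), 1)

# A slope of log10(5) ~ 0.70 means each 10x of compute multiplies task length
# by 10**slope ~ 5x, which is the relationship in the post's title.
print(f"slope = {slope:.2f}; 10x compute -> {10 ** slope:.1f}x task length")
```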

---

First published:
July 13th, 2025

Source:
https://www.lesswrong.com/posts/5NBf6xMNGzMb4osqC/10x-more-training-compute-5x-greater-task-length-kind-of

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Graph showing AI task completion times from 2019-2027, with GPT models plotted.
Graph plotting task length versus training compute for various language models.

The second graph shows task length rising with training compute, with models from GPT-2 through Claude 3.7 plotted along a trend line.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
