This is a rough research note where the primary objective was my own learning. I am sharing it because I’d love feedback and I thought the results were interesting.
Introduction
A recent METR paper [1] showed that the length of software engineering tasks that LLMs could successfully complete appeared to be doubling roughly every seven months. I asked the same question for offensive cybersecurity, a domain with distinct skills and unique AI-safety implications.
Applying METR's methodology to five cyber benchmarks, with tasks ranging from 0.5 s to 25 h in human-expert-estimated completion time, I evaluated many state-of-the-art model releases from the past five years. I found:
Below I outline the datasets, IRT-based analysis, results and caveats. [...]
---
Outline:
(00:20) Introduction
(01:34) Methodology
(04:07) Datasets
(11:49) Models
(13:33) Results
(18:26) Limitations
(20:47) Personal Retrospective & Next Steps
(23:08) References
---
First published:
July 2nd, 2025
Source:
https://www.lesswrong.com/posts/fjgYkTWKAXSxsxdsj/untitled-draft-zgxc
---
Narrated by TYPE III AUDIO.