LessWrong (30+ Karma)

[Linkpost] “METR Research Update: Algorithmic vs. Holistic Evaluation” by David Rein

1 min • 14 August 2025
This is a link post.

TL;DR

  • On 18 real tasks from two large open-source repositories, early-2025 AI agents often implement functionally correct code that cannot be easily used as-is, because of issues with test coverage, formatting/linting, or general code quality.
  • This suggests that the automatic scoring used by many benchmarks may overestimate the real-world performance of AI agents.

---

First published:
August 13th, 2025

Source:
https://www.lesswrong.com/posts/25JGNnT9Kg4aN5N5s/metr-research-update-algorithmic-vs-holistic-evaluation

Linkpost URL:
https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Bar graph comparing two metrics of Claude 3.7 Sonnet's software engineering performance.
