Review: Measuring AI Ability to Complete Long Software Tasks


This article reviews a research paper by Model Evaluation & Threat Research (METR) that measures how well AI agents complete long-duration software tasks. The key metric is the "time horizon": the length of a task, measured by how long a human expert takes to complete it, that an AI can solve with a 50% success rate. Testing 12 LLMs released between 2019 and 2025 on 170 tasks ranging from a few seconds to 8 hours of human effort, the authors found that time horizons have doubled roughly every seven months, with signs of recent acceleration. GPT-2 had a 2-second horizon, while Opus 4.6 reached ~12 hours. The article discusses the task "messiness" factors that degrade AI performance and notes that, if the trend holds, AIs may succeed 50% of the time on month-long tasks sometime between 2027 and 2031. It also acknowledges the limitations of benchmarks as a proxy for real-world software work.
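The doubling trend lends itself to a quick back-of-the-envelope extrapolation. The sketch below is not the paper's methodology, only plain exponential arithmetic under stated assumptions: a constant seven-month doubling time, a starting horizon of about 12 hours (the figure quoted above), and a "month-long task" taken to be 167 working hours. The function names and the 167-hour figure are illustrative choices, not values taken from this summary.

```python
import math


def months_until(current_hours, target_hours, doubling_months=7.0):
    """Months needed for the time horizon to grow from current to target,
    assuming it keeps doubling every `doubling_months` months."""
    if current_hours <= 0 or target_hours <= 0 or doubling_months <= 0:
        raise ValueError("hours and doubling time must be positive")
    doublings = math.log2(target_hours / current_hours)
    # A target at or below the current horizon is already reached.
    return max(0.0, doublings * doubling_months)


def horizon_after(current_hours, months, doubling_months=7.0):
    """Projected time horizon (in hours) after `months` of steady doubling."""
    return current_hours * 2 ** (months / doubling_months)


if __name__ == "__main__":
    START_HOURS = 12.0         # assumption: the ~12 hour horizon quoted above
    WORK_MONTH_HOURS = 167.0   # assumption: one month of full-time work
    wait = months_until(START_HOURS, WORK_MONTH_HOURS)
    print(f"Months until a month-long horizon: {wait:.1f}")
    print(f"Horizon after 12 months: {horizon_after(START_HOURS, 12):.1f} hours")
```

Changing `doubling_months` shows how sensitive the projection is to the assumed growth rate, which is one reason the forecast spans several years rather than a single date.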

