A. Jesse Jiryu Davis • 4/1/2026

Review: Measuring AI Ability to Complete Long Software Tasks

This article reviews a research paper by Model Evaluation & Threat Research (METR) that measures AI agents' ability to complete long-duration software tasks. The key metric is the 'time horizon': the time a human expert needs to complete a task that an AI can solve with a 50% success rate. Testing 12 LLMs released from 2019 to 2025 on 170 tasks ranging from a few seconds to 8 hours of human effort, the authors found that time horizons double roughly every seven months, with signs of recent acceleration. GPT-2 had a 2-second horizon, while Opus 4.6 reached ~12 hours. The article discusses the task 'messiness' factors that degrade AI performance and notes that sometime between 2027 and 2031, AIs may succeed 50% of the time on month-long tasks. It also acknowledges the limitations of benchmarking real-world software work.
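The doubling trend lends itself to a quick back-of-the-envelope extrapolation. The sketch below is not taken from the paper or the review. It assumes a clean exponential with a fixed seven-month doubling time, uses the ~12-hour figure as the starting horizon, and treats a "month-long task" as 167 working hours, which is an illustrative assumption. The function names are hypothetical.

```python
import math

DOUBLING_MONTHS = 7.0      # doubling time quoted in the summary
MONTH_TASK_HOURS = 167.0   # ASSUMPTION: one month of full-time work, for illustration only


def horizon_after(start_hours: float, months: float,
                  doubling_months: float = DOUBLING_MONTHS) -> float:
    """Time horizon after `months`, assuming steady exponential growth."""
    return start_hours * 2 ** (months / doubling_months)


def months_to_reach(target_hours: float, start_hours: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for the horizon to grow from start_hours to target_hours."""
    if target_hours <= 0 or start_hours <= 0:
        raise ValueError("horizons must be positive")
    # Number of doublings required, times the length of each doubling.
    return doubling_months * math.log2(target_hours / start_hours)


if __name__ == "__main__":
    months = months_to_reach(MONTH_TASK_HOURS, start_hours=12.0)
    print(f"Months until a {MONTH_TASK_HOURS:.0f}-hour horizon: {months:.1f}")
```

Under these assumptions the answer is a little over two years from the ~12-hour starting point. A shorter doubling time (the "recent acceleration") pulls the date earlier, and a slowdown pushes it later, which is one way to read the width of the 2027-2031 range.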


#software engineering #llm #AI Agents