Review: Measuring AI Ability to Complete Long Software Tasks
This article reviews a research paper by Model Evaluation & Threat Research (METR) that measures how well AI agents complete long-duration software tasks. The key metric is the 'time horizon': the length of task, measured by how long a human expert takes to complete it, that an AI can solve with 50% success. Testing 12 LLMs released between 2019 and 2025 on 170 tasks ranging from a few seconds to 8 hours of human time, the authors found that time horizons doubled roughly every seven months, with signs of acceleration in recent models. GPT-2 had a 2-second horizon, while Opus 4.6 reached ~12 hours. The article discusses the task 'messiness' factors that degrade AI performance and notes that, if the trend holds, AIs may succeed 50% of the time on month-long tasks sometime between 2027 and 2031. It also acknowledges the limitations of benchmarking real-world software work.
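The doubling claim is easy to sanity-check with a little arithmetic. The sketch below is only an illustration of constant-doubling-time extrapolation, not the paper's fitting procedure: the function names are hypothetical, the 12-hour starting point and 7-month doubling time are taken from the summary above, and the figure of 167 working hours for a "month-long" task is an assumption made here for the example.

```python
import math


def horizon_after(h0_hours: float, months_elapsed: float,
                  doubling_months: float = 7.0) -> float:
    """Time horizon after `months_elapsed`, assuming a constant doubling time."""
    return h0_hours * 2 ** (months_elapsed / doubling_months)


def months_to_reach(h0_hours: float, target_hours: float,
                    doubling_months: float = 7.0) -> float:
    """Months needed for the horizon to grow from h0_hours to target_hours."""
    if h0_hours <= 0 or target_hours <= 0:
        raise ValueError("horizons must be positive")
    return doubling_months * math.log2(target_hours / h0_hours)


if __name__ == "__main__":
    START_HOURS = 12.0           # ~12-hour horizon quoted in the summary
    MONTH_OF_WORK_HOURS = 167.0  # assumption: one working month of human effort

    months = months_to_reach(START_HOURS, MONTH_OF_WORK_HOURS)
    print(f"Months until a month-long horizon at a 7-month doubling: {months:.1f}")

    # A faster doubling time (hypothetical value) shortens the wait proportionally.
    faster = months_to_reach(START_HOURS, MONTH_OF_WORK_HOURS, doubling_months=4.0)
    print(f"Same target at a 4-month doubling: {faster:.1f}")
```

Because the result scales linearly with the doubling time, small changes in the assumed growth rate shift the projected date by many months, which is one reason a forecast like this is better read as a range than as a single year.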