After "AI Makes You 19% Slower" — The Selection Bias METR Acknowledged, and the Evolving Truth About Productivity
This article was generated by AI. The accuracy of the content is not guaranteed, and we accept no responsibility for any damages resulting from use of this article. By continuing to read, you agree to the Terms of Use.
- Target audience: Software engineers, developers who use AI tools daily
- Prerequisites: Experience using AI tools like GitHub Copilot, Cursor, Claude Code
- Reading time: 15–20 minutes
Overview
“Experienced developers become 19% slower when using AI” — this finding, published by METR (Model Evaluation & Threat Research) in July 2025, was widely cited as evidence that AI coding tools might not deliver on their promises.
Then, on February 24, 2026, METR itself announced a change to its experimental design [1]. The reason was selection bias: 30–50% of developers reported withholding some tasks because they did not want to do them without AI, which undermines the reliability of the estimates.
This article examines both the original study (published July 2025) and the follow-up (run from August 2025 onward, published February 2026), and attempts the most honest answer available as of March 2026 to the question “Does AI make developers faster or slower?”
Timeline of the METR Research
Study 1 (February–June 2025)
The METR study was conducted under the following conditions [2]:
- Participants: 16 experienced open-source developers
- Tasks: 246 tasks
- Method: Randomized controlled trial (RCT) — AI usage was randomly permitted or denied for each task
- Repositories: Projects the participants had contributed to for multiple years
- Tools: Primarily Cursor Pro and Claude 3.5/3.7 Sonnet
- Compensation: $150/hour
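To make the per-task randomization concrete, here is a minimal sketch. It is not METR's analysis code: the function names, the fixed seed, the sample times, and the ratio-of-geometric-means estimator are all assumptions chosen for illustration, and the published study uses its own statistical model.

```python
import math
import random


def assign_conditions(task_ids, seed=0):
    """Randomly mark each task as AI-allowed (True) or AI-disallowed (False)."""
    rng = random.Random(seed)
    return {task_id: rng.random() < 0.5 for task_id in task_ids}


def geometric_mean(values):
    """Geometric mean, which suits skewed, strictly positive completion times."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def time_change_with_ai(minutes_with_ai, minutes_without_ai):
    """Relative change in completion time: +0.19 means 19% longer with AI."""
    return geometric_mean(minutes_with_ai) / geometric_mean(minutes_without_ai) - 1.0


# Hypothetical completion times in minutes, invented for this example only.
with_ai = [60.0, 120.0, 240.0]
without_ai = [50.0, 100.0, 200.0]
print(f"{time_change_with_ai(with_ai, without_ai):+.0%}")  # +20%
```

Randomizing per task, not per developer, lets each developer serve as their own control, which matters when the sample is only 16 people.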
Key Findings
- Using AI made tasks take 19% longer (confidence interval: +2% to +39%)
- Developers predicted they would be “24% faster” beforehand
- Even after being slower, they believed they had been “20% faster”
Developers believed they had been 20% faster when they had in fact been 19% slower, a 39-point gap between perception and reality. This was the study’s most striking number, and the one most widely reported.
Study 2 (August 2025 onward)
METR scaled up and replicated the experiment [1]:
- Participants: 57 (10 from the original study + 47 new)
- Tasks: 800+
- Repositories: 143 (more diverse — including smaller, greenfield, and less mature projects)
- Tools: Latest AI tools (including agentic tools like Claude Code and Codex)
- Compensation: $50/hour (one-third of Study 1)
Results
| Cohort | Estimated change in task completion time with AI | Confidence Interval |
|---|---|---|
| Study 2: returning Study 1 participants | -18% (18% faster) | -38% to +9% |
| Study 2: new participants | -4% (4% faster) | -15% to +9% |
| Study 1 (reference) | +19% (19% slower) | +2% to +39% |
Looking at the numbers alone, this appears to be a dramatic improvement from “19% slower” to “18% faster.” However, METR does not take these results at face value.
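Because the table reports changes in completion time and not in speed, the two are easy to confuse. The sketch below converts one into the other. The helper name is hypothetical, and the conversion is plain arithmetic, not a figure METR publishes.

```python
def speed_factor(time_change):
    """Convert a relative change in completion time into a speed multiplier.

    time_change = -0.18 means tasks took 18% less time; +0.19 means 19% more.
    """
    return 1.0 / (1.0 + time_change)


for label, change in [
    ("Study 1", 0.19),
    ("Study 2, returning participants", -0.18),
    ("Study 2, new participants", -0.04),
]:
    print(f"{label}: {change:+.0%} time -> {speed_factor(change):.2f}x speed")
```

An 18% reduction in time therefore corresponds to working about 1.22 times as fast, and a 19% increase in time to about 0.84 times the original speed.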
Three Problems METR Itself Acknowledged
Problem 1: Developers Refused to Work Without AI
Throughout 2025, the adoption of agentic tools like Claude Code and Codex expanded rapidly. This created a significant shift in study participation [1]:
“The fraction of developers saying they don’t want to do 50% of their work without AI is increasing. This is despite our study paying $50/hour to work on tasks they enjoy.”
In other words, the developers who benefit most from AI are the least likely to participate, so the study tends to underestimate the speedup from AI.
Problem 2: Worsening Task Selection Bias
Even participating developers were selective about which tasks they submitted [1]:
“In surveys, 30–50% of developers reported not submitting some tasks because they wouldn’t want to do them without AI.”
One developer’s testimony vividly illustrates the problem:
“I’ve realized I’m actually doing quite biased task selection… I avoid tasks that AI could finish in 2 hours but would take me 20 hours manually. If that task got assigned to the no-AI condition, it would be really painful.” [1]
The tasks where AI adds the most value were being systematically excluded from the study.
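The effect of this exclusion can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: the uniform distribution of per-task savings, the 50% skip threshold, and the function name. None of it comes from METR's data.

```python
import random
import statistics


def simulate(n_tasks=10_000, skip_threshold=0.5, seed=0):
    """Compare the true average time saving with the saving seen after self-selection."""
    rng = random.Random(seed)
    # Hypothetical fraction of time saved by AI on each task, uniform from 0% to 80%.
    savings = [rng.uniform(0.0, 0.8) for _ in range(n_tasks)]
    # Assumption: developers never submit tasks where AI would save more than the
    # threshold, because drawing the no-AI condition for them would be too painful.
    submitted = [s for s in savings if s <= skip_threshold]
    return statistics.mean(savings), statistics.mean(submitted)


true_saving, observed_saving = simulate()
print(f"true mean saving:     {true_saving:.1%}")
print(f"observed mean saving: {observed_saving:.1%}")
```

With these made-up parameters the true average saving is close to 40%, while the average over submitted tasks is close to 25%. The randomization within the study stays valid, but it can only describe the tasks that were submitted.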
Problem 3: Inability to Measure Parallel Work
The emergence of agentic AI tools fundamentally changed how developers work [1]:
- Running multiple AI agents simultaneously while doing other work
- Starting new tasks while waiting for agents to complete
- Ambiguity about how to measure “parallel work” time
flowchart TB
B1["⏱️ Traditional: Start Task A"]
B1 --> B2["Task A Complete"]
B2 --> B3["Start Task B"]
B3 --> B4["Task B Complete"]
A1["🤖 Agentic:<br>Assign Task A to Agent"]
A1 --> A2["Start working on Task B"]
A2 --> A3["Review Agent A's results"]
A3 --> A4["Assign Task C to Agent"]
B4 --> Q["⚠️ RCT assumes '1 task = 1 session'<br>but parallel work breaks this"]
A4 --> Q
METR’s transcript analysis [3] found a strong correlation between parallel agent usage and time savings. The staff member with the highest time savings ran an average of 2.32 main agents simultaneously and recorded an 11.62x time-savings factor. Other staff members ran 1.05–1.52 agents on average and recorded lower savings.
However, this analysis was based on 5,305 Claude Code transcripts from 7 internal METR staff members, and METR itself notes that the result is a soft upper bound: actual productivity multipliers would be lower [3].
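The measurement ambiguity with parallel work can be sketched as the difference between summing per-task durations and measuring elapsed wall-clock time. The session data and function names below are hypothetical and are not taken from METR's transcripts.

```python
def summed_minutes(intervals):
    """Total of each task's own duration; overlapping time is counted once per task."""
    return sum(end - start for start, end in intervals)


def wall_clock_minutes(intervals):
    """Elapsed time during which at least one task was active (union of intervals)."""
    total = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Hypothetical session, as (start, end) minutes per task.
# Task A runs on an agent while the developer works on B; C overlaps the review of A.
session = [(0, 90), (10, 60), (60, 120)]
print(summed_minutes(session))      # 200
print(wall_clock_minutes(session))  # 120
```

A per-task timer would attribute 200 minutes of work to a session that lasted 120 minutes, so the answer to "how long did the task take" depends on which clock is used.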
Correcting for Selection Bias: Where Is the “True Effect”?
A third-party statistical analysis posted on LessWrong [4] dug into the heterogeneity of METR’s data:
- Overall: ~6% speedup
- Tasks predicted to benefit most from AI (predicted AI advantage of 60+ minutes): 12% speedup
- Most effective developers: 25% speedup
This analysis applied a heuristic correction assuming “50% of tasks/developers were excluded by selection bias,” estimating the true speedup at approximately 20% [4].
The confidence intervals are wide and the correction method is heuristic, so this isn’t a definitive number. But the direction is clear: METR’s measurements represent a lower bound, and the true effect is likely higher.
However, “AI Makes You Faster” Isn’t Universal
Let’s push back against premature optimism here.
Bottleneck Migration: The Code Review Crisis
Telemetry data collected by Faros AI from over 10,000 developers [5] shows that AI productivity gains disappear at the organizational level.
| Metric | High AI Adoption vs. Low AI Adoption Teams |
|---|---|
| Tasks processed | +21% |
| PRs merged | +98% |
| Average PR size | +154% |
| Review time | +91% |
| Bugs per developer | +9% |
| Org-level DORA metrics | No change |
flowchart TB
A["✅ AI doubles code generation speed"] --> B["PRs increase +98%<br>Size also +154%"]
B --> C["Review time inflates +91%"]
C --> D["❌ Org delivery speed<br>unchanged"]
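One way to see why review comes under pressure is to combine the two PR rows of the table. The sketch below assumes that the amount of code reaching reviewers scales with PR count times average PR size. That assumption is a simplification introduced here, not a figure from the Faros AI report.

```python
def relative_volume(pr_count_change, pr_size_change):
    """Total changed code relative to baseline, from relative changes in PR count and size."""
    return (1.0 + pr_count_change) * (1.0 + pr_size_change)


# +98% PRs merged and +154% average PR size, as listed in the table.
volume = relative_volume(0.98, 1.54)
print(f"changed code reaching review: {volume:.2f}x baseline")
```

Under that assumption, about five times as much changed code reaches reviewers in high-adoption teams.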
This is a textbook example of Amdahl’s Law. Coding accounts for only 25–35% of the software development lifecycle [6]. Even if coding becomes 100% faster, the overall improvement caps at 15–25%. In reality, the bottleneck simply shifted to code review.
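The Amdahl’s Law arithmetic can be checked with a few lines of code. The function name is hypothetical, and the sketch reads “100% faster” as coding at twice the original speed, which is an interpretation made here.

```python
def amdahl_speedup(fraction, stage_speedup):
    """Overall speedup when only `fraction` of the work is accelerated by `stage_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


for coding_share in (0.25, 0.35):
    doubled = amdahl_speedup(coding_share, 2.0)  # coding twice as fast
    limit = 1.0 / (1.0 - coding_share)           # coding takes no time at all
    print(
        f"coding share {coding_share:.0%}: "
        f"{doubled - 1:.0%} faster if coding doubles, "
        f"{limit - 1:.0%} faster at the theoretical limit"
    )
```

Under these assumptions, doubling coding speed yields an overall gain of roughly 14% to 21%, in the same neighbourhood as the range quoted above. Even with coding reduced to zero time, the gain stays bounded by the share of work that is not coding.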
Macroeconomic Data Shows No Change
Philipp Dubach’s comprehensive analysis [6] highlights the silence in macro data:
- Apollo Global Management chief economist Torsten Slok: “AI is everywhere except in the macroeconomic data”
- NBER’s February 2026 survey: Over 80% of firms reported no productivity impact from AI in the past 3 years
- Expected improvement over the next 3 years: 1.4%
- 2024 Nobel economics laureate Daron Acemoglu: AI-driven total factor productivity growth will be 0.5% over the next decade
Despite a 92.6% adoption rate, no change is measurable at the organizational or economic level.
Code Quality Concerns
Data on AI-generated code quality is not reassuring either [6]:
- Veracode: 45% of AI-generated code contains OWASP Top 10 vulnerabilities
- CodeRabbit: AI-generated code has 2.74x more security vulnerabilities
- Black Duck 2026 OSSRA Report: Vulnerabilities per codebase increased +107% year-over-year (280 → 581)
- AI-generated code contains 1.7x more issues than human-written code
The Real Question METR Is Asking
Synthesizing the discussion above, a clear picture emerges:
flowchart TB
T["✅ Task Level<br>Evidence of improvement<br>(METR follow-up, Google RCT)"]
TM["⚠️ Team Level<br>Volume increases but review<br>becomes bottleneck (Faros AI)"]
O["❌ Org Level<br>Delivery speed & quality<br>unchanged (DORA 2025)"]
MA["❌ Macro Level<br>Economy-wide productivity<br>unchanged (NBER 2026)"]
T --> TM --> O --> MA
The true contribution of METR’s research isn’t the “19% slower” number itself. It’s revealing the far more complex reality behind the simple narrative that “AI makes developers faster.”
What We Can Say as of March 2026
Based on the timeline of METR’s research, the following represents the most honest assessment at this point.
Near-Certainties
- METR Study 1’s “19% slower” is unsuitable for generalization due to selection bias and specific conditions. METR itself acknowledges this [1].
- At the task level, AI is making many developers faster. However, the magnitude of the effect depends heavily on context.
- A gap exists between perception and reality. Developers tend to overestimate AI’s actual impact. The 39-point perception gap was confirmed in the follow-up study.
Accumulating Evidence
- Parallel agent usage dramatically changes outcomes. Measuring productivity in the traditional “one task → complete → next task” pattern likely fails to capture the true value of agentic AI [3].
- At the organizational level, the bottleneck shifts to code review. Individual speed gains don’t translate directly to organizational improvement [5].
Still Unknown
- AI’s true task-level effect. The bias-corrected estimate is a speedup of approximately 20% [4], but the confidence intervals are wide.
- Long-term macroeconomic effects. Like 1990s IT investment, there may be a delayed impact that hasn’t yet materialized.
Your Experience Is Probably Right — With Three Caveats
If you feel that “AI is clearly making me faster,” your experience is likely not wrong. METR itself states that “as of early 2026, developers are likely faster with AI tools” [1].
However, keep these points in mind:
- The perception-reality gap still exists. Don’t over-trust your own sense of improvement — measure it when possible.
- You being faster doesn’t mean your team or organization is faster. Pay attention to where bottlenecks migrate.
- Watch AI-generated code quality. Ensure you’re not trading speed for increased security risk.
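For the first caveat, a rough self-check is to log an estimate before each task and the actual time afterwards. The record fields and function below are hypothetical. Note that the no-AI baseline is itself a guess, so this is a sanity check and not a controlled measurement like METR’s randomized design.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One logged task: the estimate is made before starting, the rest afterwards."""
    estimated_minutes_without_ai: float
    perceived_saving: float  # e.g. 0.20 if the task felt 20% faster with AI
    actual_minutes_with_ai: float


def perception_gap(records):
    """Perceived minus measured time saving, in percentage points (positive = overestimate)."""
    baseline = sum(r.estimated_minutes_without_ai for r in records)
    actual = sum(r.actual_minutes_with_ai for r in records)
    measured_saving = 1.0 - actual / baseline
    perceived = sum(r.perceived_saving for r in records) / len(records)
    return (perceived - measured_saving) * 100


# Hypothetical log entries, invented for this example.
log = [TaskRecord(60, 0.20, 66), TaskRecord(120, 0.20, 150)]
print(f"{perception_gap(log):.0f} points")  # 40 points
```

A consistently positive gap over many tasks suggests that the sense of speed is running ahead of the clock.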
Research isn’t “wrong,” and neither is your experience. The problem is that both represent partial truths.
References
References are listed in the order they appear in the text.
[1] We are Changing our Developer Productivity Experiment Design - METR (2026). Joel Becker, Nate Rush, Tom Cunningham, David Rein, Khalid Mahamud. [Reliability: High]
[2] Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - METR (2025). [Reliability: High]
[3] Analyzing coding agent transcripts to upper bound productivity gains from AI agents - METR (2026). Amy Deng. Research note. [Reliability: Medium–High]
[4] Assessing heterogeneity in METR’s late 2025 developer productivity experiment - LessWrong (2026). Third-party statistical analysis of METR Study 2 data. [Reliability: Medium]
[5] The AI Productivity Paradox Report 2025 - Faros AI (2025). Analysis based on telemetry data from 10,000+ developers. [Reliability: Medium–High]
[6] AI Coding Productivity Paradox: 93% Adoption, 10% Gains - Philipp D. Dubach (2026). Comprehensive analysis integrating multiple studies. [Reliability: Medium]
Additional References (not cited by number in the text)
- The reality of AI-Assisted software engineering productivity - Addy Osmani (2025). Comprehensive review citing Google RCT (21% faster) and others. [Reliability: Medium–High]
- My Participation in the METR AI Productivity Study - Domenic Denicola (2025). Reflection from a Study 1 participant. [Reliability: Medium–High]
- DORA Report 2025 - Google Cloud (2025). Comprehensive survey of software delivery metrics. [Reliability: High]
- [AI 2025 Stack Overflow Developer Survey](https://survey.stackoverflow.co/2025/ai) - Stack Overflow (2025). AI favorability dropped from 70% to 60%. [Reliability: High]