After "AI Makes You 19% Slower" — The Selection Bias METR Acknowledged, and the Evolving Truth About Productivity

Posted Jan 21, 2026

AI-Generated Content

This article was generated by AI. The accuracy of the content is not guaranteed, and we accept no responsibility for any damages resulting from use of this article. By continuing to read, you agree to the Terms of Use.

Target audience: Software engineers, developers who use AI tools daily
Prerequisites: Experience using AI tools like GitHub Copilot, Cursor, Claude Code
Reading time: 15–20 minutes

Overview

“Experienced developers become 19% slower when using AI” — this finding, published by METR (Model Evaluation & Threat Research) in July 2025, was widely cited as evidence that AI coding tools might not deliver on their promises.

Then, on February 24, 2026, METR itself announced a change in experimental design¹. The reason was selection bias: 30–50% of developers reported not submitting some tasks because they would not want to do them without AI, which undermined the study’s reliability.

This article examines both the original study (conducted February–June 2025, published July 2025) and the follow-up (run from August 2025 onward, results published February 2026), and attempts the most honest answer possible to the question “Does AI make developers faster or slower?” as of March 2026.

Timeline of the METR Research

Study 1 (February–June 2025)

The METR study was conducted under the following conditions²:

Participants: 16 experienced open-source developers
Tasks: 246 tasks
Method: Randomized controlled trial (RCT) — AI usage was randomly permitted or denied for each task
Repositories: Projects the participants had contributed to for multiple years
Tools: Primarily Cursor Pro and Claude 3.5/3.7 Sonnet
Compensation: $150/hour

Key Findings

Using AI made tasks take 19% longer (confidence interval: +2% to +39%)
Developers predicted they would be “24% faster” beforehand
Even after being slower, they believed they had been “20% faster”

A 39-point gap between perception (20% faster) and reality (19% slower): this was the study’s most striking number, and the one most widely reported.

Study 2 (August 2025 onward)

METR scaled up and replicated the experiment¹:

Participants: 57 (10 from the original study + 47 new)
Tasks: 800+
Repositories: 143 (more diverse — including smaller, greenfield, and less mature projects)
Tools: Latest AI tools (including agentic tools like Claude Code and Codex)
Compensation: $50/hour (one-third of Study 1)

Results

Cohort	Estimated Change in Task Time	Confidence Interval
Study 1 participants (continuing)	-18% (18% faster)	-38% to +9%
New participants	-4% (4% faster)	-15% to +9%
Study 1 (reference)	+19% (19% slower)	+2% to +39%

Looking at the numbers alone, this appears to be a dramatic improvement from “19% slower” to “18% faster.” However, METR does not take these results at face value.
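
Percent changes in task time and speed multipliers are easy to mix up, so a small conversion helps when comparing the rows above. The sketch below assumes the table's percentages are changes in completion time (positive means slower); the function name is illustrative and not taken from METR's materials.

```python
def time_change_to_speed_multiplier(time_change_pct: float) -> float:
    """Convert a percentage change in completion time into a speed multiplier.

    +19 means tasks take 19% longer; -18 means they take 18% less time.
    A multiplier above 1.0 means faster, below 1.0 means slower.
    """
    time_ratio = 1.0 + time_change_pct / 100.0
    if time_ratio <= 0:
        raise ValueError("time change must be greater than -100%")
    return 1.0 / time_ratio


# Point estimates from the table, read as changes in task time
# (an assumption of this sketch).
for label, pct in [
    ("Study 1", 19.0),
    ("Study 2, returning participants", -18.0),
    ("Study 2, new participants", -4.0),
]:
    print(f"{label}: {time_change_to_speed_multiplier(pct):.2f}x speed")
```

Under that reading, "19% slower" corresponds to working at about 0.84x speed, and "18% faster" to about 1.22x, so the two headline numbers are not mirror images of each other.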

Three Problems METR Itself Acknowledged

Problem 1: Developers Refused to Work Without AI

Throughout 2025, the adoption of agentic tools like Claude Code and Codex expanded rapidly. This created a significant shift in study participation¹:

“The fraction of developers saying they don’t want to do 50% of their work without AI is increasing. This is despite our study paying $50/hour to work on tasks they enjoy.”

In other words, the developers who benefit most from AI are the least likely to participate, so the study’s estimates understate AI’s benefit.

Problem 2: Worsening Task Selection Bias

Even participating developers were selective about which tasks they submitted¹:

“In surveys, 30–50% of developers reported not submitting some tasks because they wouldn’t want to do them without AI.”

One developer’s testimony vividly illustrates the problem:

“I’ve realized I’m actually doing quite biased task selection… I avoid tasks that AI could finish in 2 hours but would take me 20 hours manually. If that task got assigned to the no-AI condition, it would be really painful.”¹

The tasks where AI adds the most value were being systematically excluded from the study.
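
The direction of this bias can be illustrated with a toy simulation. Everything in it is hypothetical: the distribution of time savings, the 30% withholding rate, and the function name are assumptions chosen for illustration, not values from METR's data.

```python
import random
from statistics import mean


def simulate_selection_bias(n_tasks: int = 1000, drop_fraction: float = 0.3, seed: int = 0):
    """Compare the average time saved across all tasks with the average
    across only the tasks a developer is willing to submit.

    Each task has a hypothetical 'percent time saved with AI'. Developers
    withhold the tasks with the largest savings, since those would be the
    most painful to do without AI.
    """
    rng = random.Random(seed)
    # Hypothetical distribution: most tasks save a little, a few save a lot.
    savings = [min(95.0, rng.expovariate(1 / 15.0)) for _ in range(n_tasks)]

    n_dropped = int(n_tasks * drop_fraction)
    submitted = sorted(savings)[: n_tasks - n_dropped]  # largest savings withheld

    return mean(savings), mean(submitted)


true_avg, observed_avg = simulate_selection_bias()
print(f"all tasks: {true_avg:.1f}% saved, submitted tasks only: {observed_avg:.1f}% saved")
```

Because the withheld tasks are exactly the ones with the largest savings, the average over submitted tasks is always lower than the average over all tasks, whatever distribution is assumed.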

Problem 3: Inability to Measure Parallel Work

The emergence of agentic AI tools fundamentally changed how developers work¹:

Running multiple AI agents simultaneously while doing other work
Starting new tasks while waiting for agents to complete
Ambiguity about how to measure “parallel work” time

flowchart TB
    B1["⏱️ Traditional: Start Task A"]
    B1 --> B2["Task A Complete"]
    B2 --> B3["Start Task B"]
    B3 --> B4["Task B Complete"]

    A1["🤖 Agentic:<br>Assign Task A to Agent"]
    A1 --> A2["Start working on Task B"]
    A2 --> A3["Review Agent A's results"]
    A3 --> A4["Assign Task C to Agent"]

    B4 --> Q["⚠️ RCT assumes '1 task = 1 session'<br>but parallel work breaks this"]
    A4 --> Q

METR’s transcript analysis³ found a strong correlation between parallel agent usage and time savings. The researcher achieving the highest time savings ran an average of 2.32 main agents simultaneously, recording 11.62x time savings. Other staff members ran 1.05–1.52 agents with lower savings rates.

However, this analysis was based on 5,305 Claude Code transcripts from 7 internal METR staff members, and METR itself notes this represents a soft upper bound (actual productivity multipliers would be lower)³.
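
The measurement ambiguity described in this section can be made concrete with a small sketch. The session data below is hypothetical; the point is only that summing per-task durations and measuring elapsed wall-clock time give different answers once tasks overlap.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_minute, end_minute)


def summed_task_minutes(intervals: List[Interval]) -> float:
    """Add up each task's own duration, as if tasks ran one after another."""
    return sum(end - start for start, end in intervals)


def wall_clock_minutes(intervals: List[Interval]) -> float:
    """Measure the time actually elapsed, counting overlapping periods once."""
    total = 0.0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Hypothetical session: an agent works on Task A while the developer does
# Task B by hand, then the developer reviews the agent's output.
session = [
    (0, 60),   # Task A: agent running in the background
    (5, 50),   # Task B: developer working by hand
    (60, 75),  # Task A: developer reviews the agent's result
]

print(summed_task_minutes(session))
print(wall_clock_minutes(session))
```

In this example the per-task durations add up to 120 minutes while only 75 minutes elapsed. A study that times each task separately has to decide which of the two numbers counts as "time spent."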

Correcting for Selection Bias: Where Is the “True Effect”?

A third-party statistical analysis posted on LessWrong⁴ dug into the heterogeneity of METR’s data:

Overall: ~6% speedup
Tasks predicted to benefit most from AI (predicted AI advantage of 60+ minutes): 12% speedup
Most effective developers: 25% speedup

This analysis applied a heuristic correction assuming “50% of tasks/developers were excluded by selection bias,” estimating the true speedup at approximately 20%⁴.

The confidence intervals are wide and the correction method is heuristic, so this isn’t a definitive number. But the direction is clear: METR’s measurements represent a lower bound, and the true effect is likely higher.
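
One way to sanity-check the corrected figure is a back-of-envelope mixture calculation. This is not necessarily the method used in the LessWrong analysis; it assumes that the true effect is a simple weighted average of the observed work and the excluded work, and the function name is hypothetical.

```python
def implied_excluded_speedup(observed_pct: float, true_pct: float, excluded_share: float) -> float:
    """Solve true = (1 - excluded_share) * observed + excluded_share * excluded
    for the speedup on the excluded work (all values in percent)."""
    if not 0 < excluded_share < 1:
        raise ValueError("excluded_share must be strictly between 0 and 1")
    return (true_pct - (1 - excluded_share) * observed_pct) / excluded_share


# Figures quoted in this section: ~6% observed, ~20% corrected, 50% assumed excluded.
print(implied_excluded_speedup(observed_pct=6.0, true_pct=20.0, excluded_share=0.5))
```

Under this simplified reading, a 20% true speedup would require the excluded half of the work to average about a 34% speedup, which shows how sensitive the corrected estimate is to assumptions about work that was never measured.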

However, “AI Makes You Faster” Isn’t Universal

Let’s push back against premature optimism here.

Bottleneck Migration: The Code Review Crisis

Telemetry data collected by Faros AI from over 10,000 developers⁵ shows that AI productivity gains disappear at the organizational level.

Metric	High AI Adoption vs. Low AI Adoption Teams
Tasks processed	+21%
PRs merged	+98%
Average PR size	+154%
Review time	+91%
Bugs per developer	+9%
Org-level DORA metrics	No change

flowchart TB
    A["✅ AI doubles code generation speed"] --> B["PRs increase +98%<br>Size also +154%"]
    B --> C["Review time inflates +91%"]
    C --> D["❌ Org delivery speed<br>unchanged"]

This is a textbook example of Amdahl’s Law. Coding accounts for only 25–35% of the software development lifecycle⁶. Even if coding becomes twice as fast (100% faster), the overall speedup is limited to roughly 14–21%, because the remaining 65–75% of the work is unchanged. In reality, the bottleneck simply shifted to code review.
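
The Amdahl's Law arithmetic can be checked directly. The sketch below applies the standard formula, overall speedup = 1 / ((1 - p) + p / s), where p is the share of time spent coding and s is how many times faster coding becomes; the function name is illustrative.

```python
def amdahl_overall_speedup(coding_share: float, coding_speedup: float) -> float:
    """Overall speedup when only the coding share of the work gets faster.

    coding_share: fraction of total time spent coding (0 to 1).
    coding_speedup: how many times faster coding becomes (2.0 = twice as fast).
    """
    if not 0 <= coding_share <= 1:
        raise ValueError("coding_share must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    return 1.0 / ((1.0 - coding_share) + coding_share / coding_speedup)


for share in (0.25, 0.35):
    overall = amdahl_overall_speedup(share, coding_speedup=2.0)
    print(f"coding share {share:.0%}: {(overall - 1) * 100:.0f}% faster overall")
```

With coding twice as fast, this gives an overall speedup of about 14% for a 25% coding share and about 21% for a 35% share. Even an arbitrarily large coding speedup is bounded by 1 / (1 - p), roughly 1.33x to 1.54x for these shares.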

Macroeconomic Data Shows No Change

Philipp Dubach’s comprehensive analysis⁶ highlights the silence in macro data:

Apollo Global Management chief economist Torsten Slok: “AI is everywhere except in the macroeconomic data”
NBER’s February 2026 survey: Over 80% of firms reported no productivity impact from AI in the past 3 years
Expected improvement over the next 3 years: 1.4%
2024 Nobel economics laureate Daron Acemoglu: estimates that AI will add only about 0.5% to total factor productivity over the next decade

Despite a 92.6% adoption rate, no change is measurable at the organizational or economic level.

Code Quality Concerns

Data on AI-generated code quality is not reassuring either⁶:

Veracode: 45% of AI-generated code contains OWASP Top 10 vulnerabilities
CodeRabbit: AI-generated code has 2.74x more security vulnerabilities
Black Duck 2026 OSSRA Report: Vulnerabilities per codebase increased +107% year-over-year (280 → 581)
AI-generated code contains 1.7x more issues than human-written code

The Real Question METR Is Asking

Synthesizing the discussion above, a clear picture emerges:

flowchart TB
    T["✅ Task Level<br>Evidence of improvement<br>(METR follow-up, Google RCT)"]
    TM["⚠️ Team Level<br>Volume increases but review<br>becomes bottleneck (Faros AI)"]
    O["❌ Org Level<br>Delivery speed & quality<br>unchanged (DORA 2025)"]
    MA["❌ Macro Level<br>Economy-wide productivity<br>unchanged (NBER 2026)"]

    T --> TM --> O --> MA

The true contribution of METR’s research isn’t the “19% slower” number itself. It’s revealing the far more complex reality behind the simple narrative that “AI makes developers faster.”

What We Can Say as of March 2026

Based on the timeline of METR’s research, the following represents the most honest assessment at this point.

Near-Certainties

METR Study 1’s “19% slower” is unsuitable for generalization due to selection bias and specific conditions. METR itself acknowledges this¹.
At the task level, AI is making many developers faster. However, the magnitude of the effect depends heavily on context.
A gap exists between perception and reality. Developers tend to overestimate AI’s actual impact. The 39-point perception gap was confirmed in the follow-up study.

Accumulating Evidence

Parallel agent usage dramatically changes outcomes. Measuring productivity in the traditional “one task → complete → next task” pattern likely fails to capture the true value of agentic AI³.
At the organizational level, the bottleneck shifts to code review. Individual speed gains don’t translate directly to organizational improvement⁵.

Still Unknown

AI’s true task-level effect. The bias-corrected estimate is approximately 20% speedup⁴, but confidence intervals are wide.
Long-term macroeconomic effects. Like 1990s IT investment, there may be a delayed impact that hasn’t yet materialized.

Your Experience Is Probably Right — With Three Caveats

If you feel that “AI is clearly making me faster,” your experience is likely not wrong. METR itself states that “as of early 2026, developers are likely faster with AI tools”¹.

However, keep these points in mind:

The perception-reality gap still exists. Don’t over-trust your own sense of improvement — measure it when possible.
You being faster doesn’t mean your team or organization is faster. Pay attention to where bottlenecks migrate.
Watch AI-generated code quality. Ensure you’re not trading speed for increased security risk.
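
For the first caveat, a rough self-measurement might look like the sketch below. It assumes you keep a simple log of task durations with and without AI; the sample numbers and the "felt" estimate are hypothetical. Unless tasks are assigned to each condition at random, as in METR's design, the comparison is exposed to the same task selection bias discussed earlier.

```python
from math import exp, log
from statistics import mean


def time_change_pct(with_ai_minutes, without_ai_minutes):
    """Percent change in typical task time with AI relative to without.

    Uses the ratio of geometric means so that one very long task does not
    dominate. Negative values mean tasks were faster with AI.
    """
    geo_with = exp(mean(log(m) for m in with_ai_minutes))
    geo_without = exp(mean(log(m) for m in without_ai_minutes))
    return (geo_with / geo_without - 1.0) * 100.0


# Hypothetical log of comparable tasks, in minutes.
with_ai = [30, 45, 60]
without_ai = [40, 60, 80]

measured = time_change_pct(with_ai, without_ai)
felt = -40.0  # hypothetical gut feeling: "about 40% less time"

print(f"measured change in task time: {measured:.0f}%")
print(f"gap between feeling and measurement: {abs(felt - measured):.0f} points")
```

Logging the estimate before looking at the measured number keeps the comparison honest, in the same way the study asked developers for predictions up front.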

Research isn’t “wrong,” and neither is your experience. The problem is that both represent partial truths.

The Coding Agent Feature Race — Claude Code Leads, the Industry Follows - AI coding tool feature comparison
The More You Use It, The Less You Can — The AI Deskilling Paradox in Evidence - How AI degrades skills
Automation Bias — Why We Can’t Spot AI’s Mistakes - The cognitive bias of over-trusting AI output
The Truth Behind Experts Who “Just Hand Everything to AI” - Expert AI usage patterns
The AI Delegation Paradox: Why Passive Tools Create Active Humans - The paradoxical effects of AI delegation

References

References are listed in the order they appear in the text.

1. We are Changing our Developer Productivity Experiment Design - METR (2026). Joel Becker, Nate Rush, Tom Cunningham, David Rein, Khalid Mahamud. [Reliability: High]
2. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - METR (2025). [Reliability: High]
3. Analyzing coding agent transcripts to upper bound productivity gains from AI agents - METR (2026). Amy Deng. Research note. [Reliability: Medium–High]
4. Assessing heterogeneity in METR’s late 2025 developer productivity experiment - LessWrong (2026). Third-party statistical analysis of METR Study 2 data. [Reliability: Medium]
5. The AI Productivity Paradox Report 2025 - Faros AI (2025). Analysis based on telemetry data from 10,000+ developers. [Reliability: Medium–High]
6. AI Coding Productivity Paradox: 93% Adoption, 10% Gains - Philipp D. Dubach (2026). Comprehensive analysis integrating multiple studies. [Reliability: Medium]

Additional References (not cited by number in the text)

The reality of AI-Assisted software engineering productivity - Addy Osmani (2025). Comprehensive review citing Google RCT (21% faster) and others. [Reliability: Medium–High]
My Participation in the METR AI Productivity Study - Domenic Denicola (2025). Reflection from a Study 1 participant. [Reliability: Medium–High]
DORA Report 2025 - Google Cloud (2025). Comprehensive survey of software delivery metrics. [Reliability: High]
AI 2025 Stack Overflow Developer Survey (https://survey.stackoverflow.co/2025/ai) - Stack Overflow (2025). AI favorability dropped from 70% to 60%. [Reliability: High]

This post is licensed under CC BY 4.0 by the author.