After "AI Makes You 19% Slower" — The Selection Bias METR Acknowledged, and the Evolving Truth About Productivity

  • Target audience: Software engineers, developers who use AI tools daily
  • Prerequisites: Experience using AI tools like GitHub Copilot, Cursor, Claude Code
  • Reading time: 15–20 minutes

Overview

“Experienced developers become 19% slower when using AI” — this finding, published by METR (Model Evaluation & Threat Research) in July 2025, was widely cited as evidence that AI coding tools might not deliver on their promises.

Then, on February 24, 2026, METR itself announced a change in experimental design [1]. The reason: selection bias. In surveys, 30–50% of developers reported withholding some tasks because they would not want to do them without AI, which undermines the reliability of the estimates.

This article examines both the original study (run February–June 2025, published July 2025) and the follow-up (run from August 2025, published February 2026). It attempts the most honest answer possible, as of March 2026, to the question “Does AI make developers faster or slower?”

Timeline of the METR Research

Study 1 (February–June 2025)

The METR study was conducted under the following conditions [2]:

  • Participants: 16 experienced open-source developers
  • Tasks: 246 tasks
  • Method: Randomized controlled trial (RCT) — AI usage was randomly permitted or denied for each task
  • Repositories: Projects the participants had contributed to for multiple years
  • Tools: Primarily Cursor Pro and Claude 3.5/3.7 Sonnet
  • Compensation: $150/hour
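The per-task randomization described above can be sketched in a few lines. This is only an illustration under stated assumptions, not METR's actual analysis code: the function names, the fair coin flip, the ratio-of-geometric-means estimator, and the completion times are all hypothetical.

```python
import math
import random


def assign_conditions(task_ids, seed=0):
    """Randomly allow or disallow AI for each task (assumed: a fair coin flip)."""
    rng = random.Random(seed)
    return {task_id: rng.choice(["ai_allowed", "ai_disallowed"]) for task_id in task_ids}


def geometric_mean(values):
    """Geometric mean, which suits skewed data such as task durations."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def time_change_with_ai(minutes_by_condition):
    """Ratio of geometric-mean completion times, minus 1.

    Positive means tasks took longer with AI; negative means they took less time.
    """
    with_ai = geometric_mean(minutes_by_condition["ai_allowed"])
    without_ai = geometric_mean(minutes_by_condition["ai_disallowed"])
    return with_ai / without_ai - 1.0


# Hypothetical completion times in minutes, invented for illustration only.
example = {
    "ai_allowed": [119.0, 238.0, 59.5],
    "ai_disallowed": [100.0, 200.0, 50.0],
}

print(assign_conditions(["task-1", "task-2", "task-3"], seed=1))
print(f"Change in completion time with AI: {time_change_with_ai(example):+.0%}")
```

Because the coin flip, not the developer, decides which tasks get AI, differences between the two groups can be attributed to AI access rather than to which tasks people chose to use it on.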

Key Findings

  1. Using AI made tasks take 19% longer (confidence interval: +2% to +39%)
  2. Developers predicted they would be “24% faster” beforehand
  3. Even after being slower, they believed they had been “20% faster”

A 39-point gap between perception (20% faster) and reality (19% slower): this was the study’s most striking number, and the one most widely reported.

Study 2 (August 2025 onward)

METR scaled up and replicated the experiment [1]:

  • Participants: 57 (10 from the original study + 47 new)
  • Tasks: 800+
  • Repositories: 143 (more diverse — including smaller, greenfield, and less mature projects)
  • Tools: Latest AI tools (including agentic tools like Claude Code and Codex)
  • Compensation: $50/hour (one-third of Study 1)

Results

| Cohort | Estimated effect of AI on task time (negative = faster) | Confidence interval |
| --- | --- | --- |
| Study 1 participants (continuing) | -18% (18% faster) | -38% to +9% |
| New participants | -4% (4% faster) | -15% to +9% |
| Study 1 (reference) | +19% (19% slower) | +2% to +39% |

Looking at the numbers alone, this appears to be a dramatic improvement from “19% slower” to “18% faster.” However, both new confidence intervals include zero, and METR does not take these results at face value.
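One quick way to read the table above is to check which intervals exclude zero. The snippet below is a small sketch using only the figures reported in the table; the helper name `excludes_zero` is my own.

```python
# Figures from the results table: (point estimate, interval low, interval high).
# Negative values mean tasks finished faster with AI.
estimates = {
    "Study 1 participants (continuing)": (-0.18, -0.38, 0.09),
    "New participants": (-0.04, -0.15, 0.09),
    "Study 1 (reference)": (0.19, 0.02, 0.39),
}


def excludes_zero(low, high):
    """True when the interval lies entirely on one side of zero."""
    return low > 0 or high < 0


for cohort, (point, low, high) in estimates.items():
    verdict = "excludes zero" if excludes_zero(low, high) else "includes zero"
    print(f"{cohort}: {point:+.0%} [{low:+.0%}, {high:+.0%}] -> {verdict}")
```

Only the original Study 1 interval stays on one side of zero, which is one reason the follow-up point estimates alone do not settle the direction of the effect.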

Three Problems METR Itself Acknowledged

Problem 1: Developers Refused to Work Without AI

Throughout 2025, the adoption of agentic tools like Claude Code and Codex expanded rapidly. This created a significant shift in study participation [1]:

“The fraction of developers saying they don’t want to do 50% of their work without AI is increasing. This is despite our study paying $50/hour to work on tasks they enjoy.”

In other words, the developers who benefit most from AI are the least likely to participate. This biases the study’s estimated speedup downward.

Problem 2: Worsening Task Selection Bias

Even participating developers were selective about which tasks they submitted [1]:

“In surveys, 30–50% of developers reported not submitting some tasks because they wouldn’t want to do them without AI.”

One developer’s testimony vividly illustrates the problem:

“I’ve realized I’m actually doing quite biased task selection… I avoid tasks that AI could finish in 2 hours but would take me 20 hours manually. If that task got assigned to the no-AI condition, it would be really painful.” [1]

The tasks where AI adds the most value were being systematically excluded from the study.
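A toy simulation makes the direction of this bias concrete. Everything in it is an assumption made for illustration: the uniform distribution of time savings, the 40% withholding threshold, and the function name. It is not a model of METR's data.

```python
import random


def simulate_task_selection(n_tasks=10_000, withhold_above=0.4, seed=42):
    """Compare the true mean AI time saving with the mean over submitted tasks only."""
    rng = random.Random(seed)
    all_savings = []
    submitted_savings = []
    for _ in range(n_tasks):
        # Hypothetical fraction of time AI saves on this task (0% to 60%).
        saving = rng.uniform(0.0, 0.6)
        all_savings.append(saving)
        # Assumed behavior: developers withhold tasks where working without AI
        # would hurt the most, i.e. tasks with the largest savings.
        if saving <= withhold_above:
            submitted_savings.append(saving)
    true_mean = sum(all_savings) / len(all_savings)
    observed_mean = sum(submitted_savings) / len(submitted_savings)
    return true_mean, observed_mean


true_mean, observed_mean = simulate_task_selection()
print(f"Mean saving over all tasks:       {true_mean:.1%}")
print(f"Mean saving over submitted tasks: {observed_mean:.1%}")
```

With these made-up parameters the mean over all tasks is near 30%, while the mean over submitted tasks is near 20%. The study can only see the second number.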

Problem 3: Inability to Measure Parallel Work

The emergence of agentic AI tools fundamentally changed how developers work [1]:

  • Running multiple AI agents simultaneously while doing other work
  • Starting new tasks while waiting for agents to complete
  • Ambiguity about how to measure “parallel work” time

```mermaid
flowchart TB
    B1["⏱️ Traditional: Start Task A"]
    B1 --> B2["Task A Complete"]
    B2 --> B3["Start Task B"]
    B3 --> B4["Task B Complete"]

    A1["🤖 Agentic:<br>Assign Task A to Agent"]
    A1 --> A2["Start working on Task B"]
    A2 --> A3["Review Agent A's results"]
    A3 --> A4["Assign Task C to Agent"]

    B4 --> Q["⚠️ RCT assumes '1 task = 1 session'<br>but parallel work breaks this"]
    A4 --> Q
```

METR’s transcript analysis [3] found a strong correlation between parallel agent usage and time savings. The researcher with the highest time savings ran an average of 2.32 main agents simultaneously and recorded 11.62x time savings. Other staff members ran 1.05–1.52 agents and saw lower savings.

However, this analysis was based on 5,305 Claude Code transcripts from 7 internal METR staff members, and METR itself notes this represents a soft upper bound (actual productivity multipliers would be lower) [3].
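The measurement ambiguity around parallel work can be shown with a small sketch. Summing per-task durations and measuring elapsed wall-clock time give different answers as soon as tasks overlap. The session data and function names below are hypothetical.

```python
def summed_minutes(intervals):
    """Add up each task's duration as if the tasks ran one after another."""
    return sum(end - start for start, end in intervals)


def wall_clock_minutes(intervals):
    """Length of the union of the intervals: time that actually elapsed."""
    total = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Hypothetical session, in minutes since the developer sat down.
session = {
    "task_a_agent": (0, 60),    # delegated to an agent
    "task_b_manual": (5, 50),   # done by hand while the agent runs
    "task_c_agent": (55, 90),   # delegated after reviewing task A
}
intervals = list(session.values())

print("Summed task time:", summed_minutes(intervals), "minutes")
print("Wall-clock time: ", wall_clock_minutes(intervals), "minutes")
```

In this made-up session the per-task timers add up to 140 minutes although only 90 minutes passed. Which of the two counts as "time spent on the task" is exactly what a one-task-per-session design leaves undefined.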

Correcting for Selection Bias: Where Is the “True Effect”?

A third-party statistical analysis posted on LessWrong [4] dug into the heterogeneity of METR’s data:

  • Overall: ~6% speedup
  • Tasks predicted to benefit most from AI (predicted AI advantage of 60+ minutes): 12% speedup
  • Most effective developers: 25% speedup

This analysis applied a heuristic correction assuming “50% of tasks/developers were excluded by selection bias,” estimating the true speedup at approximately 20% [4].

The confidence intervals are wide and the correction method is heuristic, so this isn’t a definitive number. But the direction is clear: METR’s measurements represent a lower bound, and the true effect is likely higher.

However, “AI Makes You Faster” Isn’t Universal

Let’s push back against premature optimism here.

Bottleneck Migration: The Code Review Crisis

Telemetry data collected by Faros AI from over 10,000 developers [5] shows that AI productivity gains disappear at the organizational level.

| Metric | High AI adoption vs. low AI adoption teams |
| --- | --- |
| Tasks processed | +21% |
| PRs merged | +98% |
| Average PR size | +154% |
| Review time | +91% |
| Bugs per developer | +9% |
| Org-level DORA metrics | No change |

```mermaid
flowchart TB
    A["✅ AI doubles code generation speed"] --> B["PRs increase +98%<br>Size also +154%"]
    B --> C["Review time inflates +91%"]
    C --> D["❌ Org delivery speed<br>unchanged"]
```

This is a textbook example of Amdahl’s Law. Coding accounts for only 25–35% of the software development lifecycle [6]. Even if coding speed doubles (becomes 100% faster), the overall improvement is limited to roughly 15–25%, because the other 65–75% of the work takes just as long as before. In reality, the bottleneck simply shifted to code review.
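The Amdahl’s Law arithmetic can be checked directly. The sketch below assumes that “100% faster” means coding takes half the time and that every other part of the lifecycle is unchanged; the function name is my own.

```python
def overall_speedup(coding_share, coding_speedup):
    """Amdahl's Law: whole-process speedup when only the coding share gets faster.

    coding_share   -- fraction of total time spent coding (0 to 1)
    coding_speedup -- how many times faster coding becomes (2.0 = half the time)
    """
    return 1.0 / ((1.0 - coding_share) + coding_share / coding_speedup)


for share in (0.25, 0.35):
    speedup = overall_speedup(share, coding_speedup=2.0)
    print(f"Coding is {share:.0%} of the work: overall {speedup - 1:.1%} faster")
```

Under these assumptions the overall gain comes out at about 14% and 21% for the two shares, in the same ballpark as the 15–25% range quoted above.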

Macroeconomic Data Shows No Change

Philipp Dubach’s comprehensive analysis [6] highlights the silence in macro data:

  • Apollo Global Management chief economist Torsten Slok: “AI is everywhere except in the macroeconomic data”
  • NBER’s February 2026 survey: Over 80% of firms reported no productivity impact from AI in the past 3 years
  • Expected improvement over the next 3 years: 1.4%
  • 2024 Nobel economics laureate Daron Acemoglu: projects AI-driven total factor productivity growth of 0.5% over the next decade

Despite a 92.6% adoption rate, no change is measurable at the organizational or economic level.

Code Quality Concerns

Data on AI-generated code quality is not reassuring either [6]:

  • Veracode: 45% of AI-generated code contains OWASP Top 10 vulnerabilities
  • CodeRabbit: AI-generated code has 2.74x more security vulnerabilities
  • Black Duck 2026 OSSRA Report: Vulnerabilities per codebase increased +107% year-over-year (280 → 581)
  • AI-generated code contains 1.7x more issues than human-written code

The Real Question METR Is Asking

Synthesizing the discussion above, a clear picture emerges:

```mermaid
flowchart TB
    T["✅ Task Level<br>Evidence of improvement<br>(METR follow-up, Google RCT)"]
    TM["⚠️ Team Level<br>Volume increases but review<br>becomes bottleneck (Faros AI)"]
    O["❌ Org Level<br>Delivery speed & quality<br>unchanged (DORA 2025)"]
    MA["❌ Macro Level<br>Economy-wide productivity<br>unchanged (NBER 2026)"]

    T --> TM --> O --> MA
```

The true contribution of METR’s research isn’t the “19% slower” number itself. It’s revealing the far more complex reality behind the simple narrative that “AI makes developers faster.”

What We Can Say as of March 2026

Based on the timeline of METR’s research, the following represents the most honest assessment at this point.

Near-Certainties

  1. METR Study 1’s “19% slower” is unsuitable for generalization due to selection bias and specific conditions. METR itself acknowledges this [1].
  2. At the task level, AI is making many developers faster. However, the magnitude of the effect depends heavily on context.
  3. A gap exists between perception and reality. Developers tend to overestimate AI’s actual impact. The 39-point perception gap was confirmed in the follow-up study.

Accumulating Evidence

  1. Parallel agent usage dramatically changes outcomes. Measuring productivity in the traditional “one task → complete → next task” pattern likely fails to capture the true value of agentic AI [3].
  2. At the organizational level, the bottleneck shifts to code review. Individual speed gains don’t translate directly to organizational improvement [5].

Still Unknown

  1. AI’s true task-level effect. The bias-corrected estimate is approximately 20% speedup [4], but confidence intervals are wide.
  2. Long-term macroeconomic effects. Like 1990s IT investment, there may be a delayed impact that hasn’t yet materialized.

Your Experience Is Probably Right — With Three Caveats

If you feel that “AI is clearly making me faster,” your experience is likely not wrong. METR itself states that “as of early 2026, developers are likely faster with AI tools” [1].

However, keep these points in mind:

  1. The perception-reality gap still exists. Don’t over-trust your own sense of improvement — measure it when possible.
  2. You being faster doesn’t mean your team or organization is faster. Pay attention to where bottlenecks migrate.
  3. Watch AI-generated code quality. Ensure you’re not trading speed for increased security risk.
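For the first caveat, one low-effort way to measure rather than guess is to write down an estimate before each task and compare it with the clock afterwards. The sketch below assumes a hand-kept log; the entries and the function name are invented for illustration.

```python
from statistics import median

# Hypothetical personal log, invented for illustration:
# (condition, minutes estimated before starting, minutes actually taken)
task_log = [
    ("ai", 30, 45),
    ("ai", 60, 75),
    ("ai", 20, 30),
    ("no_ai", 40, 44),
    ("no_ai", 90, 90),
    ("no_ai", 30, 36),
]


def median_overrun(entries, condition):
    """Median of actual / estimated time for one condition.

    1.0 means estimates were accurate; above 1.0 means tasks ran long.
    """
    ratios = [actual / estimated for cond, estimated, actual in entries if cond == condition]
    return median(ratios)


for condition in ("ai", "no_ai"):
    overrun = median_overrun(task_log, condition)
    print(f"{condition}: tasks took {overrun:.2f}x the estimate")
```

A log like this will not give a causal estimate, since you choose which tasks get AI, but it does show whether your sense of speed matches the clock.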

Research isn’t “wrong,” and neither is your experience. The problem is that both represent partial truths.

References

References are listed in the order they appear in the text.

  1. We are Changing our Developer Productivity Experiment Design - METR (2026). Joel Becker, Nate Rush, Tom Cunningham, David Rein, Khalid Mahamud. [Reliability: High]
  2. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - METR (2025). [Reliability: High]
  3. Analyzing coding agent transcripts to upper bound productivity gains from AI agents - METR (2026). Amy Deng. Research note. [Reliability: Medium–High]
  4. Assessing heterogeneity in METR’s late 2025 developer productivity experiment - LessWrong (2026). Third-party statistical analysis of METR Study 2 data. [Reliability: Medium]
  5. The AI Productivity Paradox Report 2025 - Faros AI (2025). Analysis based on telemetry data from 10,000+ developers. [Reliability: Medium–High]
  6. AI Coding Productivity Paradox: 93% Adoption, 10% Gains - Philipp D. Dubach (2026). Comprehensive analysis integrating multiple studies. [Reliability: Medium]

This post is licensed under CC BY 4.0 by the author.