--json additions,deletions,createdAt,mergedAt \
| jq '
  group_by(.mergedAt[0:7]) |
  map({
    month: .[0].mergedAt[0:7],
    avg_size: (map(.additions + .deletions) | add / length),
    pr_count: length,
    avg_cycle_hours: (
      map(
        ((.mergedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600
      ) | add / length
    )
  })
'
```
Most teams that run this see a clean inflection point: in the month AI tooling rolled out, PR size jumps. Cycle time follows within a month or two as reviewers start to feel the pressure.
## The Fix Nobody Is Implementing
The instinct, when you see this data, is to tell reviewers to review faster. That's the wrong fix. Faster reviews of large PRs produce worse outcomes, not better ones.
The right fix is upstream: constrain what the AI ships.
This is a CLAUDE.md change:
```markdown
## PR Scope Guidelines

- Target PR size: ≤400 lines of additions
- Hard limit: 800 lines of additions
- When a task requires more than 800 lines:
  - Split the implementation into sequential, independently-deployable PRs
  - Each PR should leave the system in a working state
  - Describe the decomposition plan in the first PR's description
- Never include unrelated refactors or cleanups in a feature PR
- Never combine frontend and backend changes unless they must ship atomically
```
Add it to your repository's CLAUDE.md. Every session that operates in this repository will see it. The AI will naturally decompose large tasks into smaller PRs rather than shipping one 2,000-line monolith.
You can also add a hook to warn before large diffs are created. Add this to your Claude Code settings:
```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "bash -c 'if LINES=$(git diff --cached --numstat 2>/dev/null | awk \"{sum += \\$1} END {print sum + 0}\"); then if [ \"$LINES\" -gt 800 ]; then echo \"WARNING: Staged diff is $LINES lines. Consider splitting into smaller PRs.\"; fi; fi'"
          }
        ]
      }
    ]
  }
}
```
This won't block large commits — it warns about them. That warning, appearing in the agent's context at the moment it would create a large commit, prompts it to reconsider scope before you ever see the PR.
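If the one-liner gets unwieldy, the same check reads more clearly as a small standalone script invoked from the hook's `command` field. This is a sketch, not a Claude Code convention — the `check_staged_size` name and any path you save it under (e.g. somewhere like `.claude/hooks/`) are illustrative:

```shell
#!/usr/bin/env bash
# check-diff-size.sh -- warn when the staged diff grows past a size limit.
# Sums the additions column of `git diff --cached --numstat`, matching the
# 800-line hard limit from the CLAUDE.md guidelines above.

check_staged_size() {
  local limit="${1:-800}"
  local lines
  # --numstat prints "<added><TAB><deleted><TAB><path>" per staged file;
  # summing $1 counts additions only. `sum + 0` yields 0 on an empty diff.
  lines=$(git diff --cached --numstat 2>/dev/null \
    | awk '{sum += $1} END {print sum + 0}')
  if [ "${lines:-0}" -gt "$limit" ]; then
    echo "WARNING: Staged diff is $lines lines. Consider splitting into smaller PRs."
  fi
}

check_staged_size "$@"
```

The limit is an optional argument, so the same script can back a softer warning threshold (say, 400) in a second hook without duplication.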
## Why Self-Reported Metrics Are Lying to You
The 59% of developers who report AI improved their code quality aren't lying. They're describing their experience accurately.
The code they write feels better. It's more complete. It handles edge cases they would have missed. It has tests they would have skipped at 5 PM on a Friday. The individual unit of output — the piece of code they hand to AI and get back — is genuinely higher quality than what they would have written alone.
What they're not measuring is what happens to that code downstream.
It enters a review queue that's becoming harder to work through. It gets reviewed by engineers who are also overwhelmed. It ships with bugs that might have been caught by a reviewer who had enough bandwidth to read carefully. It creates incident retrospectives that don't connect the dots to the upstream AI adoption that made the PR too big to review.
The experience of writing code improved. The system-level outcome quietly degraded.
This is the hardest part of managing AI adoption: the feedback loops are long and indirect. The developer who shipped the 1,200-line PR never sees the production bug their reviewer missed at line 847. They see the PR approved and feel successful. The system sees a bug and doesn't trace it back to scope.
## The Teams Getting This Right
The engineering teams with the best post-AI outcomes aren't the ones who gave their developers the best AI tools. They're the ones who redesigned the workflow around those tools.
Stripe's "toolshed" approach — documented in their engineering blog — includes explicit PR size standards for AI-generated code. StrongDM's Software Factory sidesteps the review problem entirely by replacing human review with an automated test harness that runs thousands of scenarios per hour. Both are solving the same underlying problem: the AI's natural output granularity doesn't match the human review process.
You don't have to build a Software Factory or a toolshed to fix this. You need two things:

1. A CLAUDE.md file that sets PR size expectations. This is a 10-minute change that constrains every AI session operating in your repo.
2. Cycle time tracking by PR size bucket. Split your PRs into small (<400 lines), medium (400–800 lines), and large (>800 lines), and measure cycle time and post-merge bug rate for each bucket. The data will show you what your reviewers already know intuitively.
The second change gives you the signal you need to prove the first change is working. Run it for six weeks after updating CLAUDE.md. The distribution should shift left — more small PRs, fewer large ones. Cycle time should follow.
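The bucketed measurement is a variation on the `gh`/`jq` query from the top of this piece. A sketch, assuming the same 400/800-line thresholds; the `bucketize` function name is made up for illustration:

```shell
# bucketize: group merged PRs into size buckets and report count and
# average cycle time (hours from createdAt to mergedAt) per bucket.
# Pipe in the output of the earlier query, e.g.:
#   gh pr list --state merged --json additions,deletions,createdAt,mergedAt | bucketize
bucketize() {
  jq '
    map(. + {bucket: (
      (.additions + .deletions) as $size |
      if $size < 400 then "small"
      elif $size <= 800 then "medium"
      else "large" end
    )}) |
    group_by(.bucket) |
    map({
      bucket: .[0].bucket,
      pr_count: length,
      avg_cycle_hours: (
        map(((.mergedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600)
        | add / length
      )
    })
  '
}
```

Run it before and after the CLAUDE.md change and compare the per-bucket counts: the shift-left should show up as `pr_count` growing in `small` and shrinking in `large`.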
## The Bottom Line
The 2025 DORA report is telling you something specific: the bottleneck moved. For years, the bottleneck in software development was writing code. AI has genuinely, measurably improved that step. But a faster input to a fixed-capacity review queue doesn't produce a faster output — it produces a backup.
The 91% increase in code review time isn't a failure of AI adoption. It's a sign of incomplete AI adoption — teams that automated the writing step without redesigning the review step to match.
The fix is not complicated. It's a few lines in CLAUDE.md and a change in how you measure success: not just "how fast are we shipping code" but "how fast are those PRs moving through review, and what's escaping to production."
Shipping faster only matters if you're shipping well.
Related: Your AI Does Exactly What You Ask. That's the Problem. — How Claude Code hooks prevent the AI from taking actions your reviewers should have caught.