--json additions,deletions,createdAt,mergedAt \
| jq '
  group_by(.mergedAt[0:7]) |
  map({
    month: .[0].mergedAt[0:7],
    avg_size: (map(.additions + .deletions) | add / length),
    pr_count: length,
    avg_cycle_hours: (
      map(
        ((.mergedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600
      ) | add / length
    )
  })
'
```
Most teams that run this see a clean inflection point: in the month AI tooling rolled out, PR size jumps. Cycle time follows within a month or two as reviewers start to feel the pressure.
## The Fix Nobody Is Implementing
The instinct, when you see this data, is to tell reviewers to review faster. That's the wrong fix. Faster reviews of large PRs produce worse outcomes, not better ones.
The right fix is upstream: constrain what the AI ships.
This is a CLAUDE.md change:
```markdown
## PR Scope Guidelines

- Target PR size: ≤400 lines of additions
- Hard limit: 800 lines of additions
- When a task requires more than 800 lines:
  - Split the implementation into sequential, independently-deployable PRs
  - Each PR should leave the system in a working state
  - Describe the decomposition plan in the first PR's description
- Never include unrelated refactors or cleanups in a feature PR
- Never combine frontend and backend changes unless they must ship atomically
```
Add it to your repository's CLAUDE.md. Every session that operates in this repository will see it. The AI will naturally decompose large tasks into smaller PRs rather than shipping one 2,000-line monolith.
You can also add a hook to warn before large diffs are created. Add this to your Claude Code settings:
```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "bash -c 'if LINES=$(git diff --cached --numstat 2>/dev/null | awk \"{sum += \\$1} END {print sum + 0}\"); then if [ \"$LINES\" -gt 800 ]; then echo \"WARNING: Staged diff is $LINES lines. Consider splitting into smaller PRs.\"; fi; fi'"
          }
        ]
      }
    ]
  }
}
```
This won't block large commits — it warns about them. That warning, appearing in the agent's context at the moment it would create a large commit, prompts it to reconsider scope before you ever see the PR.
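If the one-liner gets unwieldy, the same check reads more clearly as a small standalone script invoked from the hook's `command` field. This is a sketch, not a Claude Code convention — the `check_staged_size` name and any path you save it under (e.g. somewhere like `.claude/hooks/`) are illustrative:

```shell
#!/usr/bin/env bash
# check-diff-size.sh -- warn when the staged diff grows past a size limit.
# Sums the additions column of `git diff --cached --numstat`, matching the
# 800-line hard limit from the CLAUDE.md guidelines above.

check_staged_size() {
  local limit="${1:-800}"
  local lines
  # --numstat prints "<added><TAB><deleted><TAB><path>" per staged file;
  # summing $1 counts additions only. `sum + 0` yields 0 on an empty diff.
  lines=$(git diff --cached --numstat 2>/dev/null \
    | awk '{sum += $1} END {print sum + 0}')
  if [ "${lines:-0}" -gt "$limit" ]; then
    echo "WARNING: Staged diff is $lines lines. Consider splitting into smaller PRs."
  fi
}

check_staged_size "$@"
```

The limit is an optional argument, so the same script can back a softer warning threshold (say, 400) in a second hook without duplication.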
## Why Self-Reported Metrics Are Lying to You
The 59% of developers who report AI improved their code quality aren't lying. They're describing their experience accurately.
The code they write feels better. It's more complete. It handles edge cases they would have missed. It has tests they would have skipped at 5 PM on a Friday. The individual unit of output — the piece of code they hand to AI and get back — is genuinely higher quality than what they would have written alone.
What they're not measuring is what happens to that code downstream.
It enters a review queue that's becoming harder to work through. It gets reviewed by engineers who are also overwhelmed. It ships with bugs that might have been caught by a reviewer who had enough bandwidth to read carefully. It creates incident retrospectives that don't connect the dots to the upstream AI adoption that made the PR too big to review.
The experience of writing code improved. The system-level outcome quietly degraded.
This is the hardest part of managing AI adoption: the feedback loops are long and indirect. The developer who shipped the 1,200-line PR never sees the production bug their reviewer missed at line 847. They see the PR approved and feel successful. The system sees a bug and doesn't trace it back to scope.
## The Teams Getting This Right
The engineering teams with the best post-AI outcomes aren't the ones who gave their developers the best AI tools. They're the ones who redesigned the workflow around those tools.
Stripe's "toolshed" approach — documented in their engineering blog — includes explicit PR size standards for AI-generated code. StrongDM's Software Factory sidesteps the review problem entirely by replacing human review with an automated test harness that runs thousands of scenarios per hour. Both are solving the same underlying problem: the AI's natural output granularity doesn't match the human review process.
You don't have to build a Software Factory or a toolshed to fix this. You need two things:

1. A CLAUDE.md file that sets PR size expectations. This is a 10-minute change that constrains every AI session operating in your repo.
2. Cycle time tracking by PR size bucket. Split your PRs into small (<400 lines), medium (400–800 lines), and large (>800 lines), and measure cycle time and post-merge bug rate for each bucket. The data will show you what your reviewers already know intuitively.
The second change gives you the signal you need to prove the first change is working. Run it for six weeks after updating CLAUDE.md. The distribution should shift left — more small PRs, fewer large ones. Cycle time should follow.
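The bucketed measurement is a variation on the `gh`/`jq` query from the top of this piece. A sketch, assuming the same 400/800-line thresholds; the `bucketize` function name is made up for illustration:

```shell
# bucketize: group merged PRs into size buckets and report count and
# average cycle time (hours from createdAt to mergedAt) per bucket.
# Pipe in the output of the earlier query, e.g.:
#   gh pr list --state merged --json additions,deletions,createdAt,mergedAt | bucketize
bucketize() {
  jq '
    map(. + {bucket: (
      (.additions + .deletions) as $size |
      if $size < 400 then "small"
      elif $size <= 800 then "medium"
      else "large" end
    )}) |
    group_by(.bucket) |
    map({
      bucket: .[0].bucket,
      pr_count: length,
      avg_cycle_hours: (
        map(((.mergedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600)
        | add / length
      )
    })
  '
}
```

Run it before and after the CLAUDE.md change and compare the per-bucket counts: the shift-left should show up as `pr_count` growing in `small` and shrinking in `large`.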
## The Bottom Line
The 2025 DORA report is telling you something specific: the bottleneck moved. For years, the bottleneck in software development was writing code. AI has genuinely, measurably improved that step. But a faster input to a fixed-capacity review queue doesn't produce a faster output — it produces a backup.
The 91% increase in code review time isn't a failure of AI adoption. It's a sign of incomplete AI adoption — teams that automated the writing step without redesigning the review step to match.
The fix is not complicated. It's a few lines in CLAUDE.md and a change in how you measure success: not just "how fast are we shipping code" but "how fast are those PRs moving through review, and what's escaping to production."
Shipping faster only matters if you're shipping well.
Related: Your AI Does Exactly What You Ask. That's the Problem. — How Claude Code hooks prevent the AI from taking actions your reviewers should have caught.