
Code Review Is the New Bottleneck, and Your Tooling Has Not Caught Up

AI tools generate code faster than teams can review it. GitHub's Stacked PRs signal a shift in where the real constraint lives. Here is how to rethink your review practices.

Antonio J. del Águila

Knaisoma

Sameen Karim, one of the engineers behind GitHub’s new Stacked PRs feature, put it plainly: “The bottleneck is no longer writing code, it’s reviewing it.” That sentence carries more strategic weight than most engineering leaders have yet absorbed. We have spent the last two years obsessing over how to make developers write code faster. The tools have worked. Now the queue is somewhere else.

GitHub projects 14 billion commits in 2026, compared with 1 billion last year. AI coding assistants can produce a 2,000-line diff across 40 files in seconds. The developer who used to spend an afternoon building a feature now ships the first cut before lunch. The pull request lands in the review queue, joins a dozen others, and waits. Sometimes for hours. Sometimes for days. The feature is done; the feature is not shipped. That gap is the new constraint, and most organizations are not measuring it.

On April 13, 2026, GitHub shipped Stacked PRs into private preview. The feature is technically modest: pull requests that can be based on other pull requests, forming a reviewable chain. But the fact that GitHub built it at all, and the framing they chose to announce it, is a signal worth reading carefully. GitHub does not build infrastructure for problems that do not exist at scale.

When the bottleneck moves

The mental model most engineering organizations use for delivery velocity is built around writing code. DORA metrics track deployment frequency and lead time. Incident retrospectives examine time to detect and recover. Sprint planning focuses on story points per developer. These are reasonable things to measure, but none of them capture the time a completed change spends waiting for review before it can merge.

That waiting time has become a primary drag on delivery velocity, and the acceleration of AI-assisted development has made it worse faster than most teams anticipated. When a single developer could produce 200 lines of production code per day, review queues were manageable. When that same developer is generating 800 lines per day with an AI assistant, the review burden does not scale linearly. It scales faster, because the reviewers are still human, reviewing is cognitively expensive, and the queue grows faster than capacity to clear it.
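The claim that the review queue grows faster than capacity can be made concrete with a toy model. Everything here is illustrative: the daily line counts are assumptions chosen for demonstration, not measurements from the article.

```python
# Illustrative only: a toy model of a review queue where code arrives
# faster than reviewers can clear it. The per-day figures are assumptions.

def simulate_queue(days, loc_written_per_day, loc_reviewed_per_day):
    """Track unreviewed lines of code accumulating day by day."""
    backlog = 0
    history = []
    for _ in range(days):
        backlog += loc_written_per_day                     # new code enters the queue
        backlog = max(0, backlog - loc_reviewed_per_day)   # reviewers clear what they can
        history.append(backlog)
    return history

# Before AI assistance: production roughly matches review capacity.
steady = simulate_queue(days=10, loc_written_per_day=200, loc_reviewed_per_day=200)

# After AI assistance: production quadruples, review capacity grows only modestly.
growing = simulate_queue(days=10, loc_written_per_day=800, loc_reviewed_per_day=300)

print(steady[-1])   # backlog stays flat
print(growing[-1])  # backlog compounds day over day
```

The point of the sketch is the asymmetry: writing capacity scaled with tooling, review capacity did not, so the backlog grows linearly without bound rather than reaching a steady state.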

The result is observable in any engineering organization that has adopted AI coding tools at meaningful scale: longer time-to-merge, PR queues that grow through the week and clear imperfectly on Fridays, and a quiet accumulation of unreviewed code that creates its own form of inventory risk. Code that is not reviewed is not merged. Code that is not merged is not in production. The velocity gain at the writing stage is being partially absorbed by the slowdown at the review stage, and most teams are not tracking it.

Why large PRs fail review

The cognitive science of code review is not complicated, and the data is consistent. Review effectiveness peaks between 200 and 400 lines of code, with roughly 60 minutes of review time. SmartBear’s analysis of 2,500 pull requests established this baseline, and subsequent research has confirmed the pattern repeatedly. Detection rate drops significantly as PR size increases: reviewers catch defects in roughly 87% of changes under 100 lines, 78% of changes between 100 and 300 lines, 65% between 300 and 600, 42% between 600 and 1,000, and 28% of changes above 1,000 lines. One analysis of over 10,000 PRs found that fewer than a quarter of pull requests exceeding 1,000 lines received any review comments at all.

The mechanism is not reviewer negligence; it is cognitive capacity. A reviewer approaching a 2,000-line pull request faces a genuine working memory problem. The context required to evaluate a change correctly (its purpose, its constraints, the way it interacts with adjacent systems) grows with the size of the change in ways that working memory cannot accommodate. Reviewers respond to this constraint the way humans respond to most cognitive overload: they simplify. They check that the code compiles, that the test coverage looks reasonable, that the naming is acceptable, and they approve. They do not, and often cannot, reason about whether the change is architecturally sound or whether a subtle interaction three layers deep will surface as an incident six months later.

When AI generates the implementation, the problem compounds. The code is coherent, well-structured, and passes the automated gates. But the reviewer is still being asked to evaluate 2,000 lines of logic they did not write, and that no human consciously reasoned through decision by decision. The correctness check passes. The comprehension check, if anyone performs it, often cannot.

We have seen this pattern in client engagements where teams adopted AI coding assistants aggressively in the first quarter after rollout. Review latency increased. Time-to-merge increased. The defect rate in the first month after deployment increased on code where the PRs were large. The velocity gain was real, but some of it was being borrowed from review quality rather than generated from genuine productivity improvements.

Stacked PRs and the decomposition discipline

Stacked PRs are not a new idea. Phabricator, the tool Facebook open-sourced in 2011 and deprecated in 2021, built its entire review workflow around this model. Gerrit, which Google uses internally, has native support for dependent change sets. Graphite has been offering stacked PR tooling for GitHub workflows since 2021. The concept is well understood: rather than accumulating a large change in a single branch and opening one large pull request, you break the change into a chain of smaller, logically independent units. Each unit can be reviewed in isolation. The chain merges bottom-up as each layer is approved.

What changed on April 13 is that GitHub built this capability into the platform directly. That matters for adoption. Developer tools succeed when they are available without friction in the environment developers already work in. Graphite required a workflow change and a third-party tool. GitHub Stacked PRs require neither. The gh-stack CLI is optional; you can manage stacked PRs entirely through the GitHub UI. For teams that prefer the CLI, the commands are straightforward:

# Initialize a stack from main
gh stack init feature/auth-refactor-foundation

# Add a layer to the stack
gh stack add feature/auth-refactor-token-storage

# Add another layer
gh stack add feature/auth-refactor-session-handling

# Push all branches to GitHub
gh stack push

# Open pull requests for the entire stack
gh stack submit

When you run gh stack submit, GitHub creates one pull request per layer, each targeting the branch below it in the stack. The bottom PR targets main. Reviewers see individual, focused diffs rather than one sprawling changeset. When a layer is approved and merged, GitHub automatically rebases the remaining PRs so the stack stays current. AI coding agents can be integrated directly into this workflow via npx skills add github/gh-stack, which lets agents decompose large diffs into a stack rather than producing a single large pull request for human review.

The tooling change is straightforward. The workflow discipline it requires is not. Phil Fersht, CEO of HFS Research, made this point in response to the GitHub announcement: “The constraint will not be the feature itself, but whether development teams adjust their workflow discipline to use stacking properly.” He is right. Stacked PRs are useful only if the person creating the stack has thought carefully about how to decompose the change into units that are both independent enough to be reviewable in isolation and coherent enough that each unit makes logical sense on its own. That discipline does not come automatically from the tooling. It requires deliberate practice and explicit team norms.

GitLab takes a different approach, enabling multiple MR chains through its “draft” system and MR dependencies feature. The underlying principle is the same: smaller, sequential, independently reviewable units of work are more manageable than large monolithic changes. Gerrit has always operated this way. GitHub’s move brings the approach to the largest code hosting platform in the world, at a moment when the need for it has become acute.

Rethinking review culture for the AI era

Tooling enables the practice, but the practice requires cultural and process changes that most engineering organizations have not made. Here is the framework we use with clients who are adjusting their review practices to account for AI-accelerated development.

Set an explicit PR size policy, and make it a guideline rather than a hard gate. A 300 to 500 line threshold is a useful working range. Below 300 lines, most reviewers can hold the full context in working memory. Above 500 lines, the cognitive load starts to degrade review quality noticeably. The policy should not be a CI gate that blocks merge at 501 lines; it should be a shared norm that triggers a conversation when a PR significantly exceeds the guideline. The goal is to normalize decomposition, not to punish large changes. A single migration file that is 1,200 lines of mechanically generated SQL is different from 1,200 lines of novel logic, and your policy should acknowledge the distinction.

Train developers to decompose AI-generated changes before opening a PR. When an AI assistant produces a 2,000-line implementation, the first step is not to open a pull request. The first step is to identify the layers: the data model change, the service layer, the API handler, the integration tests. Each layer is a candidate for a PR in the stack. This is a skill that requires practice, and teams that invest in it early see better outcomes than teams that wait until the review backlog forces the conversation.
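The layering step above can be sketched as a simple grouping pass over the changed files. Everything here is an assumption for illustration: the path prefixes, layer names, and the `plan_stack` helper are hypothetical, and a real decomposition would follow logical dependencies, not just directory structure.

```python
# A minimal sketch of the decomposition step: given the files touched by a
# large AI-generated change, group them into candidate stack layers by
# architectural role. Path prefixes and layer order are illustrative
# assumptions; adapt them to your repository layout.

LAYER_ORDER = ["data model", "service layer", "api handler", "tests"]

# Hypothetical mapping from path prefix to layer.
LAYER_RULES = {
    "migrations/": "data model",
    "models/": "data model",
    "services/": "service layer",
    "api/": "api handler",
    "tests/": "tests",
}

def plan_stack(changed_files):
    """Return non-empty layers in bottom-up merge order, each a candidate PR."""
    layers = {name: [] for name in LAYER_ORDER}
    for path in changed_files:
        for prefix, layer in LAYER_RULES.items():
            if path.startswith(prefix):
                layers[layer].append(path)
                break
    return [(name, files) for name, files in layers.items() if files]

stack = plan_stack([
    "models/session.py",
    "migrations/0042_add_token_table.sql",
    "services/token_storage.py",
    "api/auth_handler.py",
    "tests/test_token_storage.py",
])
for name, files in stack:
    print(name, files)
```

Each resulting group maps to one branch in the stack, with the data model at the bottom so the chain merges in dependency order.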

Integrate automated pre-review checks to reduce human reviewer load on mechanical issues. When a pull request arrives at a human reviewer already flagged for size, missing tests, obvious style violations, and suspicious security patterns, automation has absorbed exactly the checks where it is reliable, freeing the reviewer’s attention for the judgments that require human understanding. A minimal GitHub Actions configuration for PR size checking:

# .github/workflows/pr-size-check.yml
name: PR Size Check

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  size-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Check PR size
        run: |
          TOTAL=$(gh pr view ${{ github.event.pull_request.number }} \
            --json additions,deletions --jq '.additions + .deletions')

          echo "PR size: $TOTAL lines changed"

          if [ "$TOTAL" -gt 500 ]; then
            echo "::warning::PR exceeds 500 lines ($TOTAL lines)."
            echo "::warning::Consider splitting into a stack using: gh stack init <branch-name>"
            # Warning only -- does not block merge.
            # To enforce as a hard gate, emit ::error:: and exit 1 here.
          fi
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Establish review SLAs tied to PR size, and publish them. Small PRs should carry a turnaround commitment: 4 hours for PRs under 200 lines, 24 hours for PRs under 500 lines. When reviewers know that a small PR carries a short turnaround expectation, the incentive structure shifts toward creating small PRs. This is not process overhead; it is a feedback loop that rewards the behavior you want.
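The size-tiered SLA can be expressed as a small lookup. The tiers mirror the numbers in the text (4 hours under 200 lines, 24 hours under 500); the function names are illustrative, not part of any real tooling.

```python
# A sketch of a size-tiered review SLA. Tiers are the ones named in the
# text; anything over the guideline returns None, signalling that the PR
# should trigger a decomposition conversation rather than a clock.

from datetime import datetime, timedelta

def review_sla(lines_changed):
    """Return the review turnaround commitment for a PR of this size."""
    if lines_changed < 200:
        return timedelta(hours=4)
    if lines_changed < 500:
        return timedelta(hours=24)
    return None  # over the guideline: no SLA, start a conversation instead

def is_breaching(opened_at, lines_changed, now):
    """True if the PR has waited longer than its SLA without review."""
    sla = review_sla(lines_changed)
    return sla is not None and now - opened_at > sla

opened = datetime(2026, 4, 13, 9, 0)
print(is_breaching(opened, 150, datetime(2026, 4, 13, 14, 0)))  # 5h wait vs 4h SLA
print(is_breaching(opened, 450, datetime(2026, 4, 13, 14, 0)))  # 5h wait vs 24h SLA
```

A bot that posts breaching PRs into the team channel closes the loop: the SLA is only an incentive if missing it is visible.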

Shift the review question from “is this correct?” to “do I understand this?” For AI-generated code, correctness is a lower bar than comprehension. A reviewer who cannot explain why the code is structured the way it is, and what happens when an edge case triggers a path they have not traced, has not completed a review. They have completed a syntax check. This is the comprehension dimension of cognitive debt that the ThoughtWorks Technology Radar identified in their April 2026 edition, and it matters more, not less, for AI-generated code.

Measure review throughput as a first-class engineering metric. Alongside deployment frequency and lead time, track: median time-to-first-review, median time-to-merge, PR queue depth by team, and the ratio of PRs opened to PRs merged per sprint. These numbers tell you where the review bottleneck is and how it is changing. Without them, you are optimizing writing velocity while flying blind on review velocity.
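The metrics above are straightforward to compute once PR events are exported. A minimal sketch, assuming a record shape with `opened_at`, `first_review_at`, and `merged_at` fields; in practice these would come from the GitHub API or a data warehouse, and the sample records here are fabricated for illustration.

```python
# A sketch of the review-throughput metrics listed above, computed from
# PR records. The record fields and sample data are assumptions for
# illustration only.

from datetime import datetime
from statistics import median

prs = [
    {"opened_at": datetime(2026, 4, 1, 9), "first_review_at": datetime(2026, 4, 1, 13),
     "merged_at": datetime(2026, 4, 2, 9)},
    {"opened_at": datetime(2026, 4, 1, 10), "first_review_at": datetime(2026, 4, 2, 10),
     "merged_at": None},  # still open: counts toward queue depth
    {"opened_at": datetime(2026, 4, 2, 9), "first_review_at": datetime(2026, 4, 2, 11),
     "merged_at": datetime(2026, 4, 3, 9)},
]

def hours(delta):
    return delta.total_seconds() / 3600

time_to_first_review = median(
    hours(pr["first_review_at"] - pr["opened_at"]) for pr in prs if pr["first_review_at"]
)
time_to_merge = median(
    hours(pr["merged_at"] - pr["opened_at"]) for pr in prs if pr["merged_at"]
)
queue_depth = sum(1 for pr in prs if pr["merged_at"] is None)
merge_ratio = sum(1 for pr in prs if pr["merged_at"]) / len(prs)

print(time_to_first_review)  # median hours to first review
print(time_to_merge)         # median hours to merge
print(queue_depth, merge_ratio)
```

Tracked weekly, the interesting signal is the trend: a rising time-to-first-review alongside flat deployment frequency is the review bottleneck announcing itself.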

The inventory problem your sprint board does not show

There is a manufacturing concept called work-in-progress inventory: partially finished goods sitting on the factory floor, consuming resources, at risk of becoming obsolete before they ship. Lean manufacturing disciplines treat WIP inventory reduction as a primary lever for improving throughput. The principle transfers directly to software delivery.

Unreviewed code is inventory. A pull request sitting in a review queue represents completed work that cannot ship, tested behavior that cannot be deployed, engineering time that cannot be recovered if the change goes stale. In a pre-AI-tools organization, the inventory problem was manageable because code production rates were slow enough that review capacity could keep pace. In an AI-accelerated organization, that assumption breaks down.

The organizations that will maintain delivery velocity in an AI-accelerated environment are not the ones generating the most code. They are the ones managing the full pipeline, including the review stage, as a coherent system. That means treating review throughput as an engineering discipline with its own metrics, practices, and tooling investment. It means normalizing the stacked PR workflow so that AI-generated changes arrive at reviewers in reviewable form rather than as monolithic diffs that reviewers cannot responsibly evaluate. It means being honest that the PR abandonment rate on large changes is not a reviewer attitude problem; it is a capacity problem with a structural cause.

GitHub’s decision to build Stacked PRs reflects a clear read on where the industry is heading. The question for every engineering leader is whether their team’s practices will adjust at the same pace or whether review backlogs will become the new sprint velocity excuse.

If you are working through how to adjust your review practices for AI-accelerated development, or if your team has adopted AI coding tools and is finding that the gains are not translating cleanly into delivery throughput, we are glad to think through it with you.
