Stripe's Minions system generates 1,300+ pull requests per week — every one written by AI, zero by humans. The model they use is available to everyone. The infrastructure they built is not. Here's what's actually different.
Last week, Stripe's engineering team merged over 1,300 pull requests.
Human engineers reviewed them. Human engineers approved them. But not a single line of code in any of them was written by a human hand.
All 1,300+ were generated by Minions — Stripe's unattended coding agents. They run on pre-warmed EC2 instances that spin up in 10 seconds. They read tickets, write code, run tests, and submit PRs without anyone asking them to. Engineers come in and find finished work waiting for review.
Here's the thing that should stop you: the AI model Stripe uses for this is available to you right now. Claude. GPT. Same models your team is already using to autocomplete function signatures.
The difference isn't the model. It's everything around the model.
The Model Isn't the Bottleneck. The Infrastructure Is.
Most teams that try to deploy AI agents at scale hit the same wall. They build a clever agent. They run it against a few test cases. The demo works beautifully. Then they try to use it in production and discover that the agent keeps hallucinating internal API names, can't find the right documentation, doesn't know which Jira ticket to reference, and has no idea how your CI system works.
The agent isn't broken. The infrastructure is missing.
Emily Glassberg Sands, Head of Data & AI at Stripe, put it plainly in a recent interview: "The model is a commodity. Your data architecture is the competitive advantage."
This is the insight most teams are still missing in 2026. You can swap models. You can't swap the institutional knowledge, tooling connections, and execution environment you've built for your agents. That infrastructure is the moat.
Stripe spent two years building theirs.
The Toolshed: 400+ Tools as First-Class Infrastructure
At the center of Stripe's AI engineering stack is something called the Toolshed — a centralized MCP server that exposes over 400 tools to any agent running inside Stripe's infrastructure.
Not 400 variations of "call this endpoint." Four hundred distinct, curated connections to the systems engineers actually use:
• Internal documentation search — not a RAG hack on top of Confluence, but production-grade search against Stripe's actual knowledge base
• Code intelligence via Sourcegraph — agents can navigate the codebase the same way senior engineers do
• Ticket details from internal issue trackers — agents understand what they're building and why
• Build statuses from CI pipelines — agents know when they've broken something
• Slack, Google Drive, Git, query engines — the full operating environment of a real Stripe engineer
Every agent at Stripe — whether it's Claude Code running locally, a custom bot, or a Minion spinning up on EC2 — connects to the same Toolshed. The tools aren't scattered across individual agent configurations. They're shared infrastructure that every AI system at the company inherits automatically.
This is architecturally significant. When you build tools per-agent, you get fragmentation: different agents knowing different things, with no shared foundation. When you build tools as infrastructure, every agent starts with the full context of how your company works.
The Toolshed doesn't give agents all 400 tools at once — access is curated by task type to avoid context bloat. But the toolset exists, it's maintained, and it grows. Adding a new internal system doesn't require updating each agent individually. It means adding one tool to the Toolshed that every agent can now use.
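The shared-registry-with-curation pattern is easy to sketch. The following is an illustrative Python sketch, not Stripe's implementation; the tool names, task types, and `ToolRegistry` class are all hypothetical:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[..., str]
    tags: set[str] = field(default_factory=set)

class ToolRegistry:
    """One shared registry every agent connects to (Toolshed-style)."""
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        # Registering here makes the tool available to every agent at once;
        # no per-agent configuration needs to change.
        self._tools[tool.name] = tool

    def for_task(self, task_type: str) -> list[Tool]:
        # Curate by task type so no single agent gets all 400 tools
        # stuffed into its context window.
        return [t for t in self._tools.values() if task_type in t.tags]

registry = ToolRegistry()
registry.register(Tool("docs_search", "Search internal docs",
                       lambda q: f"results for {q!r}", {"coding", "support"}))
registry.register(Tool("ci_status", "Fetch build status",
                       lambda pr: f"build for {pr}", {"coding"}))

coding_tools = registry.for_task("coding")
print([t.name for t in coding_tools])  # both tools carry the "coding" tag
```

The design point is the indirection: agents depend on the registry, not on individual tools, so adding a new internal system is one `register` call rather than N agent-config updates.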
Minions: Unattended at Scale
The Toolshed is the context layer. Minions are the execution layer.
Stripe's Minions are built on a customized fork of Block's open-source Goose project. They run in isolated "devboxes" — sandboxed development environments that have no access to production systems and can be spun up or destroyed without risk. When a Minion makes a mistake, it's contained. When it finishes a task, the devbox disappears.
This isolation is the unlock that makes "unattended" possible.
Most companies run agents with human-in-the-loop checkpoints because they're afraid of what the agent might do if left alone. Fair concern — if the agent can touch production databases, you should be afraid. Stripe solved this by making production access architecturally impossible during the development loop. Minions can't break anything real, so they don't need supervision.
The result: Minions run tasks to completion, not to a checkpoint. They read the ticket, write the implementation, run the tests, fix the failures, and submit the PR. The engineer sees finished work, not a draft waiting for direction.
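That run-to-completion loop can be expressed as a short control flow. This is a sketch of the pattern only; `devbox`, `agent`, and every method on them are hypothetical names, not Stripe's API:

```python
MAX_FIX_ATTEMPTS = 5  # after this many failed fixes, escalate to a human

def run_minion(ticket, devbox, agent):
    """Run one task to completion inside an isolated devbox.

    The devbox has no production access, so the loop needs no human
    checkpoints: the only exits are a finished PR or an explicit
    give-up. All names here are hypothetical.
    """
    context = devbox.checkout(ticket.repo)          # sandboxed working copy
    patch = agent.implement(ticket, context)        # write the change
    for _ in range(MAX_FIX_ATTEMPTS):
        result = devbox.run_tests(patch)            # contained: can't touch prod
        if result.passed:
            pr = devbox.open_pull_request(patch, ticket)
            devbox.destroy()                        # ephemeral environment
            return pr                               # engineer reviews finished work
        patch = agent.fix(patch, result.failures)   # iterate on the failures
    devbox.destroy()
    return None  # flag for a human instead of submitting something broken
```

Note that "unattended" does not mean "unbounded": the retry cap and the destroy-on-exit are what keep a misbehaving agent cheap.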
What does that look like at scale? One number tells the story: a pan-European local payment method integration that previously took two months now takes two weeks, and is trending toward days as the Minion pipeline matures.
The Data Layer Nobody Talks About
The Toolshed and Minions are the visible part of Stripe's AI infrastructure. The invisible part is the data architecture underneath them.
Stripe built Hubert — a text-to-SQL assistant that processes roughly 900 queries per week. Not because SQL is hard, but because finding the right dataset in a 10,000-person company is hard. Most AI productivity gains evaporate when agents can't find the data they need to do their jobs.
Hubert solves the data discovery problem by maintaining a semantic layer on top of Stripe's data catalog. Engineers and agents describe what they want in natural language; Hubert finds the right tables, writes the query, and returns results. No more hunting through data catalogs or asking the analytics team which table has which field.
This is the kind of infrastructure investment that doesn't show up in benchmark comparisons. You can't measure "time spent searching for the right dataset" easily. But every engineer who's lost hours trying to find the right data source recognizes the problem immediately.
Stripe's bet is that removing friction from data access compounds over time. When agents can answer data questions instantly, they build better code. When engineers can query any dataset without a lookup, they make better decisions. Small friction reduction, multiplied by 8,500 people and all their AI agents, adds up fast.
Measuring What Actually Matters
Stripe published an integration benchmark alongside the Minions announcement that's worth understanding on its own terms.
They built a realistic test environment that mirrors actual Stripe integration work — implementing checkout flows, configuring webhooks, handling edge cases in the payments API. Not synthetic LeetCode-style problems. Work that resembles what their agents actually do.
Claude Opus 4.5 scored 92% on Stripe's integration benchmark for full-stack API integration tasks. That number only means something in context: these are tasks that previously required senior engineers who understood both the Stripe API and the customer's codebase. At 92%, an agent handles the majority of integration work without human intervention.
Crucially, Stripe measures developer success not just by PR count but by "developer-perceived productivity." They've explicitly resisted making PR volume or lines-of-code the primary metric — a choice that reflects a mature understanding of what AI productivity actually means.
PR count is easy to inflate. A proliferation of small, trivial PRs can look like high productivity while actually fragmenting the codebase and creating review burden. Stripe cares about whether engineers feel like they're doing better work, faster. The 1,300 PRs per week are a consequence of that productivity, not the target.
What Your Team Can Steal From This
Most teams can't replicate Stripe's Toolshed tomorrow. Building 400 MCP tools requires sustained investment in infrastructure that most engineering organizations haven't made.
But the architecture is portable. The lesson isn't "build exactly what Stripe built." The lesson is: treat your agent's tooling as infrastructure, not configuration.
That shift changes how you think about every AI adoption decision:
Instead of: Building a custom agent that queries your documentation
Think: Building a documentation search tool that every agent can use
Instead of: Configuring Claude Code with your CI/CD details
Think: Building an MCP server that exposes build status, test results, and deployment state — and connecting it to every agent that runs in your environment
Instead of: Training each new hire on "how to use AI effectively"
Think: Building the shared context layer that makes every agent immediately effective at your company
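The "tooling as infrastructure" shift in the three examples above fits in a few lines of code. This sketch is hypothetical, with a made-up docs index; the point is that the tool is built once and every agent wires into the same instance:

```python
# One documentation-search tool, built once, shared by every agent.
# Index contents and agent wiring are illustrative, not any real system.
DOCS_INDEX = {
    "deploys": "Deploys go through the release train; see the deploy runbook.",
    "oncall": "Oncall rotations are managed in the paging tool.",
    "testing": "Run the full suite with the standard test runner before merging.",
}

def search_docs(query: str) -> list[str]:
    """Shared tool: agents call this; none embeds its own copy of the index."""
    q = query.lower()
    return [text for topic, text in DOCS_INDEX.items() if topic in q]

# Two different agents, zero duplicated search logic or index maintenance:
coding_agent_tools = {"search_docs": search_docs}
support_agent_tools = {"search_docs": search_docs}

print(coding_agent_tools["search_docs"]("how do deploys work"))
```

When the index improves, every agent improves at once. That is the compounding the per-agent approach forfeits.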
The difference between these approaches is compounding. Agent-specific tooling requires maintenance every time you add a new agent. Shared infrastructure gets more valuable as you add more agents and more engineers using those agents.
Stripe has 8,500 employees using AI tools daily — roughly 85% of the company. That number doesn't happen by accident. It happens when the infrastructure makes AI immediately useful to anyone who sits down and starts working.
The Bottom Line
Stripe isn't winning because they have access to better models. They're winning because they've solved the infrastructure problem that makes models actually useful at scale.
The Toolshed gives agents institutional knowledge. Devboxes give agents autonomy without risk. The data layer gives agents access to the information they need. The benchmark gives them a way to measure progress honestly. Together, these systems turn a general-purpose AI model into a reliable member of Stripe's engineering team.
Other companies using the same Claude API are getting copilot-grade productivity gains. Stripe is getting 1,300 PRs per week with zero human authorship.
The model is the same. The gap is entirely in the infrastructure.
Emily Glassberg Sands had it exactly right: the model is a commodity. The question isn't which model to use. It's what data your models can access, what tools they can use, and what environments they can operate in safely.
Build that infrastructure. That's the competitive advantage that doesn't transfer.
Related: Three Engineers. 32,000 Lines of Production Code. Zero Written by Hand. — How StrongDM built a fully autonomous software factory with no human code authors.
Related: The LLM Isn't the Bottleneck Anymore. The Ecosystem Is. — Why your workflow architecture matters more than your model choice.