Every internal AI agent starts its life as a Teams/Slack screenshot.
In our case, it was Teams — that's where our engineers live, so that's where I wired it in. The specifics below are the Teams flavor, but nothing about the architecture is platform-specific.
"Look, it answered a Jira question." "Look, it opened a PR." The demo lands, everyone claps, and then the agent quietly dies in a private channel six weeks later because nobody trusted it to touch anything that mattered.
The shift from demo to production isn't about the model. It isn't about the prompt. It's about a boring, unglamorous question: does the agent actually know what your infrastructure looks like?
The day the answer became "yes" for the internal DevOps agent I've been working on, everything changed. I want to walk through what that looked like — the architecture, the specific moments it started behaving like a teammate instead of a chatbot, and why I think the next five years of engineering work is going to look almost nothing like the last five.
We run a large fleet of internal services, each of them eventually landing in a managed Kubernetes cluster through a three-repo fan-out: service repo → Helm charts repo → Argo CD apps repo. Every new deploy has to touch the correct ingress class (one for private services, another for public), SealedSecrets pinned to a specific namespace, a standardized GitHub Actions layout, canary defaults in Argo Rollouts, and a long list of other conventions that live mostly in senior engineers' heads.
Onboarding a new service the "correct" way is a couple of hours of someone walking a junior through the rules. The rules drift. New repos get created that violate them. Conventions rot. Tribal knowledge evaporates every time someone leaves.
The obvious fix is documentation. We've all tried documentation. Documentation drifts faster than code.
The less obvious fix is an agent that derives conventions from the live fleet itself, because in a GitOps shop the cluster is the source of truth. If cluster == git == desired state, then you don't need a human to tell you what "correct" looks like. You can look at what's running.
That's the thesis the whole thing is built on.
• Microsoft Teams bot — the front door. Accepts DMs ("deploy this repo"), GitHub webhook events (new PR created), and @mentions in channels. Runs as a Python Bot Framework service in the cluster.
• An agentic router — a cheap Haiku tool-use call that classifies every incoming Teams DM into one of a handful of intents: start a build, forward a knowledge question, look up a ticket, list my recent tickets, refresh conventions, or send a plain text reply. Without this, every "hi" was spawning a full agent session.
• Anthropic Managed Agents — the actual reasoning runs inside Anthropic's beta.agents/environments/vaults/sessions API. More on why this mattered below.
• Three MCP servers — Atlassian (Jira + Confluence), GitHub (via the official GitHub MCP), and a custom Go service called k8s-mcp that exposes read-only Kubernetes discovery as tools.
• A Greptile review-fix loop — once the agent opens a PR, a background loop triggers Greptile, waits for the review, spawns a fix session, pushes, re-triggers, and repeats up to 10 iterations until the review hits 5/5.
That last point is the one that separates the demo from the product. Nobody wants an agent that opens half-broken PRs and waits for a human to babysit them.
Before any of the architecture is worth anything, here is what a real deploy DM actually looks like, from the moment the agent picks up the request to the moment the service is live: a terminal-style feed of status lines scrolling by in Teams while the engineer gets coffee.
Managed Agents Changed the Bootstrap Math
I've built enough agent infrastructure in the last two years to have strong opinions about the setup cost. Token refresh logic, per-session credential injection, MCP server lifecycles, sandbox environments, network egress rules — it's the part nobody talks about because there's no clever prompt involved, but it's where most internal agents get stuck.
The Managed Agents beta collapses that into three primitives: an environment (the sandbox with allowed hosts), an agent (model + system prompt + tool declarations), and a vault (OAuth credentials and static bearers that the platform rotates for you). Once the three exist, launching a session is a single API call.
Here's the actual bootstrap — roughly thirty lines to stand up an agent that can talk to Jira, GitHub, and our internal Kubernetes cluster:
```python
import anthropic

client = anthropic.Anthropic()

environment = client.beta.environments.create(
    name="devops-agent-prod",
    config={
        "type": "cloud",
        "networking": {
            "type": "limited",
            "allowed_hosts": [
                "mcp.atlassian.com", "api.atlassian.com",
                "api.githubcopilot.com", "api.github.com",
                "k8s-mcp.internal.example.com",
            ],
        },
    },
)

agent = client.beta.agents.create(
    name="DevOpsAgent",
    model="claude-opus-4-7",
    system="You are our engineering agent. Turn chat messages into PRs. ...",
    tools=[
        {"type": "agent_toolset_20260401", "default_config": {"enabled": True}},
        {"type": "mcp_toolset", "mcp_server_name": "atlassian"},
        {"type": "mcp_toolset", "mcp_server_name": "github"},
        {"type": "mcp_toolset", "mcp_server_name": "kubernetes"},
    ],
    mcp_servers=[
        {"type": "url", "name": "atlassian", "url": "https://mcp.atlassian.com/v1/sse"},
        {"type": "url", "name": "github", "url": "https://api.githubcopilot.com/mcp/"},
        {"type": "url", "name": "kubernetes", "url": "https://k8s-mcp.internal.example.com/mcp"},
    ],
)

vault = client.beta.vaults.create(display_name="devops-agent-credentials")
```
The part that used to be a two-week Jira epic — credential rotation, refresh tokens, OAuth redirect flows — becomes a few vaults.credentials.create calls. Atlassian's refresh token, GitHub's long-lived PAT, and a static bearer for the k8s MCP are all owned by the platform after this runs once. The worker that launches sessions never sees a token.
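For completeness, here is the shape of those credential calls as a sketch. The exact beta signatures and credential payload fields below are assumptions — check the current API reference before copying — but the one-call-per-integration pattern is the point:

```python
def store_credentials(client, vault_id: str) -> None:
    """Sketch: hand each integration's secret to the platform once.

    `client` is an anthropic.Anthropic() instance. The argument shapes
    below are assumptions about the beta vaults API, not verified
    signatures; the "..." placeholders stand in for real secrets.
    """
    # Atlassian: an OAuth refresh token the platform rotates for us.
    client.beta.vaults.credentials.create(
        vault_id=vault_id,
        name="atlassian",
        credential={"type": "oauth_refresh", "refresh_token": "..."},
    )
    # GitHub: a long-lived PAT.
    client.beta.vaults.credentials.create(
        vault_id=vault_id,
        name="github",
        credential={"type": "bearer", "token": "..."},
    )
    # k8s-mcp: a static bearer for the internal MCP server.
    client.beta.vaults.credentials.create(
        vault_id=vault_id,
        name="kubernetes",
        credential={"type": "bearer", "token": "..."},
    )
```

After this runs once, the worker that launches sessions never touches a token again.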
That's the part I think most internal teams underestimate when they read "build your own agent." The prompt is easy. The plumbing is what kills you. Managed Agents doesn't make the plumbing free, but it makes it small enough to fit on one screen.
Why Kubernetes Metadata Is the Moat
The first version of the agent had two MCP servers: Atlassian and GitHub. It could answer Jira questions, open PRs, comment on issues. Everyone was impressed for about four days. Then the first real deploy request came in — "deploy this repo to staging" — and the agent produced three PRs that were confidently, specifically wrong.
Wrong ingress class. Wrong image registry path. SealedSecret sealed against the wrong namespace. A plain Deployment where the rest of the fleet uses canary Rollouts.
None of that information lives in GitHub. It lives in the live cluster state. The fleet is the ground truth. And without a tool that can query the fleet, a brilliant model produces brilliant hallucinations.
That's the point where we wrote k8s-mcp, a Go service that implements the MCP protocol and exposes five read-only discovery tools — k8s_list_kinds, k8s_list, k8s_get, k8s_describe, and k8s_logs — plus, eventually, a sixth, k8s_seal_secret. It's discovery-based — every GVR in the cluster is queryable, including CRDs we hadn't thought about. No informers, no persistent state. The agent asks "what does a healthy stateless HTTP service look like in this cluster?" and gets back real YAML from real Pods.
The practical effect: the same "deploy this repo" request now produces three PRs that match an exemplar already running in the cluster. The right ingress class, because that's what comparable private services use here. The right image registry path, because that's the pattern in the fleet. SealedSecret sealed against the correct namespace, because the agent checked. An Argo Rollout with canary steps copied from a neighbor, not invented from training data.
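That "copy a healthy neighbor" behavior mostly boils down to counting what the fleet already does. A toy sketch — the manifests here are hypothetical stand-ins for what the k8s_list tool would return for comparable services:

```python
from collections import Counter

def dominant(values):
    """The most common value across the fleet is the convention."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical stand-ins for what k8s_list returns for comparable
# private services already running in the cluster.
fleet = [
    {"ingress_class": "nginx-internal", "registry": "registry.internal/teams"},
    {"ingress_class": "nginx-internal", "registry": "registry.internal/teams"},
    {"ingress_class": "nginx-public",   "registry": "registry.internal/teams"},
]

conventions = {
    "ingress_class": dominant(m["ingress_class"] for m in fleet),
    "registry": dominant(m["registry"] for m in fleet),
}
print(conventions)
# → {'ingress_class': 'nginx-internal', 'registry': 'registry.internal/teams'}
```

The real agent does this with live manifests and far more fields, but the principle is the same: the answer to "what is correct here?" is a query, not a document.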
Model quality barely moved. The fix was giving the model access to the metadata that describes reality.
This is the generalizable lesson. Every company has a version of this. For Stripe it's their internal APIs. For Shopify it's their schema registry. For you it's probably some mix of infrastructure state, an ADR repo, a conventions document, and a few Confluence pages that only one person remembers exist. Until an agent can see that, it's guessing. The moment it can, it stops being impressive and starts being useful.
The Review-Fix Loop
The second moment the agent stopped being a toy: the day Greptile's review score went from 3/5 to 5/5 without a human intervening.
The flow runs entirely in the Teams bot:

1. The agent opens the PRs.
2. The bot comments @greptileai please review on each.
3. A background poll waits for Greptile to return a score.
4. If the score is below 5/5, the bot dispatches a fix session — a second managed agent session with the review comments as input and instructions to push fixes to the same branch.
5. After the push, the bot re-triggers Greptile and loops.

The loop runs up to 10 iterations, in parallel across PRs via asyncio.gather.
The key trick is that fix sessions run the same agent with the same conventions context. They don't need to re-discover the fleet — the system prompt already has it. The loop is cheap because each iteration is focused: "here are three specific review comments, here is the PR diff, produce a patch."
The code for the orchestration layer is about 200 lines. The hard part was not the loop — it was giving fix sessions enough context to not undo the original PR's work while fixing the review findings. That's where the compiled conventions prompt matters. The fix session knows the rules because they're baked into the agent version, not passed per-request.
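Stripped of the Teams and Greptile specifics, that control flow fits in a few dozen lines. A self-contained sketch with stubbed review and fix calls — greptile_score and dispatch_fix_session here are placeholders that simulate the behavior, not real APIs:

```python
import asyncio

MAX_ITERATIONS = 10
TARGET = 5  # Greptile score out of 5

async def greptile_score(pr: str, attempt: int) -> int:
    """Stub for 'trigger a review, poll until a score comes back'."""
    await asyncio.sleep(0)           # stand-in for the polling delay
    return min(3 + attempt, TARGET)  # pretend each fix pass improves the score

async def dispatch_fix_session(pr: str, score: int) -> None:
    """Stub for launching a managed-agent fix session on the same branch."""
    await asyncio.sleep(0)

async def review_fix_loop(pr: str) -> int:
    for attempt in range(MAX_ITERATIONS):
        score = await greptile_score(pr, attempt)
        if score >= TARGET:
            return score             # review passed, stop iterating
        await dispatch_fix_session(pr, score)
    return score                     # cap hit: hand back to a human

async def main():
    # Parallel across PRs, as the bot does with asyncio.gather.
    prs = ["pr-1", "pr-2", "pr-3"]
    return await asyncio.gather(*(review_fix_loop(pr) for pr in prs))

scores = asyncio.run(main())
print(scores)  # → [5, 5, 5]
```

The cap and the early return are the whole safety story: the loop either converges or stops and escalates.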
Try This Yourself
Start small. The pattern works at any scale.
Pick one workflow your engineers hate. Not "automate everything." One thing. For us it was deploys. For you it might be onboarding a new microservice, promoting a staging release, or closing out completed Jira tickets.
Name the three metadata sources the agent needs to do it correctly. Almost every internal workflow sits on top of three pieces of context: ticket state (Jira/Linear), code state (GitHub), and runtime state (cluster/APM/database). Skip any of them and the agent hallucinates.
Write the thinnest possible MCP server for the piece that isn't already available. Our Kubernetes MCP service is a small Go binary — under a thousand lines, written in about a week. Not a month. The MCP protocol is deliberately small — list tools, call tool, return content. If you can write an HTTP handler, you can write an MCP server.
```go
// The whole tool contract is this shape:
type Tool struct {
	Name        string
	Description string
	InputSchema json.RawMessage
}

func (s *Server) handleCall(ctx context.Context, name string, args json.RawMessage) (mcp.Result, error) {
	switch name {
	case "k8s_list_kinds":
		return s.listKinds(ctx, args)
	case "k8s_list":
		return s.list(ctx, args)
	case "k8s_get":
		return s.get(ctx, args)
	case "k8s_describe":
		return s.describe(ctx, args)
	case "k8s_logs":
		return s.logs(ctx, args)
	}
	return mcp.Result{}, fmt.Errorf("unknown tool: %s", name)
}
```
Put a cheap router in front of the expensive model. Every DM goes through a Haiku tool-use call that decides whether to invoke the full agent or reply directly. The router prompt is ~30 lines. It cuts your agent spend by roughly the fraction of DMs that are greetings, which in our case was about 40%.
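The router is just a forced tool call with an enum of intents. A sketch of the tool declaration and the dispatch behind it — the intent names match our list above, but the handler mapping is illustrative and the Haiku call itself is elided:

```python
INTENTS = [
    "start_build", "knowledge_question", "lookup_ticket",
    "list_my_tickets", "refresh_conventions", "text_reply",
]

# The single tool handed to the cheap model; forcing tool_choice to it
# makes the model classify every DM into exactly one intent.
ROUTER_TOOL = {
    "name": "classify_intent",
    "description": "Classify an incoming Teams DM into one intent.",
    "input_schema": {
        "type": "object",
        "properties": {"intent": {"type": "string", "enum": INTENTS}},
        "required": ["intent"],
    },
}

def route(intent: str) -> str:
    """Dispatch: only start_build spawns a full (expensive) agent session."""
    if intent == "start_build":
        return "spawn_agent_session"
    if intent == "text_reply":
        return "reply_directly"
    return "lightweight_handler"  # MCP lookup, no full session

print(route("text_reply"))  # → reply_directly
```

The enum is the contract: adding an intent is one list entry plus one branch, and everything that isn't a build request stays off the expensive path.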
Close the loop with review automation. An agent that opens PRs a human has to fix is a net negative. An agent that opens PRs and fixes its own review findings is a teammate. Wire Greptile (or any review tool with a scoring API) into your bot and auto-dispatch fix sessions until the score passes. Cap the loop at 10 iterations so it can't go infinite.
Deploy it where work already happens. Teams, Slack, Discord — whatever your company uses. The front door shouldn't be a new interface engineers have to learn. It should be the DM they were already going to send to the DevOps person who used to do this manually.
A weekend project can cover steps 1, 2, 4, and 6. Step 3 is a week if the metadata source is custom. Step 5 is a week if you've never built a review loop before. Two to three weeks, end to end, for a working internal agent. That's the real timeline, not the six-month "AI platform" timeline most companies quote.
Why This Matters More Than the Next Model Release
Here's the thing that's been bothering me for the past year.
The discourse around AI in engineering is still mostly about coding. Will AI replace programmers? How much of my code is written by Copilot? What's the best model for refactoring?
That framing is already obsolete. Coding was never the bottleneck for most companies. The bottleneck is the coordination overhead around code: deploys, reviews, ticket hygiene, convention enforcement, onboarding, incident response, the eighty tiny rituals that make a codebase actually ship to production. Those rituals consume more senior engineer time than writing code ever did.
Agents that understand the metadata layer — the conventions, the cluster state, the ticket graph, the review criteria — automate the rituals. The ones that don't, keep automating the thing that was already easy: generating code.
I think the five-year trajectory is this: coding itself becomes a smaller and smaller fraction of the engineering job, because a model that can see your metadata can produce correct code in one shot for most problems. What remains is judgment about what to build, what to connect, and what to automate next. The teams that figure out the plumbing now — MCP servers over their internal systems, agents wired into where work happens, review loops that close themselves — will compound that advantage every quarter.
The teams that keep running agent demos in isolated channels will keep wondering why AI hasn't "changed anything yet."
The Bottom Line
The moment our DevOps agent stopped being a demo wasn't when the model got smarter. It was when the agent could see the cluster.
Managed Agents made the bootstrap plumbing small. MCP made the metadata integration small. The custom Go service for Kubernetes was a week of work. The review-fix loop was a weekend. None of these pieces are clever individually — stacked on top of each other, they turn a chatbot into the thing a senior engineer would actually let touch production.
If you work at a company that runs on Kubernetes, GitHub, and a ticket system, you already have ninety percent of what you need to build this. The missing ten percent is a small service that teaches an agent what your infrastructure actually looks like. Write it. The day your agent stops guessing and starts reading is the day your engineering team starts compounding.
Coding is not the moat anymore. The metadata is.