Most Teams Put Agents on a Schedule With No Definition of Done. Anthropic Open-Sourced the Pattern That Fixes It.

By Haim Ari · 2026-07-04T10:00:00

launch-your-agent is a Claude Code skill that takes a team from idea to a live Claude Managed Agent — interview, launch, grade, schedule. For organizations, the value isn't the launch. It's that the default kickoff grades every run against a rubric your team wrote and reviews like code. Here's when Managed Agents is actually the right tool, a real weekly-audit example built step by step, and the discipline worth copying.

Here's a failure mode that shows up in every organization running agents at scale.

Someone builds an agent for a recurring job — a nightly data sync, a weekly dependency audit, a daily digest. It works on the demo input, it goes on a cron schedule, and everyone moves on. Nobody watches it run. Three weeks later the digest has been quietly wrong since the second run, because a source changed shape and the agent started guessing. Nobody caught it, because nobody wrote down what "correct" meant in a form that gets checked on every run.

That's not a model problem. The model is fine. It's a process gap: the agent was shipped with no machine-checkable definition of done.

Anthropic recently open-sourced a Claude Code skill called launch-your-agent that takes you from an idea to a live, deployed agent on Claude Managed Agents (CMA) — interview, launch, grade, schedule. It's a reference implementation: Apache 2.0, three commits, not maintained. For an individual it's a nice on-ramp. For a team, the useful part is the one thing it makes non-optional — every run gets graded against a rubric you defined up front. This post is about why that matters for organizations, when Managed Agents is actually the right tool, and how to wire the pattern in with a worked example.

What the repo is, in team terms

launch-your-agent isn't a framework. It's a workflow encoded as a skill — a SKILL.md and a few reference files under .claude/skills/. You clone the repo, open Claude Code inside it, and run /launch-your-agent. The skill runs a four-phase session: interview to scope the smallest version that does the job, stage and launch it into your own Console account, grade it against a rubric, and — if the job repeats on a clock — put it on a scheduled deployment.

The output that matters for a team is the my-agent/ folder it writes to disk: the build sheet, the exact API payloads, a resumable launch.sh, an eval scaffold, and a NEXT-DIRECTIONS.md laying out v1/v2 as a numbered plan. That folder is committable. It's a versioned, reviewable design copy of a production agent — the opposite of an agent that lives in one engineer's terminal history and can't be handed off.

When your team should actually use Managed Agents

Before the how, the honest when. Managed Agents is one of two ways to build on Claude, and it's the wrong choice for plenty of workloads. Anthropic is specific about this in the overview docs, and it's worth reading before you commit a team to it.

Managed Agents is built for workloads that need:

• Long-running execution — tasks that run for minutes or hours across many tool calls.

• Cloud or self-hosted sandboxes — secure execution with pre-installed packages and network rules, on Anthropic's infra or your own for data-residency needs.

• Minimal infrastructure — you don't build the agent loop, the sandbox, or the tool-execution layer.

• Stateful sessions — persistent filesystem and conversation history across interactions, resuming cleanly after pauses.

• Scheduled execution — recurring runs on a cron schedule via deployments.

Reach for something else when your workload doesn't match:

The sweet spot for an organization is the unattended, recurring work where all the "use it" criteria line up at once: long-running, scheduled, stateful, and not worth building infrastructure for. The nightly sync. The weekly scan. The daily digest. And that's exactly the category where nobody is watching each run — which is why grading is the part you can't skip.

The thing teams skip: a definition of done

For an interactive tool, a human is the grader. They read the output, notice it's off, and re-prompt. For an agent running on a schedule at 3 a.m., there is no human in the loop. The only thing standing between "produced output" and "produced correct output" is whatever check runs automatically.

launch-your-agent makes that check the default. Its kickoff isn't "here's the task, go." It's an Outcome: a task plus a rubric, plus a grader that reads the output and returns pass or needs-revision. The skill's own description of it is the whole thesis in one line:

the Outcome — your definition of done, the checklist every run is graded against.

For a team, the rubric is more than a quality gate. It's a shared, reviewable artifact. It lives in the repo next to the agent config. A new engineer reads it to understand what the agent is supposed to guarantee. It gets reviewed in a PR like any other code. When the agent's output is disputed — "why did it flag this?" — the rubric is the answer, in writing, versioned. That's the difference between an agent a team owns and an agent one person babysits.

How the outcome loop works

The mechanism is a single event. After a session is created, instead of sending a plain user.message, you send user.define_outcome, and the agent works against a grader from the start.

The event carries three fields:

• description — the task in plain language.

• rubric — required. Markdown with explicit, binary per-criterion checks. Inline as {"type": "text", "content": "…"} or a reference to an uploaded file.

• max_iterations — optional, default 3, max 20.
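As a rough sketch, the three fields can be assembled and sanity-checked before anything is sent. `build_outcome_event` is a hypothetical helper, not part of the SDK. The field names, the default of 3, and the cap of 20 come from the list above, while the lower bound of 1 is an assumption.

```python
from typing import Any

DEFAULT_MAX_ITERATIONS = 3  # default stated in the field list above
MAX_ITERATIONS_CAP = 20     # maximum stated in the field list above


def build_outcome_event(
    description: str,
    rubric_markdown: str,
    max_iterations: int = DEFAULT_MAX_ITERATIONS,
) -> dict[str, Any]:
    """Hypothetical helper: assemble a user.define_outcome event as a plain dict."""
    if not rubric_markdown.strip():
        raise ValueError("rubric is required")
    # Lower bound of 1 is an assumption; only the cap of 20 is stated in the text.
    if not 1 <= max_iterations <= MAX_ITERATIONS_CAP:
        raise ValueError(f"max_iterations must be between 1 and {MAX_ITERATIONS_CAP}")
    return {
        "type": "user.define_outcome",
        "description": description,
        "rubric": {"type": "text", "content": rubric_markdown},
        "max_iterations": max_iterations,
    }


event = build_outcome_event(
    "Summarize this week's dependency changes.",
    "1. Every changed dependency appears in the summary.",
)
print(event["max_iterations"])  # 3
```

Building the payload in one place gives a team a single spot to reject an empty rubric before a run starts.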

When the agent produces output, a grader — auto-provisioned with its own separate context window, so it isn't swayed by the agent's own reasoning — evaluates the result against your rubric and returns one of:

The loop is the payoff. The agent doesn't hand you one plausible answer and stop. It gets graded, reads why it fell short, and revises up to the cap. Three or four binary criteria turn "looks about right" into "passed criterion 2, failed criterion 3 because the summary omitted the dependency change" — a signal an on-call engineer can act on without re-reading the whole run.

One detail that keeps this cheap for a team: sharpening the rubric and changing the agent are separate operations. Tightening a criterion means editing the rubric and re-running — no new agent version. Changing instructions or tools mints a new agent version you can pin, diff, and roll back. So the team iterates on its definition of done for free, and only version-bumps when the agent itself changes.

A real example: a team's weekly dependency audit

The clearest team workload is the one Anthropic uses as the running example in its scheduled-deployments docs: a weekly compliance scan on a cron schedule. Let's build a concrete version of it — a security-and-dependency audit that runs every Monday morning, grades itself against a rubric your team reviews, and files a report — and wire it two ways.

Path A — let the skill build it with you

You need Claude Code installed and signed in, and an Anthropic API key from your own account (platform.claude.com → API keys). Runs cost cents.

```bash
git clone https://github.com/anthropics/launch-your-agent
cd launch-your-agent
claude
```

Then, inside Claude Code:

/launch-your-agent

It opens with a few example archetypes — a data analyst that turns a CSV into a report, an ops responder that investigates an alert and waits for approval, a recurring compliance scan — and one open question about what you want to build. Describe the weekly audit; it scopes a v0, writes the whole build kit, and stages everything before it ever asks for your key. Run /wrap-up at the end for a primitives recap and the overview page.

Read the SKILL.md while you're in there. Its operating rules — "change one thing per iteration," "v0 is the few features that make the job work," "no literal dates in a scheduled task" — are transferable team discipline whether or not you use CMA.

Path B — build it with the API, step by step

This is the part a team commits to the repo. The Python SDK sets the anthropic-beta: managed-agents-2026-04-01 header for you.

Step 1 — Define the agent. Model, instructions, and the prebuilt toolset (bash, read, write, grep, web_fetch, and the rest). Note the explicit never-do: agents on a schedule are the ones most likely to hallucinate a CVE, so the rubric will check for it and the system prompt forbids it.

```python
from anthropic import Anthropic

client = Anthropic()

agent = client.beta.agents.create(
    name="Weekly Dependency Auditor",
    model="claude-opus-4-8",
    system=(
        "You audit a repository's dependency changes over the last 7 days. "
        "Write your report to /mnt/session/outputs/report.md. "
        "Never invent a CVE ID or a version number you did not read from a real source."
    ),
    tools=[{"type": "agent_toolset_20260401"}],
)
```

Step 2 — Define the environment. A cloud sandbox with the tooling the job needs pre-installed. For a security scan you'd lock networking down to the hosts the agent legitimately needs; start unrestricted while you develop, then tighten.

```python
env = client.beta.environments.create(
    name="audit-env",
    config={
        "type": "cloud",
        "packages": {"pip": ["requests"], "npm": ["npm-check-updates"]},
        "networking": {"type": "unrestricted"},
    },
)
```

Step 3 — Start a session. One running instance of the agent inside that environment.

```python
session = client.beta.sessions.create(agent=agent.id, environment_id=env.id)
```

Step 4 — Write the rubric. This is the artifact your team reviews. Keep the criteria binary — each one is either true or false about the output, so the grader (and a human reading the PR) can't argue about it.

```python
rubric = """
# Definition of done: weekly dependency audit

Grade the report at /mnt/session/outputs/report.md against each criterion. Binary.

1. Coverage: every dependency changed in the last 7 days appears in the report.
2. Sourced: every CVE or advisory cited links to a real, reachable source. No invented IDs.
3. Severity: findings are grouped must-fix / should-fix / nit, each with a one-line reason.
4. Actionable: each must-fix names the exact version to upgrade to.
"""
```

Step 5 — Kick it off as an Outcome, not a bare message. This is the line that separates a demo from something you can leave running.

```python
client.beta.sessions.events.send(
    session.id,
    events=[{
        "type": "user.define_outcome",
        "description": "Audit dependency changes in this repo over the last 7 days.",
        "rubric": {"type": "text", "content": rubric},
        "max_iterations": 3,
    }],
)
```

Step 6 — Read the verdict, not the vibes. The agent runs, gets graded against your four checks, and revises up to three times. Fetch the result from the session rather than skimming the output:

```python
result = client.beta.sessions.retrieve(session.id)

for ev in result.outcome_evaluations:
    print(ev.result, "-", ev.explanation)
```

satisfied means it passed your rubric — not that it looked fine at a glance.
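One way to act on the verdict in a script or CI job is to turn it into an exit code. The sketch below is a hypothetical helper run on stand-in objects. It assumes each evaluation exposes `result` and `explanation` as in the snippet above, that evaluations arrive in chronological order, and that the non-passing value is spelled `needs_revision` (an assumption; only `satisfied` is named in this post).

```python
from types import SimpleNamespace


def gate_on_verdict(evaluations) -> int:
    """Return 0 only if the final evaluation is 'satisfied', else 1 (shell-style)."""
    if not evaluations:
        print("no evaluation recorded")
        return 1
    final = evaluations[-1]  # assumes chronological order
    print(f"{final.result}: {final.explanation}")
    return 0 if final.result == "satisfied" else 1


# Stand-in objects for illustration; a real run would pass the session's evaluations.
history = [
    SimpleNamespace(result="needs_revision", explanation="Criterion 4: no target version named."),
    SimpleNamespace(result="satisfied", explanation="All four criteria met."),
]
exit_code = gate_on_verdict(history)
```

A non-zero exit code is something an existing alerting pipeline already understands, so a failed rubric pages someone instead of sitting in a log.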

Step 7 — Put it on a schedule. Because the audit is weekly, it belongs on a scheduled deployment; CMA has cron built in, so there's no external scheduler to run. The rule the skill hammers on, and the single most common way a scheduled agent silently rots: the task text must use relative dates ("the last 7 days as of this run"), never a hard-coded date, because the same initial_events fire on every trigger.

```python
deployment = client.beta.deployments.create(
    name="weekly-dependency-audit",
    agent=agent.id,
    environment_id=env.id,
    initial_events=[
        {"type": "user.message", "content": [{"type": "text",
            "text": "Audit dependency changes over the last 7 days as of this run."}]},
        {"type": "user.define_outcome",
         "description": "Weekly dependency audit.",
         "rubric": {"type": "text", "content": rubric},
         "max_iterations": 3},
    ],
    # Standard five-field cron: minute 0, hour 9, any day of month, any month, Monday.
    schedule={"type": "cron", "expression": "0 9 * * 1", "timezone": "America/New_York"},
)

# Fire one manual run now so the team sees it work before trusting the cron.
client.beta.deployments.run(deployment.id)
```

Every Monday at 9 a.m. it launches a session, runs the audit, grades itself against the rubric, and files the report — unattended. Because the rubric rides along in initial_events, every scheduled run is held to the same standard as the one an engineer watched.
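The relative-dates rule from Step 7 is easy to break by accident, so it may be worth a small lint that runs before a deployment is created. This is a sketch under stated assumptions. `find_literal_dates` is a hypothetical helper, the patterns are illustrative rather than exhaustive, and it only inspects the event shapes used in this post.

```python
import re

_MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
          r"Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
_LITERAL_DATE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"              # ISO style, e.g. 2026-06-29
    r"|\b\d{1,2}/\d{1,2}/\d{2,4}\b"       # slashed style, e.g. 6/29/2026
    r"|\b" + _MONTH + r"\.?\s+\d{1,2}\b"  # month then day, e.g. June 29
    r"|\b\d{1,2}\s+" + _MONTH + r"\b"     # day then month, e.g. 29 June
)


def find_literal_dates(initial_events: list[dict]) -> list[str]:
    """Return every literal-date string found in the events' text fields."""
    hits: list[str] = []
    for event in initial_events:
        texts: list[str] = []
        if event.get("type") == "user.message":
            texts += [part.get("text", "") for part in event.get("content", [])]
        elif event.get("type") == "user.define_outcome":
            texts.append(event.get("description", ""))
            texts.append(event.get("rubric", {}).get("content", ""))
        for text in texts:
            hits += _LITERAL_DATE.findall(text)
    return hits


good = [{"type": "user.message", "content": [{"type": "text",
         "text": "Audit dependency changes over the last 7 days as of this run."}]}]
bad = [{"type": "user.define_outcome",
        "description": "Audit changes since 2026-06-29.",
        "rubric": {"type": "text", "content": "1. Covers June 29 onward."}}]

print(find_literal_dates(good))  # []
print(find_literal_dates(bad))   # ['2026-06-29', 'June 29']
```

Failing the commit when the list is non-empty keeps a hard-coded date from reaching a schedule where it would fire unchanged on every trigger.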

Step 8 — Deliver it where the team already looks. The report lands in the session's outputs; you fetch it via the Files API with scope_id=

Steal the pattern even without CMA

If your team isn't on Managed Agents, the discipline still ports. The shape is:

Write success as a short list of binary criteria — before the prompt, not after. If you can't, the task isn't specified yet.

Grade with a fresh context — a separate model call that sees only the rubric and the output, not the agent's reasoning, catches self-justification a single pass won't.

Loop on the verdict — feed the failure explanation back and revise a bounded number of times.

Keep the first passing output as a regression baseline so the next version has something to beat.

Those four steps work on top of Claude Code, a plain SDK loop, or a cron job you already have. They're the same idea as spec-driven agent development: the rubric is the spec, made executable.
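The generate, grade, revise shape can be sketched without any particular runtime. Everything below is illustrative. `run_until_satisfied` and `Verdict` are hypothetical names, and the two stub functions stand in for real model calls, with `libfoo` and its version number invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    explanation: str


def run_until_satisfied(
    generate: Callable[[Optional[str]], str],
    grade: Callable[[str], Verdict],
    max_iterations: int = 3,
) -> tuple[str, Verdict, int]:
    """Generate, grade with a separate call, and revise on failure, up to a cap."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    feedback: Optional[str] = None
    for attempt in range(1, max_iterations + 1):
        output = generate(feedback)  # producer sees only the last failure explanation
        verdict = grade(output)      # grader sees only the output, not the reasoning
        if verdict.passed:
            return output, verdict, attempt
        feedback = verdict.explanation
    return output, verdict, max_iterations


def fake_generate(feedback: Optional[str]) -> str:
    # Stand-in for a model call: the first draft omits the version, the revision adds it.
    return "Upgrade libfoo." if feedback is None else "Upgrade libfoo to 2.4.1."


def fake_grade(output: str) -> Verdict:
    # Stand-in for a fresh-context grader applying one binary criterion.
    ok = any(ch.isdigit() for ch in output)
    return Verdict(ok, "names a version" if ok else "must-fix item names no target version")


output, verdict, attempts = run_until_satisfied(fake_generate, fake_grade)
print(attempts, verdict.passed)  # 2 True
```

Passing the producer and the grader in as separate callables keeps their contexts apart by construction, which is the property the second step depends on.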

What this teaches an organization

Stripped of the CMA specifics, launch-your-agent is a compact course in shipping agents a team can trust, taught by example in its own SKILL.md:

• Scope a v0, defer the rest on purpose. One agent, full toolset, drafts-only, three iterations. Everything bigger goes into NEXT-DIRECTIONS.md as a numbered version with the exact mechanism attached — "not yet" always ships with "and here's how."

• Change one variable per iteration so you can attribute the result — sharper rubric, or new instructions, or tighter task, never all three at once.

• Version the agent, not the rubric. Free iteration on the definition of done; pinned, rollback-able versions when the agent changes.

• Held-back eval cases are the regression check. The rubric grades this run; a set of known-good cases you didn't tune against tells you a new version didn't quietly break the old behavior — the agent equivalent of a test suite before you promote a deploy.

• Relative dates for anything on a schedule. The failure mode this whole post opened with.

None of those are product features. They're the habits that separate teams whose agents stay correct from teams whose agents demo well and drift.
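The held-back-cases habit can be sketched as a small regression check that runs before a new version is promoted. All names here are hypothetical. `candidate_v2` stands in for a new agent version, and the pass check is deliberately trivial.

```python
from typing import Callable


def regression_failures(
    cases: dict[str, str],
    run_agent: Callable[[str], str],
    meets_criteria: Callable[[str], bool],
) -> list[str]:
    """Run every held-back case through a candidate version; return the names that fail."""
    return [name for name, task in cases.items() if not meets_criteria(run_agent(task))]


# Held-back cases: inputs the team did not tune the rubric or prompt against.
held_back = {
    "no-changes-week": "Audit a repo with no dependency changes.",
    "one-major-bump": "Audit a repo where one dependency had a major version bump.",
}


def candidate_v2(task: str) -> str:
    # Stand-in for the new agent version; it mishandles the empty week.
    return "" if "no dependency changes" in task else "must-fix: upgrade to 3.0.0"


failures = regression_failures(held_back, candidate_v2, lambda out: bool(out.strip()))
print(failures)  # ['no-changes-week']
```

An empty list is the signal to promote; anything else names the old behavior the new version broke.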

The Bottom Line

For an organization, the model was never the bottleneck, and now the agent loop isn't either — Managed Agents runs it, in four API calls, for the long-running scheduled work it's built for.

What's left is the part that's actually yours: deciding, in writing, what "working" means for each agent, and checking it on every run, especially the runs no human watches. launch-your-agent makes that the default instead of the exception, and it puts the definition of done in a file your team can review and own.

Use Managed Agents where it fits — long-running, scheduled, stateful, no infra worth building. Skip it where a plain API call or Claude Code is the better tool. But wherever you run agents, the first question worth answering isn't "which model." It's "how will we know it's done?" — and the answer should be executable.

Related: Everyone's Building Agent Loops From Scratch. Anthropic Just Shipped the Entire Stack. covers the Managed Agents runtime this builds on, and Spec-Driven Agent Development makes the case that the spec — here, the rubric — is the real work.