Anthropic Built a Second AI to Watch the First One. That's What Auto Mode Actually Is.

By Haim Ari · 2026-03-26T09:00:00

Claude Code's new Auto Mode isn't just a convenience toggle—it's a second classifier model running in the background, reviewing every action your AI agent takes before it executes. Here's what the architecture reveals about where agentic development is heading.

Every developer who has used Claude Code long enough has faced the same decision.

You're three hours into a refactor that touches forty files. Claude is doing good work, but every two minutes it stops and asks: Can I run this shell command? Can I edit this file? Can I make this network request? You've said yes forty times in a row. You know what it's going to do. You want it to just keep going.

So you do what experienced Claude Code users do: you enable --dangerously-skip-permissions and let it run.

The flag name is not subtle. Anthropic named it that on purpose. And yet it became the de facto solution for anyone doing serious long-running work with Claude Code, because the alternative—getting interrupted every ninety seconds—made autonomous development effectively impossible.

On March 24, 2026, Anthropic shipped the fix. But "fix" undersells what they actually built.

The Problem With the Binary Choice

Before Auto Mode, Claude Code offered you a choice that wasn't really a choice: full supervision or full autonomy.

Default mode means Claude asks before every action. You review each edit, approve each shell command, confirm each network request. For exploratory work on unfamiliar code, this is valuable. For a six-hour refactoring session on code you own? It's a session killer.

bypassPermissions (the mode that --dangerously-skip-permissions maps to) flips the switch entirely. Claude executes everything—file edits, shell commands, network requests, anything—without asking. No checks, no guardrails, no oversight. It's faster, but if Claude misreads an instruction and decides to rm -rf something, nothing stops it.

The safety advice was always "only use this in isolated containers and VMs." Realistically, developers were running it on their actual machines, on actual codebases, because the work required it.

Auto Mode Is Not What the Name Implies

When I first read the announcement, I expected Auto Mode to be a smarter version of acceptEdits—maybe some heuristics about which commands looked safe. What Anthropic actually built is architecturally more interesting.

Auto Mode runs a second AI on every action before it executes.

Not a rule engine. Not a pattern matcher. A separate classifier model—Claude Sonnet 4.6—that receives the conversation context and the pending action, reasons about whether it matches what you asked for, and decides whether to allow or block it.

The flow looks like this:

You give Claude a task

Claude determines it needs to run a shell command

Before that command executes, the classifier gets the conversation history and the pending command

The classifier evaluates: does this action match what the user asked for, or does it look like an escalation, a mistake, or something driven by hostile content Claude encountered?

If safe: the command runs without interrupting you

If risky: Claude is told why it was blocked and must find a different approach

The cost implication is real—classifier calls count toward your token usage. But the tradeoff is clear: you pay a small premium on shell commands and network requests, not on file reads or edits in your working directory (those are auto-approved without a classifier call).

What "Trust Boundary" Actually Means

The classifier doesn't try to evaluate every action in the abstract. It works from a concrete trust model.

Trusted by default:

• Your working directory and its contents

• Your git repo's configured remotes

• Installing dependencies already declared in your lock files

• Read-only HTTP requests

• Pushing to the branch you started on (or one Claude created)

Not trusted until you say so:

• Your company's other GitHub orgs

• Internal services and APIs

• Cloud storage buckets

• Production infrastructure

This distinction matters. If you're working in a personal project and ask Claude to push to your repo, it'll do it. If you're in a work repo and Claude tries to push to a different org's remote, the classifier treats that as external and blocks it.

Everything else falls into a list of defaults that will block:

• curl | bash patterns (downloading and executing code)

• Sending data to external endpoints

• Production deploys and migrations

• Mass deletion on cloud storage

• Granting IAM or repo permissions

• Modifying shared infrastructure

• Force push or pushing directly to main

• Irreversibly destroying files from before the session

You can see the full default rules with claude auto-mode defaults. And if your team needs to add trusted infrastructure—your company's S3 bucket, your internal GitLab instance, your staging environment—admins can configure that via autoMode.environment in managed settings.

The Part About Subagents That Changes Everything

The most interesting piece of the Auto Mode architecture isn't what it does with single actions—it's what it does with subagents.

When Claude spawns a subagent (to parallelize work, to delegate a specific task, to run a specialized workflow), two checks happen:

Check 1: At spawn time. The classifier reviews the task description before the subagent starts. A task like "delete all remote branches matching this pattern" gets blocked before the subagent ever runs.

Check 2: After completion. When the subagent finishes, the classifier reviews its entire action history. The reasoning: a subagent that looked safe when spawned could have been compromised mid-run by hostile content it encountered in a file or web page.

If the post-run check flags something, a security warning is prepended to the subagent's results. The main agent can then decide how to proceed with the context that this subagent's results may not be trustworthy.

This is prompt injection defense baked into the agent orchestration layer. It's not perfect—the announcement acknowledges the classifier will sometimes miss things and sometimes block things it shouldn't—but it's a fundamentally different approach to agentic security than "hope Claude doesn't read anything malicious."

How to Actually Enable It

Auto Mode is currently in research preview on Team plans, with Enterprise and API rollout happening in the days following launch. Haiku and Claude 3 models don't support it—you need Sonnet 4.6 or Opus 4.6.

The setup path depends on how you're running Claude Code:

In the CLI:

`bash

Enable for a single session

claude --enable-auto-mode

Set as your default

Add to your settings file:

{

"permissions": {

"defaultMode": "auto"

}

During a session, cycle with Shift+Tab

default → acceptEdits → plan → auto

In VS Code:

Enable "Allow dangerously skip permissions" in the extension settings (I know, confusing name), then select "Auto" in the mode dropdown.

In the Desktop app:

Enable in Desktop settings, then use the mode selector next to the send button.

On Team and Enterprise, an admin must first enable it in Claude Code admin settings before users can turn it on. For orgs that want to prevent it entirely, permissions.disableBypassPermissionsMode: "disable" in managed settings blocks it.

The Fallback Design Tells You Something Important

One architectural decision worth noting: what happens when the classifier keeps blocking things?

If the classifier blocks an action 3 times in a row, or 20 times total in a session, Auto Mode pauses and Claude Code falls back to prompting for each action. These thresholds aren't configurable.

The reasoning: repeated blocks usually mean one of two things. Either the task genuinely requires actions the classifier won't allow (in which case you should probably think twice about doing them autonomously). Or the classifier is missing context about your trusted infrastructure and treating safe actions as risky—in which case, configure your trusted environments.

This fallback-to-prompt design is careful. In interactive mode, you see a notification and can approve the blocked action to reset the counters. In non-interactive mode (-p flag), the session aborts entirely—there's no user to prompt.

This matters for teams thinking about using Auto Mode in CI pipelines. It's not designed for unattended production automation. It's designed for the developer sitting at their machine who wants to stop approving obvious things while retaining oversight for the genuinely risky stuff.

What This Architecture Reveals

The interesting question isn't "is Auto Mode safe enough to use." It's: why did Anthropic build it this way?

They could have built a rule-based system—a list of patterns that are auto-approved and a list that are blocked. Many teams have built exactly that kind of wrapper around --dangerously-skip-permissions. It would have been faster to ship and easier to reason about.

Instead, they built a second model that reasons in prose about context and intent. The classifier doesn't match patterns; it evaluates whether an action is "an overeager escalation beyond what you asked for, a mistake about what's safe to touch, or a sudden departure from your stated intent that suggests Claude may have been steered by something it read."

That framing—three distinct failure modes, each requiring different reasoning—tells you something about how Anthropic thinks about agentic safety at scale. Pattern matching can stop rm -rf /. It can't distinguish between Claude appropriately pushing to a release branch and Claude being steered by a malicious file to push to main.

The other revealing detail: the classifier never sees tool results. It only sees your messages and Claude's tool calls—not the content of files Claude read, web pages it visited, or command outputs. That's not an accident. It means hostile content in the environment can't manipulate the classifier directly, even if it could manipulate Claude's behavior.

The Bottom Line

Auto Mode solves the actual problem that --dangerously-skip-permissions was a workaround for: how do you let an AI agent work autonomously on a long task without either drowning in approval dialogs or throwing away all oversight?

The answer Anthropic landed on is architecturally honest: you don't eliminate oversight, you automate it with a second model. A model that never sees the environmental content that could compromise the first one, that checks both at spawn time and at completion for subagents, and that falls back to human review when it's uncertain.

This won't be the last iteration. The announcement explicitly calls it a research preview. Classifier accuracy will improve. Trusted infrastructure configuration will get easier. The fallback thresholds will probably become configurable.

But the architecture—two models in a trust relationship, with environmental isolation between them—is almost certainly the direction the whole agentic stack is heading.

The era of "just add --dangerously-skip-permissions` and hope for the best" is ending. What replaces it is more interesting.

Want to see how this fits with the broader agentic security picture? The hooks-based guardrail system I wrote about in Your AI Does Exactly What You Ask. That's the Problem. is the complementary piece — hooks give you deterministic pre/post checks that Auto Mode's classifier can't provide.