Without guardrails, a single misconfigured agent in a loop can exhaust a month's API budget overnight. Here's the three-level defense — per-request caps, in-code token budgets, and dashboard spend limits — with working code.
In November 2025, two LangChain agents — an Analyzer and a Verifier — were deployed to help a team automate research. They worked fine in testing. The team shipped them and moved on.
Eleven days later, someone checked the billing dashboard.
The agents had been talking to each other continuously since launch. A classification error on day one caused the Verifier to return "retry with different parameters" — and the Analyzer had been dutifully retrying ever since. 11 days. 2.4 million API calls. $47,000.
Nobody noticed because the process appeared healthy. No crashes. No alerts. Just two agents in a loop, burning $4,000 a day in silence.
This isn't a horror story about bad engineers. It's what happens when agentic systems run without cost guardrails. And as teams move from "one-shot prompts" to "agents that run loops and call tools," the exposure grows exponentially.
The fix isn't complicated. It's three configuration changes, and most teams skip all three.
Why Agentic Loops Are a Different Risk Category
A regular API call has a known cost ceiling. A 1,000-token request at Sonnet rates costs roughly $0.015. You can reason about that.
An agent loop has no natural ceiling. The loop math works like this: each iteration adds to the context window (the growing conversation history gets sent on every call), the model reasons about what went wrong and generates a new plan, and if that plan also fails, the cycle repeats — with a slightly larger context each time, making each call marginally more expensive than the last.
A loop that runs 1,000 iterations isn't 1,000x the cost of the first call; it's 1,000x the average cost per call, and the compounding context pulls that average well above what the first call cost. A 10,000-token context on iteration 1 becomes 50,000 tokens by iteration 200.
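The compounding is easy to model. A back-of-the-envelope sketch, where the per-token prices and the growth rate are illustrative assumptions, not published rates:

```python
# Rough cost model for an agent loop whose context grows every iteration.
# Prices below are illustrative placeholders, not real Anthropic rates.
PRICE_PER_INPUT_TOKEN = 3 / 1_000_000    # assumed $3 per million input tokens
PRICE_PER_OUTPUT_TOKEN = 15 / 1_000_000  # assumed $15 per million output tokens

def loop_cost(iterations: int, initial_context: int = 10_000,
              growth_per_step: int = 200, output_per_step: int = 500) -> float:
    """Total dollar cost when each turn appends ~200 tokens of history."""
    total, context = 0.0, initial_context
    for _ in range(iterations):
        total += context * PRICE_PER_INPUT_TOKEN           # resend the history
        total += output_per_step * PRICE_PER_OUTPUT_TOKEN  # new reasoning
        context += growth_per_step                         # history keeps growing
    return total

print(f"first call alone: ${loop_cost(1):.2f}")
print(f"1,000 iterations: ${loop_cost(1000):.2f}")
```

With these assumptions, the 1,000-iteration total lands at roughly nine times the naive "first call x 1,000" estimate, and the context crosses 50,000 tokens right around iteration 200, matching the shape described above.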
The $47,000 incident involved just two agents. Teams running multi-agent pipelines — orchestrators spawning sub-agents, sub-agents calling tools that spawn more sub-agents — have surface area that compounds even faster.
96% of enterprises report AI costs exceeding initial estimates. 40% of agentic AI projects fail partly due to hidden cost overruns. The root cause in nearly every case: the team designed for what happens when the agent succeeds, not what happens when it gets stuck.
!A billing dashboard showing a $47,000 balance after an agent ran unchecked for 11 days
The Three-Level Defense
Cost control for agentic systems works best as defense in depth. Each layer catches what the layer above misses.
Level 1: Cap output at the API call level. The lowest-friction fix. It won't stop a runaway loop, but it prevents any single call from generating an unexpectedly large output.
Level 2: Track and enforce a token budget in code. This is the layer that actually stops runaway loops. Your agent wrapper tracks cumulative token usage across all calls in a task, and halts before exceeding a threshold.
Level 3: Set hard monthly/workspace spend limits in the Anthropic Console. The backstop. No matter what happens in your code, the platform enforces a ceiling.
Each layer alone is insufficient. A per-call cap doesn't stop 10,000 cheap calls. A code budget can be bypassed if the wrong code path is triggered. A dashboard limit only fires after the damage is done. Together, they make a runaway loop an engineering problem with a known worst-case cost, not a surprise.
!An office worker trying to carry a stack of documents that keeps growing, representing compounding context costs in an agent loop
Level 1: Per-Request max_tokens Cap
Every Claude API call should set max_tokens deliberately. The Messages API requires the parameter, but it's easy to copy a generously high value into every call: fine for interactive chat, dangerous for agents that call tools and loop.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # Hard cap per response
    messages=[
        {"role": "user", "content": "Analyze this dataset and return your findings."}
    ]
)
```
A cap of 1024–2048 tokens covers the vast majority of agent reasoning steps. If a response legitimately needs more (e.g., generating a long artifact), you can raise it per task type — but the point is to set it explicitly for each call rather than relying on the model's default behavior.
When a response hits the max_tokens limit, the API returns stop_reason: "max_tokens" instead of "end_turn". Your agent should treat this as a signal to log a warning, not to retry.
```python
if response.stop_reason == "max_tokens":
    logger.warning(f"Response truncated at max_tokens. Task: {task_id}")
    # Don't retry — truncated reasoning is still usable reasoning
    return response.content[0].text
```
!A close-up of a hand turning a physical dial labeled MAX_TOKENS, setting a hard output limit
Level 2: Per-Task Token Budget in Code
A token budget wrapper tracks cumulative token usage across all calls within a single agent task. When the task exceeds its budget, it stops rather than continuing to retry.
```python
import anthropic
import logging

logger = logging.getLogger(__name__)

class TokenBudget:
    def __init__(self, task_id: str, max_tokens: int):
        self.task_id = task_id
        self.max_tokens = max_tokens
        self.used_tokens = 0
        self.client = anthropic.Anthropic()

    def call(self, model: str, messages: list, max_response_tokens: int = 1024) -> str:
        remaining = self.max_tokens - self.used_tokens
        if remaining <= 0:
            raise BudgetExceededError(
                f"Task {self.task_id} exceeded {self.max_tokens} token budget "
                f"after {self.used_tokens} tokens used."
            )
        response = self.client.messages.create(
            model=model,
            max_tokens=min(max_response_tokens, remaining),
            messages=messages
        )
        self.used_tokens += response.usage.input_tokens + response.usage.output_tokens
        logger.info(f"Task {self.task_id}: {self.used_tokens}/{self.max_tokens} tokens used")
        return response.content[0].text

    @property
    def budget_remaining_pct(self) -> float:
        return (self.max_tokens - self.used_tokens) / self.max_tokens * 100

class BudgetExceededError(Exception):
    pass
```
Usage in an agent loop:
```python
budget = TokenBudget(task_id="research-agent-run-123", max_tokens=50_000)

for step in range(MAX_STEPS):
    try:
        result = budget.call(
            model="claude-sonnet-4-6",
            messages=conversation_history
        )
        if budget.budget_remaining_pct < 20:
            logger.warning(f"Task {budget.task_id} at 80% budget — forcing early exit")
            break
        # Continue loop logic...
    except BudgetExceededError as e:
        logger.error(str(e))
        # Return partial results, escalate to human, or fail gracefully
        break
```
The critical detail: BudgetExceededError should never trigger a retry. A retry on budget exhaustion is just a loop that burns twice the tokens. Catch it, log it, escalate it.
What's a reasonable budget per task? There's no universal number. Instrument your agents, measure actual p95 token usage per task type, then set the budget at roughly 2x p95. That gives headroom for legitimate edge cases without leaving the door open for infinite loops.
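One way to turn that measurement into a number. A sketch assuming you already log per-task token totals; the nearest-rank p95 and the 2x multiplier mirror the rule of thumb above:

```python
import math

def suggest_budget(observed_totals: list[int], multiplier: float = 2.0) -> int:
    """Budget = multiplier x p95 of observed per-task token totals.

    `observed_totals` is assumed to come from your own instrumentation,
    e.g. the cumulative usage a budget wrapper logs per task.
    """
    totals = sorted(observed_totals)
    rank = max(0, math.ceil(0.95 * len(totals)) - 1)  # nearest-rank p95
    return int(totals[rank] * multiplier)

# A week of (hypothetical) extraction tasks, most clustered near 20k tokens
samples = [18_000, 21_000, 19_500, 24_000, 22_000, 35_000, 20_500, 23_000]
print(suggest_budget(samples))
```

Recompute this periodically: as prompts and tools change, so does the p95, and a stale budget either throttles legitimate work or leaves slack for loops.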
!A digital dashboard showing a fuel gauge for an AI task budget, with a warning zone highlighted at 80%
Level 3: Workspace Spend Limits in Anthropic Console
The in-code token budget covers runtime loops. But it doesn't protect against:
• A deployment bug that bypasses your budget wrapper
• A misconfigured cron job that runs 100 parallel agent sessions
• An employee testing with production credentials
The Anthropic Console lets you set hard monthly spend limits at the organization and workspace level. When the limit is hit, the API returns an error — no more calls until the next calendar month (or until you manually raise the limit).
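Your agent code should anticipate that hard stop. A minimal retry-policy sketch; the specific status codes, and the assumption that a spend-limit rejection arrives as a 4xx error, are illustrative rather than documented behavior:

```python
# Illustrative error policy for an agent loop. Assumption: transient
# server-side failures are worth a bounded retry, while 4xx rejections
# (including a hit spend limit) are terminal. Retrying a spend-limit
# error just queues more guaranteed failures.
RETRYABLE_STATUSES = {429, 500, 502, 503, 529}  # transient: retry with backoff

def next_action(status_code: int, attempt: int, max_retries: int = 3) -> str:
    if status_code in RETRYABLE_STATUSES and attempt < max_retries:
        return "retry"     # transient failure: back off and try again
    if 400 <= status_code < 500:
        return "halt"      # terminal: spend limit, auth, bad request
    return "escalate"      # retries exhausted or unknown: page a human
```

Wire this into the except branch around your API calls: on "halt", return partial results and stop, rather than letting the loop hammer a limit that won't reset.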
Setup steps:
1. Log in to console.anthropic.com
2. Navigate to Settings → Usage limits
3. Set a Monthly spend limit for your organization
4. If you use multiple workspaces (recommended for staging/prod separation), open each workspace under Workspaces and set per-workspace limits
The typical configuration for a small production team:
• Production workspace: 80% of your expected monthly budget
• Staging workspace: 15% of your expected monthly budget
• Dev workspace: 5% of your expected monthly budget
The percentages won't hold for every team — adjust based on your actual usage split. The important part is that each workspace has a ceiling, not just the organization.
Alert thresholds: Anthropic sends email alerts at 75% and 90% of your spend limit by default. You can supplement this by polling the usage API in your monitoring system and alerting at 50% (warning) and 80% (critical). That gives you a window to investigate before the hard limit fires.
```bash
# Poll usage API (useful for custom alerting)
curl https://api.anthropic.com/v1/usage \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01"
```
!A person in a high-tech control room pulling a large emergency stop lever, representing a hard platform spend limit
Bonus: Route Tasks to the Right Model
The three levels above prevent runaway spend. Model routing reduces your baseline spend — which directly lowers the blast radius of any incident that does slip through.
As of April 2026, Anthropic's three production models sit at very different price points: Haiku is by far the cheapest per token, Sonnet is the mid-tier workhorse, and Opus is the most expensive.
The default mistake is using Sonnet for everything. A well-routed system uses Haiku for decisions that don't require reasoning depth (is this request valid? which tool should I call? summarize this output for the next step?) and reserves Sonnet or Opus for the actual work.
A simple routing pattern:
```python
def select_model(task_type: str) -> str:
    HAIKU = "claude-haiku-4-5-20251001"
    SONNET = "claude-sonnet-4-6"
    OPUS = "claude-opus-4-6"

    routing = {
        "classify": HAIKU,
        "extract": HAIKU,
        "summarize": HAIKU,
        "route": HAIKU,
        "generate_code": SONNET,
        "analyze": SONNET,
        "plan": SONNET,
        "reason_complex": OPUS,
        "architect": OPUS,
    }
    return routing.get(task_type, SONNET)  # Sonnet as safe default
```
Routing 30–40% of agent calls to Haiku can reduce baseline costs by 50–60%, depending on your workload mix. Lower baseline = lower maximum exposure in a loop incident.
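To see where numbers like that come from, here's a minimal blended-cost model. The per-MTok prices are hypothetical placeholders (check current pricing), and realized savings depend on how many tokens, not just calls, move to the cheaper model:

```python
# Blended monthly cost when a share of traffic moves to a cheaper model.
# Prices are hypothetical placeholders, not published rates.
SONNET_COST_PER_MTOK = 15.0  # assumed blended $/MTok
HAIKU_COST_PER_MTOK = 1.0    # assumed blended $/MTok

def blended_cost(total_mtok: float, haiku_share: float) -> float:
    """Cost of `total_mtok` million tokens with `haiku_share` on Haiku."""
    return (total_mtok * haiku_share * HAIKU_COST_PER_MTOK
            + total_mtok * (1 - haiku_share) * SONNET_COST_PER_MTOK)

baseline = blended_cost(100, haiku_share=0.0)  # everything on Sonnet
routed = blended_cost(100, haiku_share=0.4)    # 40% of tokens moved to Haiku
print(f"savings: {1 - routed / baseline:.0%}")
```

The exact percentage is sensitive to the price ratio and the token mix, which is why measured savings vary so much by workload; run the model against your own traffic split before promising a number.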
Try It Yourself
You can add all three layers to an existing agent in under an hour. Here's the sequence:
Step 1 — Add max_tokens to every API call (5 minutes)
Search your codebase for client.messages.create calls that are missing max_tokens. Add it to each one. Start conservative: 1024 tokens covers most agent reasoning steps.
```bash
# Find API calls that might be missing max_tokens
grep -rn "messages.create" . --include="*.py" | grep -v "max_tokens"
```
Step 2 — Wrap your agent loop with TokenBudget (20 minutes)
Copy the TokenBudget class above into your agent utilities. Wrap your main agent loop. Set the budget based on what you observe in practice — if you don't have data yet, start at 50,000 tokens for a typical task and adjust after a week of production usage.
Step 3 — Set workspace spend limits in Anthropic Console (5 minutes)
Go to console.anthropic.com/settings/limits. Set your monthly org limit. If you haven't created separate staging and production workspaces yet, do that first — it takes less than a minute and gives you spend isolation between environments.
Step 4 — Add 50% and 80% alert thresholds (10 minutes)
Anthropic sends alerts at 75% and 90% by default. Add a monitoring job that polls the usage API daily and fires a Slack alert at 50% spend. The 50% alert is where you investigate; the 75% alert from Anthropic is where you should already know what's happening.
```python
# Example: daily usage check (add to your monitoring cron)
import anthropic
import httpx

def check_spend_alert(threshold_pct: float = 0.5):
    # Fetch current usage via Anthropic Console API
    # Alert your team if spend_pct >= threshold_pct
    usage = fetch_current_usage()  # Your implementation
    if usage.spend_pct >= threshold_pct:
        send_slack_alert(
            f"API spend at {usage.spend_pct:.0%} of monthly limit "
            f"(${usage.current:.0f} of ${usage.limit:.0f})"
        )
```
Step 5 — Audit model usage (30 minutes)
Log the model name on every API call alongside the token count. After a week, look at the distribution. Any call using Sonnet for classification, routing, or summarization is a candidate to move to Haiku. Most teams find 25–40% of their Sonnet calls can move down.
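The audit itself can be a few lines. A sketch assuming your logs reduce to (model, task_type, total_tokens) rows; the row shape is an assumption, so adapt it to what your logging actually emits:

```python
from collections import defaultdict

def audit_model_usage(call_log: list[tuple[str, str, int]]) -> dict:
    """Sum tokens per (model, task_type) so expensive-model-on-trivial-work
    stands out. The (model, task_type, total_tokens) row shape is an
    assumption about your logging, not a required format."""
    usage = defaultdict(int)
    for model, task_type, tokens in call_log:
        usage[(model, task_type)] += tokens
    return dict(usage)

log = [
    ("claude-sonnet-4-6", "classify", 1_200),       # candidate to move to Haiku
    ("claude-sonnet-4-6", "classify", 900),
    ("claude-sonnet-4-6", "generate_code", 8_000),  # fine where it is
    ("claude-haiku-4-5-20251001", "summarize", 600),
]
for (model, task), tokens in sorted(audit_model_usage(log).items()):
    print(f"{model:28s} {task:15s} {tokens:>8,}")
```

Sorting the output by tokens instead of name surfaces the biggest candidates first; the classify rows above are exactly the Sonnet-on-trivial-work pattern worth moving down.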
The Incident That Didn't Happen
The teams who implement these three layers don't have $47,000 incidents. They have $2,000 incidents — or $200 ones. The spend limit fires, an alert goes out, someone investigates, the bug gets fixed.
The cost of not doing this is unlimited. Literally — an uncapped agent loop running on a Tier 4 account has no platform-enforced ceiling.
Five minutes to set a workspace limit. Twenty minutes to add a token budget. The asymmetry is not subtle.
Agent systems fail in production. That's a given. The question is whether they fail safely — with a known maximum cost and an alert that fires before anyone has to check the billing dashboard eleven days later.