Teams running agentic systems on a single model pay 3–7x more than they need to. A three-tier routing strategy — Haiku for cheap tasks, Sonnet for execution, Opus for orchestration — cuts API costs by 60–75% without sacrificing quality. Here's the routing logic, with actual code.
Here's a bill most engineering teams don't expect: you deploy an agent that handles customer questions, runs background analysis, and occasionally writes code. It costs $800 a month in testing. You scale it up. Now it costs $6,400 a month. Nothing changed except volume.
That 8x cost jump is almost never about volume alone. It's about model selection — or the lack of it.
Most teams pick one model and route everything through it. Sonnet for everything, or Opus for "the important stuff" which somehow becomes everything. The agent classifies a ticket (3 sentences in, 1 sentence out), runs a multi-step code review (complex reasoning, 2,000 tokens out), and summarizes a Slack thread (simple extraction, 200 tokens out) — all on the same model, at the same price.
That's like renting a forklift to carry a coffee cup because you also need to move pallets.
## The Three Tiers, and What They Actually Cost
Claude's model family in 2026 has three distinct tiers. For cost routing, the number that matters most is the output price — output tokens are billed at several times the input rate, so generation is where the real money goes. Haiku at $5/M output versus Opus at $25/M output is a 5x difference on every task.
Now do the math on a real workload. Say your agent team generates 100 million output tokens a month (a realistic number for a team running background pipelines):
- All Opus: 100M × $25/M = $2,500/month
- Smart routing (70% Haiku, 20% Sonnet, 10% Opus):
  - 70M × $5 = $350
  - 20M × $15 = $300
  - 10M × $25 = $250
  - Total: $900/month
That's a 64% cost reduction before you touch caching or the Batch API. Layer in prompt caching (a 90% discount on repeated system prompts, most of which hit Haiku) and the input side of the bill shrinks too — roughly $750/month all-in for this workload, a 70% reduction, without changing a single line of agent logic.
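The same arithmetic can be expressed as a blended per-million-token rate — a quick sanity check you can run on any proposed routing mix. This sketch uses the 70/20/10 split and the per-model output prices from the example above:

```python
def blended_output_rate(mix: dict, prices: dict) -> float:
    """Weighted-average $/M output tokens for a routing mix.

    mix: fraction of output tokens sent to each tier (sums to 1.0).
    prices: $ per million output tokens for each tier.
    """
    return sum(mix[tier] * prices[tier] for tier in mix)

PRICES = {"haiku": 5.0, "sonnet": 15.0, "opus": 25.0}  # $/M output

rate = blended_output_rate({"haiku": 0.70, "sonnet": 0.20, "opus": 0.10}, PRICES)
print(f"${rate:.2f}/M blended vs $25.00/M all-Opus")  # $9.00/M blended
```

At $9/M blended versus $25/M all-Opus, the 64% reduction falls straight out of the mix — no per-task accounting needed.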
*[Image: Three-tier cost comparison — Haiku, Sonnet, Opus pricing tiers shown as a routing pyramid with example tasks at each level]*
## The Routing Decision: What Goes Where
The mistake most teams make is treating model selection as a one-time deployment decision. It should be a per-task runtime decision.
Here's the decision framework:
**Use Haiku when:**

- The task is a yes/no or category classification
- You're extracting structured data from unstructured text
- You're summarizing content under 2,000 words
- You're doing high-frequency operations (classifying every incoming request, routing tickets, tag extraction)
- Latency matters more than nuance
- You need to call the model more than 50 times per user action
**Use Sonnet when:**

- You're writing or reviewing code (Sonnet 4.6 scores 79.6% on SWE-bench Verified)
- You're analyzing requirements or writing tests
- You need multi-step reasoning but not novel problem-solving
- You're doing the bulk of your agent execution tasks
- You want the best cost-to-quality ratio for "standard" work
**Use Opus when:**

- You're designing multi-agent architectures or long-horizon plans
- An agent is orchestrating other agents
- The task requires genuine novel reasoning that Sonnet gets wrong
- The cost of an error is high (security analysis, architectural decisions, complex refactors)
- You're dealing with context that exceeds Sonnet's reliable reasoning range
A practical rule of thumb: Sonnet handles 80% of what you actually need. Haiku handles 15% (the cheap, high-frequency tasks). Opus handles the remaining 5% (the expensive, high-stakes tasks). If your Opus usage is above 20% of your total calls, something is miscategorized.
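That 20% threshold is easy to check mechanically. Here's a sketch of a mix audit, assuming you can tally calls per model from your own request logs (the counts below are illustrative):

```python
def audit_call_mix(counts: dict) -> dict:
    """Report each tier's share of total calls and flag Opus overuse."""
    total = sum(counts.values())
    shares = {tier: n / total for tier, n in counts.items()}
    return {
        "shares": {tier: f"{share:.0%}" for tier, share in shares.items()},
        # Rule of thumb from above: >20% Opus means something is miscategorized
        "opus_overused": shares.get("opus", 0) > 0.20,
    }

# Illustrative tallies pulled from a week of logs
report = audit_call_mix({"haiku": 1200, "sonnet": 6400, "opus": 2400})
print(report)  # opus share = 24% → flagged
```

Run it weekly; a drifting Opus share is usually the first sign that someone routed a new pipeline to "the important model" by default.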
*[Image: Decision flowchart showing routing logic — classification/extraction → Haiku, standard execution → Sonnet, orchestration/complex reasoning → Opus]*
## Try It Yourself

### The Routing Wrapper
Here's a minimal router you can drop into any agent system. It takes a task description and returns the right model ID:
```python
from anthropic import Anthropic

HAIKU = "claude-haiku-4-5-20251001"
SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-6"

# Classification keywords that signal Haiku-appropriate tasks
HAIKU_SIGNALS = [
    "classify", "extract", "summarize", "tag", "route",
    "categorize", "label", "detect", "filter", "score"
]

# Signals for Opus-level tasks
OPUS_SIGNALS = [
    "architect", "design system", "orchestrate", "refactor entire",
    "novel", "multi-agent", "long-horizon", "security review"
]

def route_model(task_description: str) -> str:
    """Return the optimal model ID for a given task description."""
    task_lower = task_description.lower()
    if any(signal in task_lower for signal in OPUS_SIGNALS):
        return OPUS
    if any(signal in task_lower for signal in HAIKU_SIGNALS):
        return HAIKU
    return SONNET  # Default: Sonnet handles the majority

def run_task(task_description: str, prompt: str) -> str:
    """Run a task on the optimal model."""
    client = Anthropic()
    model = route_model(task_description)
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    print(f"[routing] {task_description[:40]} → {model.split('-')[1]}")
    return response.content[0].text
```
Usage:
```python
# This runs on Haiku (~$0.005 per call)
result = run_task(
    "classify support ticket",
    "Classify this ticket as billing, technical, or general: 'I can't log in'"
)

# This runs on Sonnet (~$0.015 per call)
result = run_task(
    "write unit tests for auth module",
    "Write pytest tests for this login function: ..."
)

# This runs on Opus (~$0.025 per call)
result = run_task(
    "architect a multi-agent data pipeline",
    "Design the agent topology for processing 10K documents/hour..."
)
```
### Add Prompt Caching (90% Off Repeated Prompts)
If your agents use a consistent system prompt (they almost certainly do), prompt caching on Haiku cuts input costs to $0.10/M on cache hits — a 90% discount:
```python
response = client.messages.create(
    model=HAIKU,
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": YOUR_SYSTEM_PROMPT,  # Your shared system prompt
            "cache_control": {"type": "ephemeral"}  # Cache this block
        }
    ],
    messages=[{"role": "user", "content": task_input}]
)
```
The cache persists for 5 minutes. For high-frequency classification tasks hitting the same system prompt, cache hit rates above 80% are typical.
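Your effective input price is a function of that hit rate. A quick way to estimate it, assuming Haiku's standard input price is $1/M (consistent with the 90% discount to $0.10/M quoted above) — note that cache *writes* carry a premium over the base rate in practice, which this sketch ignores:

```python
def effective_input_rate(hit_rate: float,
                         base: float = 1.00,     # Haiku standard input, $/M (assumed)
                         cached: float = 0.10    # cache-hit input, $/M
                         ) -> float:
    """Blended $/M input cost for a given cache hit rate.

    Ignores the cache-write premium on misses, so treat the result
    as a slightly optimistic estimate.
    """
    return hit_rate * cached + (1 - hit_rate) * base

# At the ~80% hit rate typical for high-frequency classification:
print(f"${effective_input_rate(0.80):.2f}/M")  # $0.28/M vs $1.00/M uncached
```

Even at a modest 50% hit rate, the blended input price drops by nearly half — caching pays for its setup cost almost immediately on repetitive workloads.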
### Use the Batch API for Non-Urgent Work
For tasks that don't need a response in under 60 seconds (background analysis, nightly reports, bulk processing), the Batch API cuts prices in half:
```python
from anthropic import Anthropic

client = Anthropic()

# Submit a batch of 100 classification tasks
# (HAIKU and tickets defined as above)
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"task-{i}",
            "params": {
                "model": HAIKU,
                "max_tokens": 256,
                "messages": [{"role": "user", "content": tickets[i]}]
            }
        }
        for i in range(len(tickets))
    ]
)
print(f"Batch submitted: {batch.id}")
# Poll for results — typically completes within 1 hour
```
Haiku via Batch API: $0.50/M input, $2.50/M output. That's 10x cheaper than Opus standard pricing for bulk tasks.
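To put those rates in concrete terms, here's a sketch that prices a bulk job at the Haiku Batch API rates quoted above (the per-task token counts are illustrative assumptions):

```python
def haiku_batch_cost(n_tasks: int, in_tokens: int, out_tokens: int) -> float:
    """Estimated $ cost of a bulk job at Haiku Batch API rates."""
    IN_RATE, OUT_RATE = 0.50, 2.50  # $/M tokens, batch pricing
    total_in = n_tasks * in_tokens
    total_out = n_tasks * out_tokens
    return (total_in * IN_RATE + total_out * OUT_RATE) / 1_000_000

# 10,000 ticket classifications, ~500 tokens in / ~50 tokens out each
print(f"${haiku_batch_cost(10_000, 500, 50):.2f}")  # $3.75 for the whole job
```

Ten thousand classifications for under four dollars is the kind of number that makes "just batch it overnight" the default answer for anything non-interactive.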
### Calculate Your Actual Savings
Run this against your existing API logs to see what routing would save you:
```python
def calculate_routing_savings(total_output_tokens: int) -> dict:
    """Compare all-Opus vs smart routing cost for a given output token volume."""
    # Assume distribution: 70% Haiku, 20% Sonnet, 10% Opus
    haiku_tokens = int(total_output_tokens * 0.70)
    sonnet_tokens = int(total_output_tokens * 0.20)
    opus_tokens = int(total_output_tokens * 0.10)
    all_opus_cost = total_output_tokens * 25 / 1_000_000
    routing_cost = (
        haiku_tokens * 5 / 1_000_000 +
        sonnet_tokens * 15 / 1_000_000 +
        opus_tokens * 25 / 1_000_000
    )
    return {
        "all_opus_monthly": f"${all_opus_cost:,.2f}",
        "smart_routing_monthly": f"${routing_cost:,.2f}",
        "monthly_savings": f"${all_opus_cost - routing_cost:,.2f}",
        "reduction_pct": f"{(1 - routing_cost / all_opus_cost) * 100:.0f}%"
    }

# Example: 100M output tokens/month
print(calculate_routing_savings(100_000_000))
# → monthly_savings: $1,600.00, reduction_pct: 64%
```
Pull your actual output token counts from your Anthropic Console usage dashboard (Settings → Usage), plug in the number, and see where you stand.
*[Image: Cost savings dashboard showing before/after routing comparison with monthly savings calculation]*
## The One Thing That Trips Teams Up
The routing logic above is intentionally simple — keyword matching. In practice, you'll want to make it smarter: use a Haiku call itself to classify what model the next call should use (the meta-routing pattern), or add confidence thresholds that escalate to Sonnet if the task description is ambiguous.
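Here's a minimal sketch of that meta-routing pattern: a cheap Haiku call names the tier, and a strict parser falls back to Sonnet on any ambiguous answer. The model IDs match the router above; the prompt wording is an assumption, not a tested template:

```python
MODEL_IDS = {
    "haiku": "claude-haiku-4-5-20251001",
    "sonnet": "claude-sonnet-4-6",
    "opus": "claude-opus-4-6",
}

META_PROMPT = (
    "Reply with exactly one word — haiku, sonnet, or opus — naming the "
    "cheapest model tier that can handle this task:\n\n{task}"
)

def parse_tier(answer: str) -> str:
    """Map the classifier's one-word reply to a model ID; default to Sonnet."""
    tier = answer.strip().lower()
    return MODEL_IDS.get(tier, MODEL_IDS["sonnet"])

def meta_route(client, task: str) -> str:
    """Ask Haiku which tier the next call should use.

    client: an anthropic.Anthropic instance.
    """
    response = client.messages.create(
        model=MODEL_IDS["haiku"],
        max_tokens=8,
        messages=[{"role": "user", "content": META_PROMPT.format(task=task)}],
    )
    return parse_tier(response.content[0].text)
```

The Sonnet fallback is deliberate: an unparseable routing answer should land on the safe middle tier, never silently escalate to Opus.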
But the most common mistake isn't overly simple routing. It's routing that never gets built at all.
Teams start with a prototype using Sonnet, it works, and they never revisit the model choice. By the time someone looks at the billing dashboard and asks "why are we paying this much?", the pattern is baked into six different pipelines.
The routing wrapper above takes 30 minutes to integrate. The savings on a team running 100M output tokens/month pay for that time within the first day.
## The Bottom Line
Model selection isn't a deployment decision. It's a per-task decision that compounds with every call your agents make.
The gap between a team that routes intelligently and one that doesn't isn't subtle — it's 64–75% of their API bill, month over month. At scale, that's the difference between a cost center and a sustainable AI engineering practice.
Start with Sonnet as your default. Add Haiku for classification and high-frequency tasks. Add Opus for the genuinely hard problems. Then add prompt caching on your system prompts and the Batch API for non-urgent workloads. That's the full stack, and it fits in an afternoon.
For context on preventing runaway costs in agent loops — a related but distinct problem — see *A Buggy Agent Loop Will Drain Your API Budget in Hours*.