1 Million Tokens Got the Headline. The Real Fix to Context Rot Is One API Parameter.

By Haim Ari · 2026-03-30T08:30:00

Anthropic's 1M token context window made headlines. But the architectural change that actually matters for long-running agents is hiding in a beta header: compact-2026-01-12. Here's what compaction does, why it's different from a bigger window, and how to use it today.

When Anthropic announced that Claude Opus 4.6 and Sonnet 4.6 would support 1 million token context windows — generally available, no surcharges, no waitlist — the reaction in engineering communities was predictable. A 5x expansion of working memory. Enough space to hold entire codebases, a year of structured logs, or a compliance documentation corpus for a mid-size organization. The posts celebrated.

What got less attention was a quiet beta feature that shipped at the same time: context compaction. You opt in with a single beta header: compact-2026-01-12. And if you're building agents that do real work over long sessions, it's the more important feature.

Here's why the 1M window, for all its impressiveness, doesn't solve the problem it's credited with solving — and what compaction does differently.

The Problem With "Just Give It More Space"

A previous post on context rot covered the mechanics: transformer models weight recent tokens more heavily than earlier ones. As sessions accumulate, architectural decisions made an hour ago compete with everything generated since. At 50% context utilization, quality starts to degrade. At 70%, you see hallucinations and contradictions.

A 1M token window doesn't eliminate this pattern. It moves the cliff.
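To make "moves the cliff" concrete, here is a minimal sketch that applies the 50% and 70% utilization figures quoted above to both window sizes. The function name and dictionary keys are illustrative, and the percentages are this post's rules of thumb rather than measured constants.

```python
def degradation_points(window_tokens: int) -> dict[str, int]:
    """Token counts at which the rule-of-thumb utilization thresholds are reached."""
    return {
        "quality_starts_to_degrade": window_tokens * 50 // 100,  # 50% utilization
        "hallucinations_likely": window_tokens * 70 // 100,      # 70% utilization
    }


for window in (200_000, 1_000_000):
    points = degradation_points(window)
    print(
        f"{window:>9,}-token window: degrades near "
        f"{points['quality_starts_to_degrade']:,}, "
        f"unreliable near {points['hallucinations_likely']:,}"
    )
```

The thresholds scale with the window, so the same pattern appears at 500K and 700K tokens instead of 100K and 140K. The shape of the problem is unchanged.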

With a 200K window, an active agent session hits quality degradation somewhere in the first hour of heavy tool use. With 1M tokens, that same degradation happens later — potentially much later. Opus 4.6 scores 78.3% on MRCR v2 at 1M tokens (MRCR tests entity and relationship tracking across enormous contexts), which means the model maintains meaningful coherence across a million tokens better than anyone expected. The cliff isn't as steep.

But it's still there. And when you hit it, you've spent more money getting there.

Processing 1M tokens costs the same per-token as processing 9K tokens — pricing stays flat. But time to first token scales with context length. A session that fills 900K tokens before degrading is not only expensive, it's slow. You've optimized for capacity and paid in latency.

There's also a subtler cost. Research on large context performance consistently finds the same pattern: poorly managed large contexts often underperform smaller, well-curated ones. The model processes the whole context, but its attention isn't uniformly distributed across a million tokens. Relevance gets diluted, not resolved.

The 1M window is genuine progress. For certain workloads — processing entire codebases in a single pass, analyzing large document sets, batch summarization — it removes a real architectural constraint that forced chunking logic and retrieval layers into pipelines that didn't need them.

But for long-running agents doing iterative work, it's the wrong frame for the problem.

[Image: A developer at a desk watching a context window fill up like a progress bar, with quality degrading as it approaches full]

What Compaction Actually Does

Compaction is a server-side mechanism. When enabled, the API monitors token usage per turn. When it crosses a configured threshold, it automatically generates a structured summary of the conversation — preserving what matters, discarding the accumulated noise — and replaces the older context with that summary. Subsequent turns continue from the summary, not the full history.

The compaction block becomes the new base. Everything before it is dropped. The model continues the session with a fresh, focused context that includes the substance of what happened without the accumulated weight of every intermediate exchange.
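The behavior described above can be pictured with a purely local toy model. This is not Anthropic's server-side implementation: `count_tokens` is a crude word count, `summarize` is a stand-in callable you supply, and the `"compaction"` role is a made-up marker for illustration only.

```python
from typing import Callable

Message = dict[str, str]


def count_tokens(messages: list[Message]) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return sum(len(m["content"].split()) for m in messages)


def compact_if_needed(
    messages: list[Message],
    threshold: int,
    summarize: Callable[[list[Message]], str],
) -> list[Message]:
    """Replace the whole history with one summary entry once it reaches the threshold."""
    if count_tokens(messages) < threshold:
        return messages  # below the trigger: history is left untouched
    # At or above the trigger: everything before the summary is dropped.
    return [{"role": "compaction", "content": summarize(messages)}]


history = [
    {"role": "user", "content": "rename the users table to accounts"},
    {"role": "assistant", "content": "done, migration 0042 renames users to accounts"},
]
compacted = compact_if_needed(
    history,
    threshold=10,
    summarize=lambda msgs: "users renamed to accounts in migration 0042",
)
print(compacted)
```

The later turns would be appended after the single summary entry, which is the sense in which the compaction block becomes the new base.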

This is meaningfully different from a bigger window. A bigger window holds more tokens; compaction actively curates which tokens remain. It is the difference between capacity and curation.

What gets preserved by default is determined by the model's summarization. But you can instruct the compaction by telling it what to prioritize, for example: "Focus on preserving code snippets, variable names, and technical decisions." That instruction shapes what the summary retains. For an agent doing code generation, that means the architectural choices, the variable names chosen, and the patterns established: precisely the things that context rot erodes.

Since the 1M context window shipped, there's been a 15% reduction in compaction events — the larger window means sessions hit the compaction threshold less frequently. But when compaction does trigger, the mechanism is the same: structured curation, not capacity extension.

[Image: Server-side compaction illustrated as a funnel: a long conversation history enters, a focused summary exits, and the agent continues from a clean slate]

Try It Yourself

Enabling compaction requires the beta header and a minimal change to how you structure API calls. Here's a working example for a long-running agent loop:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

messages = []


def run_agent_turn(user_message: str) -> str:
    """Send one user turn and return the assistant's text reply."""
    messages.append({"role": "user", "content": user_message})

    response = client.beta.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        context_management={
            "edits": [
                {
                    "type": "compact_20260112",
                    # Default trigger is 150K input tokens; the minimum is 50K.
                    "trigger": {
                        "type": "input_tokens",
                        "threshold": 120000,  # Trigger earlier than default for cleaner summaries
                    },
                    # What to preserve when summarizing
                    "instructions": "Focus on preserving code snippets, variable names, architectural decisions, and technical constraints. Summarize discussion briefly but keep all code blocks verbatim.",
                }
            ]
        },
        betas=["compact-2026-01-12"],
        messages=messages,
    )

    # After a compaction event the first block may not be a text block,
    # so collect the text blocks instead of indexing content[0].
    assistant_message = "".join(
        block.text for block in response.content if block.type == "text"
    )
    # Append the full content so any compaction block is sent back next turn.
    messages.append({"role": "assistant", "content": response.content})
    return assistant_message
```

Three parameters matter here:

The trigger threshold. Default is 150,000 tokens; minimum is 50,000. Setting it lower than default — say, 120K — means compaction fires before quality has a chance to degrade significantly. You're trading a slightly more frequent compaction cycle for cleaner summaries of still-coherent context.

The instructions. This is where you tell compaction what to preserve. Without instructions, the model summarizes based on general salience. With instructions tailored to your use case — "preserve all tool call results verbatim" for a data pipeline, "keep the current user's stated goals and all agreed decisions" for a project planning agent — you shape what survives the compaction event.

Pause after compaction. If you want to inspect the summary before the agent continues:

```python
# Pass this as the context_management argument to client.beta.messages.create
context_management = {
    "edits": [
        {
            "type": "compact_20260112",
            "trigger": {"type": "input_tokens", "threshold": 120000},
            "pause_after_compaction": True,  # Pause to inspect the summary
        }
    ]
}
```

When pause_after_compaction is True, the API returns after generating the summary. You can read the compaction block, verify it captured what matters, then continue. Useful during development; not necessary in production once you've tuned the instructions.
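The three parameters can be gathered into a small helper, with a second helper for reading the summary back out of a paused response. This is a sketch under stated assumptions: both function names are hypothetical, the field names follow this post's examples, and the shape of the compaction block (a `type` of `"compaction"` with a `content` string) is assumed rather than confirmed here, so check it against the current API reference.

```python
from types import SimpleNamespace
from typing import Any, Optional

MIN_TRIGGER_TOKENS = 50_000       # minimum quoted in this post
DEFAULT_TRIGGER_TOKENS = 150_000  # default quoted in this post


def build_compaction_edit(
    threshold: int = DEFAULT_TRIGGER_TOKENS,
    instructions: Optional[str] = None,
    pause_after_compaction: bool = False,
) -> dict[str, Any]:
    """Assemble one entry for context_management["edits"] using this post's field names."""
    if threshold < MIN_TRIGGER_TOKENS:
        raise ValueError(f"threshold must be at least {MIN_TRIGGER_TOKENS}")
    edit: dict[str, Any] = {
        "type": "compact_20260112",
        "trigger": {"type": "input_tokens", "threshold": threshold},
    }
    if instructions:
        edit["instructions"] = instructions
    if pause_after_compaction:
        edit["pause_after_compaction"] = True
    return edit


def find_compaction_summary(content_blocks: list[Any]) -> Optional[str]:
    """Return the summary of the first compaction block, or None if there is none.

    Assumption: compaction blocks expose type == "compaction" and a .content string.
    """
    for block in content_blocks:
        if getattr(block, "type", None) == "compaction":
            return getattr(block, "content", None)
    return None


# Offline usage with stand-in objects instead of a real API response.
edit = build_compaction_edit(120_000, "Keep all code blocks verbatim.", True)
fake_content = [SimpleNamespace(type="compaction", content="Summary: schema agreed.")]
print(edit)
print(find_compaction_summary(fake_content))
```

Validating the threshold locally surfaces a misconfigured trigger before the request is sent, and keeping the summary lookup in one place makes it easier to log summaries while tuning instructions.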

For Claude Code users specifically: If you're on Max, Team, or Enterprise plans, Opus 4.6 at 1M context is now the default model. The auto-compact feature triggers at approximately 95% capacity. To disable 1M context entirely and work with the 200K window (where compaction behavior is more predictable), set:

```bash
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
```

This removes 1M model variants from the model picker. Some developers prefer this during early compaction tuning — the 200K window has a shorter cycle, which means faster feedback on whether your compaction instructions are working.

[Image: A terminal showing the compaction API call with threshold and instructions parameters highlighted]

Compaction vs. Manual Context Management

The GSD workflow solves context rot by never accumulating it: each phase runs in a fresh context window, with structured artifacts as handoffs. The context passed to each agent is intentionally minimal — a plan document, not a conversation history.
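The fresh-context-per-phase idea can be sketched in a few lines. This is not the GSD tooling itself: `run_phases`, the `ModelCall` type, and the echoing `fake_model` stub are all hypothetical, chosen only to show that each phase sees a single handoff artifact and no prior history.

```python
from typing import Callable

# Stand-in for a model call: takes a fresh message list, returns the reply text.
ModelCall = Callable[[list[dict[str, str]]], str]


def run_phases(plan: str, phase_prompts: list[str], call_model: ModelCall) -> str:
    """Run each phase in a fresh context, handing off only a single artifact."""
    artifact = plan
    for prompt in phase_prompts:
        # A brand-new message list per phase: no history from earlier phases.
        fresh_context = [{"role": "user", "content": f"{prompt}\n\n{artifact}"}]
        artifact = call_model(fresh_context)
    return artifact


def fake_model(messages: list[dict[str, str]]) -> str:
    """Offline stub so the sketch runs without an API call."""
    assert len(messages) == 1  # every phase sees exactly one message
    first_line = messages[0]["content"].splitlines()[0]
    return f"artifact from {first_line}"


print(run_phases("plan v1", ["research", "implement"], fake_model))
```

Because the context is rebuilt from the artifact each time, its size is bounded by the artifact rather than by how long the overall job has been running.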

Compaction solves context rot reactively: let the session accumulate, then summarize. These aren't competing approaches — they apply to different architectures.

Interactive coding sessions that span hours benefit from compaction — there's no natural "phase" to restart at. Agent pipelines doing multi-step tool use over a single task benefit from compaction. RAG-heavy workflows processing large document sets may not need either.

The practical guidance: if your agent sessions regularly hit 100K+ tokens doing iterative work, add compaction. If you're building a structured agent pipeline with defined phases, GSD-style context management remains the cleaner architecture. If you're doing single-pass analysis of large corpora, the 1M window solves your actual problem.
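That guidance can be written down as a small decision function. The function name, the return labels, and the order in which the checks are applied are assumptions made for this sketch; only the 100K figure and the three categories come from the paragraph above.

```python
def recommend_strategy(
    typical_session_tokens: int,
    has_defined_phases: bool,
    single_pass_corpus: bool,
) -> str:
    """Map a workload description to one of the approaches discussed in this post."""
    if single_pass_corpus:
        return "1m_window"      # single-pass analysis of a large corpus
    if has_defined_phases:
        return "gsd_phases"     # structured pipeline with fresh context per phase
    if typical_session_tokens >= 100_000:
        return "compaction"     # long iterative sessions with no natural restart point
    return "no_change_needed"   # short sessions stay well inside the window


print(recommend_strategy(250_000, has_defined_phases=False, single_pass_corpus=False))
```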

[Image: Comparison table showing three approaches: GSD phases, compaction, and 1M window, each mapped to their ideal use case]

What Shifts Architecturally

The most important implication of compaction is that it changes the design constraint for long-running agents.

Before compaction, the architecture question was: "How do we avoid accumulating too much context?" Answers: phase-based workflows, sub-agents with scoped tasks, aggressive session resets.

With compaction, the question shifts to: "What do we need to preserve across compaction events?" The answer shapes your instructions, your agent's data structures, and how you format tool call results.

An agent that stores important state as structured tool call outputs, rather than in the conversational history, survives compaction better. The compaction summary captures what the model inferred happened, so details mentioned only in passing in conversational prose are retained less reliably than explicit, structured tool call results. If your agent needs to remember that the database migration ran successfully on table X with constraint Y, that should be in a tool result or an explicit state artifact, not buried in a response paragraph.
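One way to keep such state explicit is to emit it as a compact JSON payload inside a tool result. The sketch below uses the standard `tool_result` content block shape from the Messages API; the helper name, the tool-use id, and the state keys are hypothetical.

```python
import json


def state_tool_result(tool_use_id: str, state: dict) -> dict:
    """Wrap agent state as a tool_result content block with a JSON payload."""
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        # Sorted keys keep the payload stable across turns, which makes it easy to diff.
        "content": json.dumps(state, sort_keys=True),
    }


block = state_tool_result(
    "toolu_example_01",  # hypothetical id; in practice, copy it from the tool_use block
    {"migration": "ok", "table": "X", "constraint": "Y"},
)
print(block["content"])
```

A structured payload like this gives the compaction instructions something precise to point at, such as "preserve all tool call results verbatim".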

This is the architectural shift compaction introduces: from "minimize accumulated context" to "structure context for graceful compaction." Both are context engineering disciplines. But the second one scales to longer sessions.

[Image: An agent pipeline diagram showing state stored as tool outputs rather than in conversational history, surviving a compaction event intact]

The Bottom Line

The 1M token window moved the capability ceiling. Compaction changes the architecture floor.

For workloads where you genuinely need to process large static corpora — an entire codebase, a year of logs, a documentation set — the 1M window removes a real constraint. Use it.

For long-running interactive agents — the kind doing tool use, iterating on problems, accumulating state across many turns — compaction is the more important feature. It's Anthropic's first automated answer to context rot: instead of asking developers to manage context manually through phase resets and structured handoffs, the API does the curation for you.

It's not perfect. A summary is a lossy representation of the conversation it replaced. The instructions you provide shape what's preserved, but no instruction set is complete. Manual phase management — when it's appropriate to the workflow — still produces cleaner context because it was never accumulated in the first place.

But for the large category of agent sessions that aren't structured around phases, compaction is a real engineering lever. Set the threshold early. Write specific instructions for your domain. Structure your agent's state as tool outputs rather than conversational prose. And monitor what the summaries look like in production; the pause_after_compaction option exists precisely for this tuning phase.

The headline was the million tokens. The work is in the instructions.

Related: At 50% Context, Your AI Starts Cutting Corners. Here's the Fix. — the GSD approach to proactive context management