At ~70% context capacity, stop and summarize. Keep the last 5 messages verbatim. Preserve hypothesis, decisions, and pending actions. Discard raw tool outputs. This compaction strategy keeps long-running agents coherent across hundreds of turns.
Every LLM has a context window limit. For long-running agents, this isn’t a theoretical concern — it’s something you hit regularly. A 25-turn investigation phase where each turn includes tool results can easily approach 100,000 tokens before the agent finishes its work.
The naive solutions are both wrong.
The right solution is compaction: summarize the older conversation while preserving recent context verbatim.
Estimate token usage by counting characters (divide by 4 for a rough token estimate). When this estimate exceeds a threshold — typically 70% of the model’s context window — trigger compaction.
import Anthropic from '@anthropic-ai/sdk';

// Assuming the Anthropic SDK's message type; swap in your provider's equivalent.
type MessageParam = Anthropic.Messages.MessageParam;

// Rough character count across text content, tool inputs, and tool results.
// JSON length is good enough for an estimate.
function extractChars(msg: MessageParam): number {
  return JSON.stringify(msg.content).length;
}

function estimateTokens(messages: MessageParam[]): number {
  let chars = 0;
  for (const msg of messages) {
    // count text content, tool inputs, tool results
    chars += extractChars(msg);
  }
  // ~4 characters per token is a serviceable rough estimate
  return Math.ceil(chars / 4);
}

export function shouldCompact(
  messages: MessageParam[],
  maxContextTokens: number,
  threshold: number // e.g. 0.7
): boolean {
  return estimateTokens(messages) > maxContextTokens * threshold;
}
The 70% threshold gives the compaction process room to operate — you need tokens available to run the summarization call itself.
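Wired into the agent loop, the check might look like the sketch below. The 200k-token window and the compactMessages helper are assumptions for illustration (compactMessages is sketched later on).

const MAX_CONTEXT_TOKENS = 200_000; // assumed context window for the model in use
const COMPACTION_THRESHOLD = 0.7;

async function prepareMessages(messages: MessageParam[]): Promise<MessageParam[]> {
  if (shouldCompact(messages, MAX_CONTEXT_TOKENS, COMPACTION_THRESHOLD)) {
    // Summarize older turns before making the next model call.
    return compactMessages(messages);
  }
  return messages;
}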
The key insight: not all context is equally valuable.
Keep the last 5 messages verbatim, and summarize everything older.
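A minimal sketch of that split, keeping a five-message tail (the helper name is illustrative):

const KEEP_VERBATIM = 5;

function splitForCompaction(messages: MessageParam[]): {
  older: MessageParam[];  // candidates for summarization
  recent: MessageParam[]; // kept verbatim
} {
  const cut = Math.max(0, messages.length - KEEP_VERBATIM);
  return { older: messages.slice(0, cut), recent: messages.slice(cut) };
}

One practical caveat: if the cut point lands between a tool_use block and its tool_result, nudge it so the pair stays on the same side of the boundary, since an orphaned tool result may be rejected by the API.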
The summarizer prompt tells the LLM exactly what matters:
Preserve:
- Current hypothesis and analysis conclusions
- Files examined and their relevance
- Decisions made and rationale
- Action items completed and pending
- Key code patterns or bugs identified
Discard:
- Raw tool outputs and search results
- Intermediate reasoning that led to final conclusions
- Duplicate information
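One way to run that summarization call, sketched against the Anthropic Messages API. The model name and the transcript-flattening step are assumptions, not the author's exact implementation:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const SUMMARIZER_SYSTEM = `You are compacting an agent's conversation history.
Preserve: current hypothesis and conclusions, files examined and their relevance,
decisions and rationale, completed and pending action items, key code patterns or bugs.
Discard: raw tool outputs, intermediate reasoning, duplicate information.`;

async function summarizeOlder(older: MessageParam[]): Promise<string> {
  // Flatten the older turns into plain text for the summarizer call.
  const transcript = older
    .map((m) => `${m.role}: ${typeof m.content === 'string' ? m.content : JSON.stringify(m.content)}`)
    .join('\n\n');

  const response = await client.messages.create({
    model: 'claude-sonnet-4-5', // assumption: any capable model works here
    max_tokens: 1024,
    system: SUMMARIZER_SYSTEM,
    messages: [{ role: 'user', content: transcript }],
  });

  const first = response.content[0];
  return first?.type === 'text' ? first.text : '';
}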
After summarization, replace the discarded messages with two synthetic turns:
const compactedMessages: MessageParam[] = [
  {
    role: 'user',
    content: `[CONTEXT COMPACTION — Previous conversation summarized]\n\n${summaryText}`,
  },
  {
    role: 'assistant',
    content: 'Understood. I have the summarized context and will continue from here.',
  },
  ...recentMessages, // last 5 verbatim
];
The synthetic assistant acknowledgment is important. Without it, the message history would have two consecutive user turns, which some models reject or handle poorly.
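Tying the sketches together, the compactMessages helper referenced in the loop example above might look like this (again a sketch built from the pieces shown so far):

async function compactMessages(messages: MessageParam[]): Promise<MessageParam[]> {
  const { older, recent } = splitForCompaction(messages);
  if (older.length === 0) return messages; // nothing old enough to summarize

  const summaryText = await summarizeOlder(older);

  return [
    {
      role: 'user',
      content: `[CONTEXT COMPACTION — Previous conversation summarized]\n\n${summaryText}`,
    },
    {
      role: 'assistant',
      content: 'Understood. I have the summarized context and will continue from here.',
    },
    ...recent, // last 5 verbatim
  ];
}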
A raw investigation turn with 10 file reads and 5 search results might be 15,000 tokens. The summary of that same content — the files found, what was relevant, the hypothesis formed — is typically 300–500 tokens. That’s a 30–50x compression ratio on older context.
Over a 25-turn run with two compaction events, you can reclaim 40,000–60,000 tokens of context window that would otherwise be occupied by raw tool output the agent no longer needs.
Compaction is lossy by design. The risk is that the summarizer drops a specific detail the agent will need later — an exact line number, a specific variable name, a file path.
Several mitigations help.
The scratchpad is the most important mitigation. It’s the agent’s working memory — a place to write “root cause confirmed: bug is in FooController.php line 142, the $id param is not cast to int before the query.” That fact survives any compaction event because it lives outside the message history.
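A scratchpad can be as simple as an append-only note store that gets rendered into the system prompt every turn. The names below are illustrative, not from any particular library:

class Scratchpad {
  private notes: string[] = [];

  add(note: string): void {
    this.notes.push(note);
  }

  // Rendered into the system prompt (or a pinned message) on every turn,
  // so notes survive any compaction of the message history.
  render(): string {
    return this.notes.length === 0
      ? ''
      : `## Scratchpad\n${this.notes.map((n) => `- ${n}`).join('\n')}`;
  }
}

const scratchpad = new Scratchpad();
scratchpad.add(
  'Root cause confirmed: FooController.php line 142, $id is not cast to int before the query.'
);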
Most agent implementations treat context management as an afterthought — something you add when the model starts complaining about token limits. The better approach is to design for it from the start.
Define your compaction threshold. Build the scratchpad pattern from day one. Log compaction events so you can measure how often they fire and what they saved. An agent that gracefully manages its own context window is dramatically more reliable than one that degrades silently as the context fills up.
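Logging a compaction event can be as lightweight as recording token counts before and after; a sketch, with the log sink left to you:

interface CompactionEvent {
  timestamp: string;
  tokensBefore: number;
  tokensAfter: number;
  messagesRemoved: number;
}

function logCompaction(before: MessageParam[], after: MessageParam[]): void {
  const event: CompactionEvent = {
    timestamp: new Date().toISOString(),
    tokensBefore: estimateTokens(before),
    tokensAfter: estimateTokens(after),
    messagesRemoved: before.length - after.length,
  };
  // Swap console.log for your structured logger of choice.
  console.log('compaction', JSON.stringify(event));
}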