AI Engineering 2026-05-01

Managing Long-Running Agent Context Without Losing the Plot

At ~70% context capacity, stop and summarize. Keep the last 5 messages verbatim. Preserve hypothesis, decisions, and pending actions. Discard raw tool outputs. This compaction strategy keeps long-running agents coherent across hundreds of turns.


Every LLM has a context window limit. For long-running agents, this isn’t a theoretical concern — it’s something you hit regularly. A 25-turn investigation phase where each turn includes tool results can easily approach 100,000 tokens before the agent finishes its work.

The naive solutions are both wrong:

  • Truncate from the front: loses the initial problem statement and early findings
  • Hard stop: forces the agent to restart, losing all accumulated state

The right solution is compaction: summarize the older conversation while preserving recent context verbatim.


When to Compact

Estimate token usage by counting characters and dividing by four (a rough chars-per-token heuristic). When the estimate exceeds a threshold — typically 70% of the model’s context window — trigger compaction.

// One simple way to implement the helper: serialize the message and count
// characters. This covers text content, tool inputs, and tool results alike.
function extractChars(msg: MessageParam): number {
  return JSON.stringify(msg.content).length;
}

function estimateTokens(messages: MessageParam[]): number {
  let chars = 0;
  for (const msg of messages) {
    chars += extractChars(msg);
  }
  return Math.ceil(chars / 4);
}

export function shouldCompact(
  messages: MessageParam[],
  maxContextTokens: number,
  threshold: number // e.g. 0.7
): boolean {
  return estimateTokens(messages) > maxContextTokens * threshold;
}

The 70% threshold gives the compaction process room to operate — you need tokens available to run the summarization call itself.
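To make the headroom concrete, here is a small sketch of the trigger arithmetic. The window size and threshold below are illustrative, not tied to any particular model:

```typescript
// Trigger point and remaining headroom for a given window and threshold.
function compactionBudget(
  maxContextTokens: number,
  threshold: number
): { triggerAt: number; headroom: number } {
  const triggerAt = Math.round(maxContextTokens * threshold);
  return { triggerAt, headroom: maxContextTokens - triggerAt };
}
```

With a 200k-token window and a 0.7 threshold, compaction fires around 140k tokens, leaving roughly 60k tokens for the summarization call and its output.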


What to Keep vs Discard

The key insight: not all context is equally valuable.

Keep verbatim (last 5 messages):

  • The most recent tool results — the agent is actively working from these
  • The most recent assistant reasoning — the current hypothesis
  • The most recent user turn — any injected system nudges

Summarize (everything older):

  • Earlier investigation steps and file reads
  • Discarded hypotheses and dead ends
  • Intermediate reasoning that led to a conclusion the agent has already committed to
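The split itself is mechanical. A minimal sketch, assuming a simplified message shape (the type and function names here are illustrative):

```typescript
type Message = { role: 'user' | 'assistant'; content: string };

const KEEP_RECENT = 5; // size of the verbatim tail

// Partition history into the span to summarize and the span to keep verbatim.
function partitionForCompaction(messages: Message[]): {
  toSummarize: Message[];
  recent: Message[];
} {
  const cut = Math.max(0, messages.length - KEEP_RECENT);
  return {
    toSummarize: messages.slice(0, cut),
    recent: messages.slice(cut),
  };
}
```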

The summarizer prompt tells the LLM exactly what matters:

Preserve:
- Current hypothesis and analysis conclusions
- Files examined and their relevance
- Decisions made and rationale
- Action items completed and pending
- Key code patterns or bugs identified

Discard:
- Raw tool outputs and search results
- Intermediate reasoning that led to final conclusions
- Duplicate information

The Compacted Message Structure

After summarization, replace the discarded messages with two synthetic turns:

const compactedMessages: MessageParam[] = [
  {
    role: 'user',
    content: `[CONTEXT COMPACTION — Previous conversation summarized]\n\n${summaryText}`,
  },
  {
    role: 'assistant',
    content: 'Understood. I have the summarized context and will continue from here.',
  },
  ...recentMessages, // last 5 verbatim
];

The synthetic assistant acknowledgment is important. Without it, the message history would have two consecutive user turns, which some models reject or handle poorly.


What Gets Saved

A raw investigation turn with 10 file reads and 5 search results might be 15,000 tokens. The summary of that same content — the files found, what was relevant, the hypothesis formed — is typically 300–500 tokens. That’s a 30–50x compression ratio on older context.

Over a 25-turn run with two compaction events, you can reclaim 40,000–60,000 tokens of context window that would otherwise be occupied by raw tool output the agent no longer needs.
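The arithmetic is worth logging per event, since it is what you tune the threshold against. A sketch of a compaction-event record (field names are illustrative):

```typescript
interface CompactionEvent {
  turn: number;
  tokensBefore: number; // estimated tokens in the span that was summarized
  tokensAfter: number;  // estimated tokens in the summary that replaced it
}

// Tokens reclaimed and compression ratio for a single compaction event.
function summarizeEvent(e: CompactionEvent): { saved: number; ratio: number } {
  return {
    saved: e.tokensBefore - e.tokensAfter,
    ratio: e.tokensBefore / e.tokensAfter,
  };
}
```

A 15,000-token span compressed to a 400-token summary reclaims 14,600 tokens at a 37.5x ratio, in line with the 30–50x range above.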


The Risk: Losing Critical Details

Compaction is lossy by design. The risk is that the summarizer drops a specific detail the agent will need later — an exact line number, a specific variable name, a file path.

Mitigations:

  • Scratchpad: write key findings to a persistent scratchpad during investigation. The scratchpad is injected into every system prompt, so its contents survive compaction.
  • Keep more recent messages: increasing the verbatim window from 5 to 8 messages reduces the risk of dropping context that’s still actively relevant.
  • Log compaction events: record how many tokens were saved per compaction so you can tune the threshold.

The scratchpad is the most important mitigation. It’s the agent’s working memory — a place to write “root cause confirmed: bug is in FooController.php line 142, the $id param is not cast to int before the query.” That fact survives any compaction event because it lives outside the message history.
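A minimal scratchpad sketch, assuming an in-memory store whose rendered contents are injected into every system prompt (all names here are illustrative):

```typescript
// Persistent working memory. It lives outside the message history,
// so its contents survive every compaction event.
class Scratchpad {
  private notes: string[] = [];

  write(note: string): void {
    this.notes.push(note);
  }

  // Rendered into the system prompt on every turn.
  render(): string {
    if (this.notes.length === 0) return '';
    return ['## Scratchpad', ...this.notes.map((n) => `- ${n}`)].join('\n');
  }
}
```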


Treating Compaction as a First-Class Concern

Most agent implementations treat context management as an afterthought — something you add when the model starts complaining about token limits. The better approach is to design for it from the start.

Define your compaction threshold. Build the scratchpad pattern from day one. Log compaction events so you can measure how often they fire and what they saved. An agent that gracefully manages its own context window is dramatically more reliable than one that degrades silently as the context fills up.