Episodic Logs: A Forensic Trail for AI Agent Runs

AI agent runs are opaque by default. A run starts, uses some tools, produces an output, and ends. If something went wrong — a bad tool result, a cost ceiling hit, a policy flag triggered at the wrong moment — you have no way to reconstruct what happened without re-running the entire thing.

Episodic logging fixes this by recording every significant event during a run into a structured, queryable log. Not a debug dump — a purpose-built forensic trail.

What Gets Recorded

An episodic log records distinct event types, each with a timestamp and phase context:

Phase events: when a phase starts and ends, whether it succeeded, how many turns it used, and its final cost.

Tool call events: the tool name, duration in milliseconds, and whether it returned an error. Not the inputs and outputs (too large) — just the performance and error signal.

Cost events: when a cost ceiling is hit, with the current spend and the configured cap.

Compaction events: when the context window was compacted, with the number of tokens saved.

Policy flag events: when a flag was set (force-draft, block-PR-creation) and why, plus when flags were reset between runs.

Error events: unhandled exceptions, API failures, timeouts.

interface EpisodicEvent {
  ts: string;           // ISO timestamp
  phase: string;        // refine | execute | validate | review | learn
  event: string;        // phase_start | tool_call | cost_ceiling | etc.
  [key: string]: unknown; // event-specific payload
}
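A recorder for this shape can be very small. The sketch below is one possible version, not a reference implementation: the EpisodicLog class, its setPhase method, and the read_file tool name are assumptions made for illustration. It stamps each event with a timestamp and the current phase, then appends it as one JSON line.

```typescript
import { appendFileSync, mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface EpisodicEvent {
  ts: string;
  phase: string;
  event: string;
  [key: string]: unknown;
}

// Hypothetical recorder: callers supply the event name and payload,
// the recorder adds ts and phase so every entry carries both.
class EpisodicLog {
  private path: string;
  private phase: string;

  constructor(path: string, initialPhase: string) {
    this.path = path;
    this.phase = initialPhase;
  }

  setPhase(phase: string): void {
    this.phase = phase;
  }

  record(payload: { event: string; [key: string]: unknown }): EpisodicEvent {
    const entry: EpisodicEvent = {
      ...payload,
      ts: new Date().toISOString(),
      phase: this.phase,
    };
    // One JSON object per line; appending never rewrites earlier entries.
    appendFileSync(this.path, JSON.stringify(entry) + "\n");
    return entry;
  }
}

// Usage with illustrative values.
const logPath = join(mkdtempSync(join(tmpdir(), "episodic-")), "run.jsonl");
const log = new EpisodicLog(logPath, "refine");
log.record({ event: "phase_start" });
log.record({ event: "tool_call", toolName: "read_file", durationMs: 42, isError: false });
console.log(readFileSync(logPath, "utf8"));
```

Keeping ts and phase inside the recorder means call sites cannot forget them, which is what makes later queries by phase reliable.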

Why Structured, Not Free-Text

Free-text logs are for humans reading terminals. Episodic logs are for programmatic analysis.

When you want to know “which tickets hit cost ceilings in the refine phase last week,” you query the log file:

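// readJsonl: helper that parses one JSON object per line into EpisodicEvent[]
const events = readJsonl(logPath);
const refineCeilingHits = events
  .filter(e => e.event === 'cost_ceiling' && e.phase === 'refine');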

When you want to know “which tools are slowest across all runs,” you aggregate:

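const toolStats = events
  .filter(e => e.event === 'tool_call')
  .reduce<Record<string, { calls: number; totalMs: number }>>((acc, e) => {
    // Payload fields are typed unknown, so coerce them before use.
    const name = String(e.toolName);
    acc[name] = acc[name] ?? { calls: 0, totalMs: 0 };
    acc[name].calls++;
    acc[name].totalMs += Number(e.durationMs);
    return acc;
  }, {});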
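Totals alone do not say which tool is slowest per call, since a fast tool called often can outweigh a slow tool called rarely. A possible follow-up step, sketched here with made-up sample data and hypothetical tool names, divides total time by call count and sorts by the result.

```typescript
interface ToolCallEvent {
  event: "tool_call";
  toolName: string;
  durationMs: number;
}

// Illustrative sample data, not taken from a real run.
const sampleCalls: ToolCallEvent[] = [
  { event: "tool_call", toolName: "search_code", durationMs: 900 },
  { event: "tool_call", toolName: "read_file", durationMs: 40 },
  { event: "tool_call", toolName: "search_code", durationMs: 1100 },
  { event: "tool_call", toolName: "read_file", durationMs: 60 },
];

// Group by tool, then rank by mean duration, slowest first.
function rankByMeanDuration(
  calls: ToolCallEvent[],
): { toolName: string; calls: number; meanMs: number }[] {
  const totals: Record<string, { calls: number; totalMs: number }> = {};
  for (const c of calls) {
    const t = totals[c.toolName] ?? { calls: 0, totalMs: 0 };
    t.calls += 1;
    t.totalMs += c.durationMs;
    totals[c.toolName] = t;
  }
  return Object.keys(totals)
    .map(toolName => ({
      toolName,
      calls: totals[toolName].calls,
      meanMs: totals[toolName].totalMs / totals[toolName].calls,
    }))
    .sort((a, b) => b.meanMs - a.meanMs);
}

const ranking = rankByMeanDuration(sampleCalls);
console.log(ranking);
```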

JSONL format (one JSON object per line) is the right choice: human-readable, appendable without parsing the entire file, and directly importable into common analysis tools such as jq, pandas, and DuckDB.
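The readJsonl helper used in the queries above is not defined in this article. One way it might look, as a sketch: read the file, split on newlines, and skip any line that fails to parse, so a write cut short by a crash costs one event rather than the whole log. The file path and sample events below are illustrative.

```typescript
import { appendFileSync, mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface EpisodicEvent {
  ts: string;
  phase: string;
  event: string;
  [key: string]: unknown;
}

// Hypothetical helper: parses one JSON object per line, ignoring blank
// lines and lines that are not valid JSON (for example a torn final write).
function readJsonl(path: string): EpisodicEvent[] {
  const events: EpisodicEvent[] = [];
  for (const line of readFileSync(path, "utf8").split("\n")) {
    if (line.trim() === "") continue;
    try {
      events.push(JSON.parse(line) as EpisodicEvent);
    } catch {
      // Skip the damaged line; every other line is still usable.
    }
  }
  return events;
}

// Usage: two complete events followed by a simulated torn write.
const demoPath = join(mkdtempSync(join(tmpdir(), "episodic-")), "run.jsonl");
appendFileSync(
  demoPath,
  JSON.stringify({ ts: "2024-01-01T00:00:00.000Z", phase: "refine", event: "phase_start" }) + "\n",
);
appendFileSync(
  demoPath,
  JSON.stringify({ ts: "2024-01-01T00:00:05.000Z", phase: "refine", event: "cost_ceiling" }) + "\n",
);
appendFileSync(demoPath, '{"ts":"2024-01-01T00:00:09.000Z","phase":"ref');

const parsed = readJsonl(demoPath);
console.log(parsed.length); // 2
```

Reading the whole file into memory is fine for per-run logs; a very large aggregate file would call for a streaming reader instead.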

Replay Mode

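The episodic log enables replay: re-running an agent from a specific point in a previous run, feeding it recorded tool results instead of calling the actual tools again. Replay needs the tool results themselves to be stored alongside the log, because the tool_call events described earlier deliberately leave out inputs and outputs.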

This is invaluable for debugging. When a run fails at turn 18, you don’t want to re-run turns 1–17 (which might be slow, expensive, or have side effects). You want to load the recorded tool results for turns 1–17 and start from turn 18 with a different prompt or a bug fix.

Replay also enables deterministic testing: record a “golden” run, then run tests against that recording to verify that a prompt change doesn’t alter the output on known inputs.
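The core of replay is a wrapper around tool execution. The sketch below assumes a hypothetical recording shape (one stored result per turn) and a hypothetical ReplayRunner class. Turns before the resume point are answered from the recording; later turns call the live tool. It is synchronous to keep the example short. Real tools are usually async, and the same split applies with await.

```typescript
type ToolFn = (input: unknown) => unknown;

// Hypothetical recording shape: one stored tool result per turn.
interface RecordedTurn {
  turn: number;
  tool: string;
  result: unknown;
}

class ReplayRunner {
  liveCalls = 0;
  private recorded: Record<number, RecordedTurn> = {};
  private tools: Record<string, ToolFn>;
  private resumeAt: number;

  constructor(recording: RecordedTurn[], tools: Record<string, ToolFn>, resumeAt: number) {
    for (const r of recording) this.recorded[r.turn] = r;
    this.tools = tools;
    this.resumeAt = resumeAt;
  }

  // Turns before resumeAt are served from the recording; later turns go live.
  call(turn: number, tool: string, input: unknown): unknown {
    if (turn < this.resumeAt) {
      const hit = this.recorded[turn];
      if (!hit || hit.tool !== tool) {
        // The agent asked for something the original run did not: stop early.
        throw new Error(`replay diverged at turn ${turn}: got ${tool}`);
      }
      return hit.result;
    }
    this.liveCalls += 1;
    const fn = this.tools[tool];
    if (!fn) throw new Error(`unknown tool: ${tool}`);
    return fn(input);
  }
}

// Usage with illustrative data: replay turns 1-2, resume live at turn 3.
const recording: RecordedTurn[] = [
  { turn: 1, tool: "read_file", result: "export function total() {}" },
  { turn: 2, tool: "run_tests", result: { passed: 3, failed: 1 } },
];
const runner = new ReplayRunner(
  recording,
  { run_tests: () => ({ passed: 4, failed: 0 }) },
  3,
);
const replayed = runner.call(1, "read_file", { path: "src/invoice.ts" });
const live = runner.call(3, "run_tests", {});
console.log(replayed, live, runner.liveCalls);
```

Throwing on divergence matters for the golden-run case too: if a prompt change makes the agent request a different tool at a replayed turn, the test fails at that turn instead of silently returning a mismatched result.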

The Policy Flag Connection

Policy flags — blast-radius blocks, force-draft triggers — are especially important to log because they’re stateful mutations that happen asynchronously during a run. If a run creates a draft PR when you expected a ready PR, you need to know which tool call triggered the force-draft flag and why.

To capture this, wire the episodic log in as a callback on the PolicyFlags onMutate event:

policyFlags.setOnMutate((snapshot, event) => {
  // Assumes record() adds ts and the current phase to every entry.
  episodic.record({
    event: 'policy_flags_snapshot',
    mutation: event,
    snapshot,
  });
});

Now every flag change is timestamped and associated with the phase it happened in. “The PR was opened as draft because blast_radius set force_draft at turn 14 in the validate phase with reason ‘calculateInvoiceTotal has 12 direct consumers’” — completely traceable.
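To make the trace concrete, here is a self-contained sketch of that wiring. The PolicyFlags class below is a hypothetical stand-in (only setOnMutate appears in the snippet above), and the mutation fields and in-memory log are assumptions. It records a snapshot on every mutation, then looks up the entry that turned force-draft on.

```typescript
type Flags = { forceDraft: boolean; blockPrCreation: boolean };
type Mutation = { flag: keyof Flags; value: boolean; source: string; reason: string };
type OnMutate = (snapshot: Flags, event: Mutation) => void;

// Hypothetical stand-in for the PolicyFlags object.
class PolicyFlags {
  private flags: Flags = { forceDraft: false, blockPrCreation: false };
  private onMutate: OnMutate = () => {};

  setOnMutate(cb: OnMutate): void {
    this.onMutate = cb;
  }

  set(event: Mutation): void {
    const next = { ...this.flags };
    next[event.flag] = event.value;
    this.flags = next;
    // Hand the callback a copy so logged snapshots cannot change later.
    this.onMutate({ ...next }, event);
  }
}

interface FlagLogEntry {
  event: "policy_flags_snapshot";
  phase: string;
  turn: number;
  mutation: Mutation;
  snapshot: Flags;
}

// In-memory log for the example; a real recorder would append to the JSONL file.
const recorded: FlagLogEntry[] = [];
const currentPhase = "validate";
const currentTurn = 14;

const policyFlags = new PolicyFlags();
policyFlags.setOnMutate((snapshot, event) => {
  recorded.push({
    event: "policy_flags_snapshot",
    phase: currentPhase,
    turn: currentTurn,
    mutation: event,
    snapshot,
  });
});

policyFlags.set({
  flag: "forceDraft",
  value: true,
  source: "blast_radius",
  reason: "calculateInvoiceTotal has 12 direct consumers",
});

// Find the mutation that turned forceDraft on.
const cause = recorded.filter(e => e.mutation.flag === "forceDraft" && e.mutation.value)[0];
console.log(`${cause.mutation.source} at turn ${cause.turn} (${cause.phase}): ${cause.mutation.reason}`);
```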

Dashboard Integration

Episodic logs are the data source for observability dashboards. Aggregate across all tickets:

Which phases fail most often?
Which tools are slowest?
What’s the average cost per ticket per phase?
Which tickets hit cost ceilings before producing output?
What’s the p95 turn count for the refine phase?

These questions are easy to answer when every run emits structured events. They are impractical to answer reliably from free-text logs or from the model’s own output.
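As an illustration, several of these metrics fall out of phase_end events alone. The sketch below assumes hypothetical payload field names (success, turns, costUsd) and uses made-up sample data. The p95 uses the nearest-rank method.

```typescript
// Assumed phase_end payload; the field names are illustrative.
interface PhaseEnd {
  event: "phase_end";
  phase: string;
  success: boolean;
  turns: number;
  costUsd: number;
}

// Nearest-rank percentile: the value at position ceil(p/100 * n) in sorted order.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no values");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarize(ends: PhaseEnd[], phase: string) {
  const rows = ends.filter(e => e.phase === phase);
  const n = rows.length;
  const failures = rows.filter(e => !e.success).length;
  return {
    runs: n,
    failureRate: n === 0 ? 0 : failures / n,
    meanCostUsd: n === 0 ? 0 : rows.reduce((sum, e) => sum + e.costUsd, 0) / n,
    p95Turns: n === 0 ? 0 : percentile(rows.map(e => e.turns), 95),
  };
}

// Made-up sample data across several runs.
const phaseEnds: PhaseEnd[] = [
  { event: "phase_end", phase: "refine", success: true, turns: 5, costUsd: 0.5 },
  { event: "phase_end", phase: "refine", success: true, turns: 8, costUsd: 1.0 },
  { event: "phase_end", phase: "refine", success: false, turns: 6, costUsd: 0.5 },
  { event: "phase_end", phase: "refine", success: true, turns: 20, costUsd: 2.0 },
  { event: "phase_end", phase: "validate", success: true, turns: 3, costUsd: 0.25 },
];

const refineSummary = summarize(phaseEnds, "refine");
console.log(refineSummary);
```

With only a handful of runs the p95 is simply the maximum, so the number becomes meaningful only as runs accumulate.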

Start Simple

The minimal viable episodic log is three event types: phase_start, phase_end, tool_call. Add cost and compaction events when you need cost visibility. Add policy flag events when you add blast radius analysis.

Don’t try to log everything. Log what you’ll actually query. The value of episodic logging compounds over time as you accumulate enough runs to identify patterns — slow tools, expensive phases, tickets that always hit ceilings. Start simple and add event types when a specific question comes up that you can’t answer.

Related: Episodic logging is canonical evidence for dimension D10 (observability) in the MCP design rubric. The Brodels corpus entry uses this exact pattern; pattern P-12 (episodic audit logs) cites it as the reference implementation for audit-required MCP operations. See the MCP Architect skill stack for the full rubric.