When an agent run fails, you need to know exactly what happened: which tools ran, how long they took, when cost ceilings were hit, and what state policy flags were in at each step. Structured episodic logging makes all of this replayable.
AI agent runs are opaque by default. A run starts, uses some tools, produces an output, and ends. If something went wrong — a bad tool result, a cost ceiling hit, a policy flag triggered at the wrong moment — you have no way to reconstruct what happened without re-running the entire thing.
Episodic logging fixes this by recording every significant event during a run into a structured, queryable log. Not a debug dump — a purpose-built forensic trail.
An episodic log records distinct event types, each with a timestamp and phase context:
Phase events: when a phase starts and ends, whether it succeeded, how many turns it used, and its final cost.
Tool call events: the tool name, duration in milliseconds, and whether it returned an error. Not the inputs and outputs (too large) — just the performance and error signal.
Cost events: when a cost ceiling is hit, with the current spend and the configured cap.
Compaction events: when the context window was compacted, with the number of tokens saved.
Policy flag events: when a flag was set (force-draft, block-PR-creation) and why, and when flags were reset between runs.
Error events: unhandled exceptions, API failures, timeouts.
interface EpisodicEvent {
  ts: string;               // ISO timestamp
  phase: string;            // refine | execute | validate | review | learn
  event: string;            // phase_start | tool_call | cost_ceiling | etc.
  [key: string]: unknown;   // event-specific payload
}
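To make the event shape concrete, here is a minimal sketch of a recorder that stamps each event with a timestamp and the current phase, then emits one JSON line per event. The `EpisodicLog` class, its `setPhase` method, and the `sink` callback are assumptions made for illustration; the article does not define the recorder's API beyond a `record` call.

```typescript
// Event shape from the article: timestamp, phase, event name, free-form payload.
interface EpisodicEvent {
  ts: string;
  phase: string;
  event: string;
  [key: string]: unknown;
}

// Hypothetical recorder. Class name, setPhase, and the sink callback are
// illustrative assumptions, not an API defined by the article.
class EpisodicLog {
  private phase = "unknown";
  private readonly sink: (line: string) => void;

  // `sink` receives one serialized JSON line per event. In a real setup it
  // might append the line to a .jsonl file.
  constructor(sink: (line: string) => void) {
    this.sink = sink;
  }

  setPhase(phase: string): void {
    this.phase = phase;
  }

  // Callers pass the event name and payload; ts and phase are stamped here
  // (and override any ts/phase keys present in the payload).
  record(payload: { event: string; [key: string]: unknown }): EpisodicEvent {
    const entry: EpisodicEvent = {
      ...payload,
      ts: new Date().toISOString(),
      phase: this.phase,
    };
    this.sink(JSON.stringify(entry));
    return entry;
  }
}

// Usage: collect lines in memory instead of writing to disk.
const lines: string[] = [];
const episodic = new EpisodicLog((line) => {
  lines.push(line);
});
episodic.setPhase("execute");
episodic.record({ event: "tool_call", toolName: "read_file", durationMs: 42, isError: false });
```

Stamping `ts` and `phase` inside the recorder keeps call sites short and makes it hard to emit an event without phase context.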
Free-text logs are for humans reading terminals. Episodic logs are for programmatic analysis.
When you want to know “which tickets hit cost ceilings in the refine phase last week,” you query the log file:
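// readJsonl is an assumed helper that parses the log file into EpisodicEvent objects.
const events = readJsonl(logPath);
const refineCeilings = events
  .filter(e => e.event === 'cost_ceiling' && e.phase === 'refine');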
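The `readJsonl` helper in the query above is not shown in the article. One way it might work is sketched below as a pure parsing function, `parseJsonl`, which is a hypothetical name; a file-based `readJsonl` could wrap it around a file read such as Node's `fs.readFileSync(path, "utf8")`. The sample log lines are made up for the example.

```typescript
interface EpisodicEvent {
  ts: string;
  phase: string;
  event: string;
  [key: string]: unknown;
}

// Hypothetical helper: parses JSONL text into events. Blank lines and lines
// that fail to parse (for example a final line truncated by a crash) are skipped.
function parseJsonl(text: string): EpisodicEvent[] {
  const events: EpisodicEvent[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      events.push(JSON.parse(trimmed) as EpisodicEvent);
    } catch {
      // Skip malformed lines rather than abort the whole analysis.
    }
  }
  return events;
}

// Made-up sample log; the third line is deliberately truncated.
const sampleLog = [
  '{"ts":"2024-01-01T00:00:00.000Z","phase":"refine","event":"cost_ceiling","spend":2.1,"cap":2}',
  '{"ts":"2024-01-01T00:01:00.000Z","phase":"execute","event":"tool_call","toolName":"grep","durationMs":35}',
  '{"ts":"2024-01-01T00:02:00.000Z","phase":"refi',
].join("\n");

const sampleCeilings = parseJsonl(sampleLog).filter(
  (e) => e.event === "cost_ceiling" && e.phase === "refine",
);
```

Skipping unparseable lines is a design choice suited to forensic use: a run that crashed mid-write should still leave a readable log.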
When you want to know “which tools are slowest across all runs,” you aggregate:
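// Sum call counts and total duration per tool.
const toolStats = events
  .filter(e => e.event === 'tool_call')
  .reduce<Record<string, { calls: number; totalMs: number }>>((acc, e) => {
    // Payload fields are typed unknown, so narrow them before use.
    const name = String(e.toolName);
    const stat = (acc[name] ??= { calls: 0, totalMs: 0 });
    stat.calls++;
    stat.totalMs += Number(e.durationMs);
    return acc;
  }, {});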
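The aggregate holds totals, so answering "which tools are slowest" still needs an average and a sort. The sketch below shows one way to do that; `rankByAverage` and the sample numbers are hypothetical.

```typescript
// Per-tool totals, matching the shape built by the aggregation.
interface ToolStat {
  calls: number;
  totalMs: number;
}

// Hypothetical helper: converts totals into rows sorted by average duration,
// slowest first.
function rankByAverage(
  stats: Record<string, ToolStat>,
): { toolName: string; calls: number; avgMs: number }[] {
  return Object.entries(stats)
    .map(([toolName, s]) => ({
      toolName,
      calls: s.calls,
      avgMs: s.totalMs / s.calls,
    }))
    .sort((a, b) => b.avgMs - a.avgMs);
}

// Made-up numbers: run_tests averages 4500 ms, read_file averages 20 ms.
const ranked = rankByAverage({
  read_file: { calls: 4, totalMs: 80 },
  run_tests: { calls: 2, totalMs: 9000 },
});
```

Ranking by average rather than total keeps a cheap but frequently called tool from masking a rarely called slow one; sorting by `totalMs` instead would answer the different question of where wall-clock time goes overall.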
JSONL (one JSON object per line) is a good fit: it is human-readable, each event can be appended without parsing or rewriting the rest of the file, and most analysis tools can import it directly.
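The episodic log enables replay: re-running an agent from a specific point in a previous run, using recorded tool results instead of calling the actual tools again. Because tool call events deliberately omit inputs and outputs, replay depends on those results being captured separately alongside the log.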
This is invaluable for debugging. When a run fails at turn 18, you don’t want to re-run turns 1–17 (which might be slow, expensive, or have side effects). You want to load the recorded tool results for turns 1–17 and start from turn 18 with a different prompt or a bug fix.
Replay also enables deterministic testing: record a “golden” run, then run tests against that recording to verify that a prompt change doesn’t alter the output on known inputs.
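A minimal sketch of the replay idea follows. It assumes tool results were stored per turn in some side recording, and that each tool is called at most once per turn; `RecordedCall`, `ToolRunner`, and `makeReplayRunner` are hypothetical names, not part of any described API.

```typescript
// Hypothetical record of one tool invocation, stored alongside the episodic log.
interface RecordedCall {
  turn: number;
  toolName: string;
  result: string;
}

// Assumed signature for whatever executes a tool during a run.
type ToolRunner = (toolName: string, turn: number) => Promise<string>;

// Returns a runner that serves recorded results for turns before `resumeAt`
// and falls through to the live runner from `resumeAt` onward.
function makeReplayRunner(
  recording: RecordedCall[],
  resumeAt: number,
  live: ToolRunner,
): ToolRunner {
  // Assumes at most one call per tool per turn; later duplicates overwrite.
  const byKey = new Map<string, string>();
  for (const call of recording) {
    byKey.set(`${call.turn}:${call.toolName}`, call.result);
  }
  return async (toolName, turn) => {
    if (turn < resumeAt) {
      const recorded = byKey.get(`${turn}:${toolName}`);
      if (recorded === undefined) {
        // The replayed run diverged from the recording; fail loudly.
        throw new Error(`No recorded result for ${toolName} at turn ${turn}`);
      }
      return recorded;
    }
    return live(toolName, turn);
  };
}

// Usage with made-up data: turn 1 is replayed, turn 2 runs live.
const recording: RecordedCall[] = [
  { turn: 1, toolName: "read_file", result: "export function total() {}" },
];
const liveRunner: ToolRunner = async (toolName, turn) => `live:${toolName}@${turn}`;
const runTool = makeReplayRunner(recording, 2, liveRunner);
```

Throwing on a missing recording, rather than silently calling the live tool, also suits golden-run tests: a prompt change that makes the agent request a different tool shows up as an explicit failure.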
Policy flags — blast-radius blocks, force-draft triggers — are especially important to log because they’re stateful mutations that happen asynchronously during a run. If a run creates a draft PR when you expected a ready PR, you need to know which tool call triggered the force-draft flag and why.
Wire the episodic log in as a callback on the PolicyFlags onMutate event:
policyFlags.setOnMutate((snapshot, event) => {
  episodic.record({
    event: 'policy_flags_snapshot',
    mutation: event,
    snapshot,
  });
});
Now every flag change is timestamped and associated with the phase it happened in. “The PR was opened as draft because blast_radius set force_draft at turn 14 in the validate phase with reason ‘calculateInvoiceTotal has 12 direct consumers’” — completely traceable.
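With those snapshots in the log, tracing a surprise draft PR becomes a short query. The sketch below assumes the `mutation` payload carries `flag`, `value`, and `reason` fields; the article does not specify that shape, so `FlagMutation` and `findFlagCause` are hypothetical, as is the sample data.

```typescript
interface EpisodicEvent {
  ts: string;
  phase: string;
  event: string;
  [key: string]: unknown;
}

// Assumed shape of the `mutation` payload; not defined by the article.
interface FlagMutation {
  flag: string;
  value: boolean;
  reason: string;
}

// Runtime check, since payload fields are typed unknown.
function isFlagMutation(x: unknown): x is FlagMutation {
  if (typeof x !== "object" || x === null) return false;
  const m = x as Record<string, unknown>;
  return (
    typeof m["flag"] === "string" &&
    typeof m["value"] === "boolean" &&
    typeof m["reason"] === "string"
  );
}

// Returns the first event that switched the given flag on, or undefined.
function findFlagCause(
  events: EpisodicEvent[],
  flag: string,
): { ts: string; phase: string; reason: string } | undefined {
  for (const e of events) {
    if (e.event !== "policy_flags_snapshot") continue;
    const m = e["mutation"];
    if (isFlagMutation(m) && m.flag === flag && m.value) {
      return { ts: e.ts, phase: e.phase, reason: m.reason };
    }
  }
  return undefined;
}

// Made-up events for illustration.
const flagEvents: EpisodicEvent[] = [
  { ts: "2024-01-01T00:00:00.000Z", phase: "execute", event: "tool_call", toolName: "grep" },
  {
    ts: "2024-01-01T00:05:00.000Z",
    phase: "validate",
    event: "policy_flags_snapshot",
    mutation: { flag: "force_draft", value: true, reason: "12 direct consumers" },
    snapshot: { force_draft: true },
  },
];
const cause = findFlagCause(flagEvents, "force_draft");
```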
Episodic logs are the data source for observability dashboards. Aggregate across all tickets to surface patterns: which tools are slowest, which phases cost the most, which tickets repeatedly hit cost ceilings.
These questions are easy to answer when every run emits structured events. They’re impossible to answer from free-text logs or from the model’s own output.
The minimal viable episodic log is three event types: phase_start, phase_end, tool_call. Add cost and compaction events when you need cost visibility. Add policy flag events when you add blast radius analysis.
Don’t try to log everything. Log what you’ll actually query. The value of episodic logging compounds over time as you accumulate enough runs to identify patterns — slow tools, expensive phases, tickets that always hit ceilings. Start simple and add event types when a specific question comes up that you can’t answer.