AI Engineering 2026-05-01

Building a Self-Improving Knowledge Base for AI Engineering Agents

A knowledge base that never prunes itself becomes noise. Track which bullets were shown vs cited, detect which ones came from PRs that were closed unmerged, and soft-delete them automatically. The feedback loop that keeps a playbook accurate.


A knowledge base for AI agents starts useful and degrades over time — unless you build the feedback loop in from the beginning.

The typical pattern: an agent solves a problem, you extract the lesson, you add it to the playbook. The playbook grows. Months later, 40% of the playbook is stale patterns from tickets that turned out to be edge cases, deprecated APIs, or fixes that were later reverted. The agent loads all of it into context. Signal-to-noise drops. Retrieval quality degrades.

The fix isn’t manual curation. It’s measuring lesson usage and auto-demoting bullets that fail the signal test.


What “Lesson Usage” Means

Every time a lesson is retrieved and shown to an agent, that’s a load event. Every time the agent cites a lesson in its own output (referencing its ID while reasoning), that’s a cite event.

These are different signals:

  • A bullet that’s loaded but never cited might be irrelevant to the tickets it’s being retrieved for
  • A bullet that’s cited frequently in successful runs is earning its keep
  • A bullet that’s cited frequently in failed runs might be actively harmful — a misleading pattern

Record both events with the phase name, ticket ID, and success flag:

recordLessonLoaded({
  lessonIds: retrievedBullets.map(b => b.id),
  consumer: 'refine',
  ticketId,
  phase: 'refine',
});

// Later, after the phase completes:
recordLessonCited({
  lessonIds: [...citedIds],
  ticketId,
  phase: 'refine',
  phaseSuccess: success,
});

These events accumulate in a .lesson-usage.jsonl file — one line per event, easy to query.
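A minimal sketch of querying that file. It assumes each line carries an event field set to 'loaded' or 'cited' alongside the lessonIds and phaseSuccess recorded above; those field names are illustrative, not a fixed schema:

```typescript
interface UsageEvent {
  event: 'loaded' | 'cited';  // assumed discriminator field
  lessonIds: string[];
  phaseSuccess?: boolean;     // present on cite events
}

interface UsageStats {
  loads: number;
  cites: number;
  failedCites: number;        // cites recorded in unsuccessful phases
}

// Aggregate per-lesson counts from JSONL text (one event per line).
function aggregateUsage(jsonl: string): Map<string, UsageStats> {
  const stats = new Map<string, UsageStats>();
  for (const line of jsonl.split('\n').filter(l => l.trim())) {
    const ev: UsageEvent = JSON.parse(line);
    for (const id of ev.lessonIds) {
      const s = stats.get(id) ?? { loads: 0, cites: 0, failedCites: 0 };
      if (ev.event === 'loaded') {
        s.loads++;
      } else {
        s.cites++;
        if (ev.phaseSuccess === false) s.failedCites++;
      }
      stats.set(id, s);
    }
  }
  return stats;
}
```

From these counts, the three signal classes above fall out directly: high loads with zero cites, high cites with few failedCites, and high failedCites relative to cites.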


The Auto-Demote Signal

The strongest negative signal is: a bullet was learned from a ticket whose PR was eventually closed without merge.

A closed-unmerged PR means a human reviewed the fix and rejected it. If the playbook has a lesson derived from that fix, the lesson is likely a bad pattern. It should be removed from future agent context.

The demote logic:

  1. Find tickets where needsHumanReview: true was set (meaning the agent itself flagged uncertainty)
  2. Check whether the PR for that ticket was closed without merge
  3. If yes, find all playbook bullets with learned_from: TICKET-ID
  4. Set deprecated: true on those bullets

if (
  ticket.planGate?.needsHumanReview &&
  ticket.pr?.state === 'closed' &&
  !ticket.pr?.merged // a merged PR also reports a closed state, so exclude it
) {
  const bullets = playbook.filter(b => b.learned_from === ticket.ticket);
  for (const bullet of bullets) {
    bullet.deprecated = true;
    bullet.demoted_on = new Date().toISOString();
    bullet.demoted_reason = 'pr_closed_unmerged';
  }
}

Bullets are never deleted — they’re soft-deprecated. The deprecated: true flag causes the loader to exclude them from retrieval. The record is preserved for audit purposes.
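The loader-side exclusion is a one-line filter. A sketch, with the bullet shape assumed from the fields used above:

```typescript
interface Bullet {
  id: string;
  text: string;
  learned_from?: string;
  deprecated?: boolean;   // soft-delete flag; record stays on disk
}

// Exclude soft-deprecated bullets from retrieval without touching the file.
function activeBullets(playbook: Bullet[]): Bullet[] {
  return playbook.filter(b => !b.deprecated);
}
```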


BM25 Retrieval Over Static Keywords

Retrieval quality matters as much as KB quality. The naive approach — deriving query keywords from the ticket ID (CTMS-4337 → keywords ['CTMS', '4337']) — matches zero bullets in practice. Ticket IDs carry no semantic content.

The correct approach derives keywords from the ticket’s content: summary, description, and labels. BM25 tokenizes internally, so you don’t need to over-normalize — just pass the raw text.

function synthesizeKeywords(ticket: JiraTicket): string[] {
  return [
    ticket.summary,
    ticket.description?.slice(0, 500) ?? '',
    ...(ticket.labels ?? []),
  ].filter(s => s.length > 0);
}

Cap description length. Long ticket bodies dominate term-frequency weighting and drown out the summary and label signal that’s actually most relevant for playbook matching.
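To make the scoring concrete, here is a self-contained BM25 sketch (k1 = 1.2, b = 0.75) that ranks bullet texts against the synthesized keyword strings. The tokenizer is a deliberately simple lowercase split; a real retrieval library would normalize more carefully:

```typescript
function tokenize(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Score each document against the query terms; higher is a better match.
function bm25Rank(query: string[], docs: string[]): number[] {
  const k1 = 1.2, b = 0.75;
  const docTokens = docs.map(tokenize);
  const avgLen = docTokens.reduce((a, t) => a + t.length, 0) / docs.length;
  const qTerms = new Set(query.flatMap(tokenize));

  // Document frequency per query term (how many docs contain it).
  const df = new Map<string, number>();
  for (const term of qTerms) {
    df.set(term, docTokens.filter(t => t.includes(term)).length);
  }

  return docTokens.map(tokens => {
    let score = 0;
    for (const term of qTerms) {
      const tf = tokens.filter(t => t === term).length;
      if (tf === 0) continue;
      const n = df.get(term)!;
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      score +=
        (idf * tf * (k1 + 1)) /
        (tf + k1 * (1 - b + (b * tokens.length) / avgLen));
    }
    return score;
  });
}
```

The length normalization term (b times document length over average length) is exactly what makes capping the description matter: without the cap, a long ticket body inflates every bullet's denominator and flattens the ranking.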


The Prune Command

Beyond auto-demote, run periodic pruning on bullets that have been in the corpus for more than 30 days without any load or cite events. These are lessons that no ticket has triggered — dead weight.

Add safety gates:

  • Minimum bullet age (don’t prune something added yesterday)
  • Minimum observation period (don’t prune anything until you have 30 days of data)
  • Dry-run by default (show what would be pruned before doing it)

The combination of auto-demote (reacts to explicit rejection signals) and scheduled pruning (clears unused lessons) keeps the playbook from accumulating noise indefinitely.


Why This Matters

A 200-bullet playbook where 40 bullets are stale or counterproductive is worse than a 100-bullet playbook where every bullet applies. Context window space is finite. Retrieval recall degrades as noise increases. The agent reads irrelevant bullets and either ignores them (wasted tokens) or is confused by them (active harm).

The feedback loop — retrieve, record, cite, measure, demote — turns the knowledge base from a write-only log into a living system that improves with every run.