Requiring agents to cite concrete evidence — grep hits, file line numbers, observed code patterns — before claiming high confidence fundamentally changes output quality. No evidence means low confidence means draft PR means human review. The discipline compounds.
AI agents are confidently wrong more often than they admit to being uncertain.
The default behavior of a language model producing a structured output is to fill in all fields plausibly — including confidence scores. If the schema has a confidence: high|medium|low field, the model will produce high unless explicitly told not to. It’s not lying; it’s pattern-completing. But the effect is that confidence scores become meaningless noise, and the humans relying on them make decisions they shouldn’t.
Evidence-gating fixes this by making confidence a derived value, not a self-assessed one.
Every action item in a plan must include a concrete evidence list: the actual tool results that ground the proposed change.
```yaml
action_items:
  - id: 1
    title: "Cast $id to int before the database query"
    file: "app/controllers/InvoiceController.php"
    type: modify
    details: "Line 142: $id is used directly in a prepared statement without casting."
    evidence:
      - "grep_search: found direct usage at InvoiceController.php:142"
      - "read_file: confirmed $id is passed as string from $_GET without sanitization"
      - "read_file: prepared statement at line 145 expects integer binding"
    confidence: high
```
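On the pipeline side, the same shape can be written down as a type that plan output is validated against. A minimal sketch; the field names mirror the YAML above, the `type` variants beyond `modify` are assumed, and the interface is illustrative rather than a prescribed format:

```typescript
interface ActionItem {
  id: number;
  title: string;
  file: string;
  type: 'modify' | 'create' | 'delete';
  details: string;
  evidence: string[];                     // required: concrete observations from tool calls
  confidence: 'high' | 'medium' | 'low';
  needs_human_review?: boolean;
}
```

Making evidence a required field (and checking it is non-empty at validation time) is the point: a plan item that cannot cite anything fails validation before a human ever reads it.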
The rule: confidence: high requires that the agent has read the exact lines it intends to change and confirmed the root cause reproduces the reported symptom.
If the agent can’t point to direct evidence, it must set confidence: low and add needs_human_review: true.
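That rule is mechanical enough to enforce in code. A sketch of the demotion pass, assuming evidence strings are prefixed with the tool that produced them, as in the example above; gateConfidence and GateResult are illustrative names, not part of any framework:

```typescript
interface GateResult {
  confidence: 'high' | 'medium' | 'low';
  needsHumanReview: boolean;
  missingEvidence: string[];
}

// Demote any claim of high confidence that isn't backed by direct observation.
// A real gate would also check that the cited line ranges cover the lines being changed.
function gateConfidence(item: { evidence: string[]; confidence: GateResult['confidence'] }): GateResult {
  const readTheCode = item.evidence.some((e) => e.startsWith('read_file:'));
  const locatedSymptom = item.evidence.some((e) => e.startsWith('grep_search:'));

  const missingEvidence: string[] = [];
  if (!readTheCode) missingEvidence.push('read_file result covering the lines to change');
  if (!locatedSymptom) missingEvidence.push('search hit locating the reported symptom');

  if (missingEvidence.length > 0) {
    return { confidence: 'low', needsHumanReview: true, missingEvidence };
  }
  return { confidence: item.confidence, needsHumanReview: false, missingEvidence };
}
```

The agent's self-assessed score survives only when the evidence supports it; otherwise the gate overwrites it.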
Requiring evidence in the schema doesn’t just make the output more readable. It changes how the agent investigates.
Without evidence requirements, an agent can produce a confident-sounding plan based entirely on inference: “this is probably the issue because…” With evidence requirements, the agent has to go read the file. It has to run the grep. It can’t claim high confidence until it has actually observed the problem.
This is the same discipline that distinguishes good code review from rubber-stamping. The reviewer who says “LGTM” without reading the diff is useless. The reviewer who says “I checked lines 89–142, the logic is correct, the test covers the edge case” is valuable. The evidence requirement forces the agent into the second mode.
Evidence-gating only matters if the output actually changes based on confidence. That means wiring confidence scores to downstream behavior:

- confidence: high, no human review flag: create a ready-to-merge PR.
- confidence: medium: create a ready-to-merge PR, include a note in the description about what was uncertain.
- confidence: low or needs_human_review: true: create a draft PR. Post a Jira comment explaining what evidence is missing and what a human needs to verify before merging.
- confidence: low AND blast radius > threshold: block PR creation entirely. Post a Jira comment with the specific missing evidence and the blast radius summary. Do not create a PR until a human resolves the uncertainty.
The last branch, in code:

```typescript
// Low confidence plus a wide blast radius: block PR creation and explain why.
if (planGate.needsHumanReview && blastRadius.highRisk) {
  policyFlags.enableBlockPrCreation(
    `Low confidence fix with ${blastRadius.consumerCount} consumers. ` +
    `Missing evidence: ${planGate.missingEvidence.join(', ')}`
  );
}
```
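The other branches follow the same pattern. A sketch of the full routing, where createPullRequest and postJiraComment are hypothetical helpers standing in for your VCS and ticketing clients, and GateResult mirrors the gate output used above:

```typescript
interface GateResult {
  confidence: 'high' | 'medium' | 'low';
  needsHumanReview: boolean;
  missingEvidence: string[];
}

interface BlastRadius {
  highRisk: boolean;
  consumerCount: number;
}

// Hypothetical integration helpers; substitute your own VCS and ticketing clients.
declare function createPullRequest(opts: { draft: boolean; note?: string }): Promise<void>;
declare function postJiraComment(body: string): Promise<void>;

async function routePlan(gate: GateResult, blastRadius: BlastRadius): Promise<void> {
  if (gate.confidence === 'low' && blastRadius.highRisk) {
    // Highest-friction path: no PR at all until a human resolves the uncertainty.
    await postJiraComment(
      `Blocked: missing evidence (${gate.missingEvidence.join(', ')}); ` +
        `${blastRadius.consumerCount} consumers in the blast radius.`
    );
    return;
  }
  if (gate.confidence === 'low' || gate.needsHumanReview) {
    // Draft PR: the work is visible, but merging requires a human to close the evidence gap.
    await createPullRequest({ draft: true });
    await postJiraComment(`Draft PR opened. Please verify: ${gate.missingEvidence.join(', ')}`);
    return;
  }
  // High or medium confidence: ready to merge, with uncertainty noted for the medium case.
  await createPullRequest({
    draft: false,
    note: gate.confidence === 'medium'
      ? 'Medium confidence: the description lists what was uncertain.'
      : undefined,
  });
}
```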
Three failure modes to watch for:

- Self-upgrading to look decisive: the agent sets confidence: high because it sounds better, even though it hasn’t read the relevant file. The schema constraint alone doesn’t prevent this; you need the evidence requirement to make high defensible.
- Evidence by assertion: listing “the bug is on line 142” as evidence without it being a tool result. Evidence must be a concrete observation from a tool call: a grep hit, a line range that was read, a field that was fetched from the ticket.
- Confidence without scope: “confidence: high” for what? High confidence that the root cause is correctly identified? That the proposed change is correct? That the change won’t break anything? Make the scope explicit in the evidence list.
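The second failure mode, evidence by assertion, can be caught mechanically by cross-checking evidence entries against the tool calls that actually happened during the run. A sketch; the ToolCall trace shape is an assumption about what the agent runtime records, and the string matching is deliberately naive:

```typescript
interface ToolCall {
  tool: string;    // e.g. 'grep_search', 'read_file'
  target: string;  // file path or query the call ran against
}

// Return evidence entries that don't correspond to any tool call in the run's trace.
// "The bug is on line 142" with no matching read_file or grep_search is assertion, not evidence.
function findUnsupportedEvidence(evidence: string[], trace: ToolCall[]): string[] {
  return evidence.filter((entry) => {
    const tool = entry.split(':', 1)[0].trim();
    return !trace.some((call) => call.tool === tool && entry.includes(call.target));
  });
}
```

Anything this returns is unsupported: demote the item and flag it for human review.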
Epistemic honesty compounds over time in two ways.
First, the humans receiving draft PRs learn to trust them differently from ready-to-merge PRs. A draft is a signal: “I think this is right but I need you to check X.” That information is useful. It focuses human review on exactly the thing that needs review.
Second, the auto-demote feedback loop (tracking which PRs with needs_human_review: true are eventually closed without merge) uses the evidence quality as a signal for playbook bullet reliability. Lessons derived from low-confidence fixes that were rejected by humans get demoted automatically.
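A sketch of that demotion pass, run periodically over closed PRs. TrackedPr, playbookBulletIds, and the rejection threshold are assumptions about what the system records per PR, not an existing schema:

```typescript
interface TrackedPr {
  needsHumanReview: boolean;
  closedWithoutMerge: boolean;
  playbookBulletIds: string[];  // playbook lessons the plan drew on
}

// Find playbook bullets whose low-confidence fixes keep getting rejected by humans.
function bulletsToDemote(prs: TrackedPr[], rejectionThreshold = 3): string[] {
  const rejections = new Map<string, number>();
  for (const pr of prs) {
    if (!pr.needsHumanReview || !pr.closedWithoutMerge) continue;
    for (const id of pr.playbookBulletIds) {
      rejections.set(id, (rejections.get(id) ?? 0) + 1);
    }
  }
  return Array.from(rejections.entries())
    .filter(([, count]) => count >= rejectionThreshold)
    .map(([id]) => id);
}
```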
The system gets better at knowing what it doesn’t know.
Confidence without evidence is noise. In software systems, it leads to over-reliance on automated decisions that haven’t earned that trust. The fix is structural: make evidence a required field, make confidence a derived value, and make the downstream behavior visibly different based on what confidence was earned.
An agent that says “I don’t know, here’s why, here’s what a human needs to check” is more valuable than one that confidently ships a fix it hasn’t actually verified.