Most agentic systems fail because one prompt tries to do everything. Splitting work into discrete phases — each with a single deliverable and enforced constraints — produces dramatically more reliable results. Here's the architecture.
There’s a seductive simplicity to the single-prompt agent: one system message, one loop, one model that does it all. It investigates the problem, writes the fix, validates the output, creates the PR, and posts the summary. It sounds clean.
In practice, it’s a disaster.
The model context becomes a graveyard of partially-completed thoughts. Exploration bleeds into implementation. Implementation bleeds into validation. The agent makes file edits in the middle of an investigation turn and overwrites its own reasoning. By the time it tries to create the PR, it has no clear picture of what it actually changed or why.
The fix is simple in concept: phases.
A phase is a unit of work with exactly one deliverable and explicitly forbidden actions.
Each phase gets its own system prompt that makes the constraints explicit. Not “try to avoid editing files” — but “do NOT call edit_code or write_file. You are in REFINE. The EXECUTE phase does the edits.”
The constraints aren’t enforced by a wrapper that blocks tool calls. They’re enforced by the system prompt. This is intentional — it trains the model to self-police rather than silently swallow tool errors.
A refine system prompt might open with:
## DO NOT EDIT FILES
You are in REFINE. You produce a PLAN.
Do NOT call edit_code, write_file, edit_file, or generate_unified_diff.
The EXECUTE phase does the edits.
When the model has been told clearly what its deliverable is, it stops trying to do everything else. The cognitive load of “should I fix this now or plan it?” is eliminated.
Each phase must produce a specific structured output. Refine produces a YAML plan. Execute produces committed file changes. Validate produces a PR URL and a Jira comment. If a phase completes without its deliverable, the run is considered failed — not just incomplete.
This matters because the deliverable is the interface between phases. The execute agent doesn’t read a transcript of what refine said — it reads the YAML. Structured output forces the refine agent to compress its findings into something actionable, which also catches cases where the agent explored a lot but concluded nothing.
When a single agent handles everything, the model’s context carries both the investigation artifacts (raw grep output, partial theories, dead ends) and the implementation state (which files were edited, what the diff looks like). These interfere with each other.
Splitting into phases means each agent starts with a clean context scoped to its job. The execute agent only knows: here is the plan, here are the files. It doesn’t know about the three alternative theories the refine agent considered and discarded.
This produces smaller, more focused context windows — which produces better outputs.
Three to five phases is enough for most engineering workflows. The temptation is to keep adding phases, but each boundary adds latency and coordination overhead.
The right signal to split a phase is when a single phase is regularly doing two things that could fail independently. If your “validate” phase both runs tests and creates the PR, a test failure now blocks the PR — even if the PR should still be created as a draft for human review. Split them, and each failure mode becomes independently recoverable.
Phase-gated pipelines aren’t a framework. They’re a discipline. The constraint isn’t technical — it’s a decision to give each agent exactly one job and hold it accountable to a single deliverable.