Takes a GitHub issue, produces a merged PR through 10 automated phases. No human writes code.
OVERVIEW
The coding pipeline takes a GitHub issue as input and produces a merged pull request as output. Ten automated phases handle everything from validating configuration to creating the PR. Each phase is gated -- nothing moves forward without passing its quality check. The builder never sees the behavioral contracts used to test its work. The reviewer is a higher-tier model than the builder. The scoring is deterministic, not LLM-based.
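The gating rule can be sketched as a runner that stops at the first failed quality check. This is a minimal illustration, not the pipeline's actual code: `PhaseResult`, `run_gated`, and the phase names used with it are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PhaseResult:
    passed: bool
    detail: str = ""


# A phase is a (name, callable) pair. Both are hypothetical stand-ins
# for whatever the real pipeline runs at each step.
Phase = Tuple[str, Callable[[], PhaseResult]]


def run_gated(phases: List[Phase]) -> Tuple[bool, List[str]]:
    """Run phases in order and stop at the first failed gate."""
    completed: List[str] = []
    for name, run in phases:
        result = run()
        if not result.passed:
            # Gate failed: later phases never execute.
            return False, completed
        completed.append(name)
    return True, completed
```

With three phases where the second fails its gate, `run_gated` returns `False` and a list holding only the first phase's name, and the third phase is never called.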
ARCHITECTURE
Validates configuration, checks git state, runs dedup detection to prevent redundant work.
Sonnet reads the issue and produces behavioral contracts (Section A) plus an implementation plan (Section B). Writes spec.md and spec-behavioral.md as structurally isolated files.
Sonnet adversarially reviews the spec. Checks specificity, trivial satisfaction, edge case coverage, implementation leakage, and alignment. Can REQUEST_CHANGES up to 4 times.
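A bounded review loop along these lines might look like the sketch below. The limit of 4 comes from the description above; everything else is an assumption, including the `review` and `revise` callables and the choice to fail the run once the budget is spent (the text does not say what happens after the fourth request).

```python
from typing import Callable, Tuple

MAX_CHANGE_REQUESTS = 4  # limit stated in the phase description

# Hypothetical signatures: review returns (verdict, feedback),
# revise returns a new spec text.
Review = Callable[[str], Tuple[str, str]]
Revise = Callable[[str, str], str]


def review_loop(
    spec: str,
    review: Review,
    revise: Revise,
    max_requests: int = MAX_CHANGE_REQUESTS,
) -> Tuple[bool, str, int]:
    """Return (approved, final_spec, change_requests_used)."""
    requests = 0
    while True:
        verdict, feedback = review(spec)
        if verdict == "APPROVE":
            return True, spec, requests
        if requests == max_requests:
            # Assumption: an unapproved spec with no budget left fails the gate.
            return False, spec, requests
        requests += 1
        spec = revise(spec, feedback)
```

The loop is bounded by construction: at most `max_requests` revisions and `max_requests + 1` reviews, so a reviewer that never approves cannot stall the pipeline.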
Sonnet writes pytest tests from spec-behavioral.md ONLY. Never sees implementation details. Tests become immutable constraints that the implementation must satisfy.
Sonnet writes the code. Has access to the full spec and codebase context. The implementation is guided by the spec but tested against behavioral contracts it cannot see.
Deterministic pytest execution. No LLM involved. The acceptance tests written in phase 4 are run against the implementation. Pass or fail -- no negotiation.
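One way to make this gate purely mechanical is to key it on pytest's process exit code. The sketch below assumes the acceptance tests live at a path passed in by the caller; `pytest_command`, `gate_passed`, and `run_acceptance` are hypothetical helper names.

```python
import subprocess
import sys
from typing import List


def pytest_command(test_path: str) -> List[str]:
    # Use the current interpreter so the tests run in the same environment.
    return [sys.executable, "-m", "pytest", "-q", test_path]


def gate_passed(returncode: int) -> bool:
    # pytest exits with 0 only when every collected test passed.
    # Any other code (1 = test failures, 5 = no tests collected, ...)
    # fails the gate.
    return returncode == 0


def run_acceptance(test_path: str) -> bool:
    """Run pytest on test_path and return the pass/fail verdict."""
    completed = subprocess.run(
        pytest_command(test_path),
        capture_output=True,
        text=True,
    )
    return gate_passed(completed.returncode)
```

Treating exit code 5 (no tests collected) as a failure is a deliberate choice in this sketch: an empty acceptance suite should not count as a pass.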
Opus reviews the implementation. Can APPROVE or REQUEST_CHANGES. If REQUEST_CHANGES, loops back to a fix phase. The reviewer is a higher-tier model than the builder.
Full test suite run. Deterministic execution of the entire test suite, not just acceptance tests. No LLM involvement.
Composite scorer combines acceptance results, review verdict, and test results. Rule-based, not LLM. Outputs a 0.0 to 1.0 confidence score. Deterministic quality gate.
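A rule-based composite of this kind could be as simple as a fixed weighted sum. The three inputs match the description above; the weights, the `Evidence` fields, and the handling of an empty test run are assumptions made for illustration, not the pipeline's actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    acceptance_passed: int
    acceptance_total: int
    review_verdict: str  # "APPROVE" or "REQUEST_CHANGES"
    suite_passed: int
    suite_total: int


# Hypothetical weights; the source does not state how inputs are combined.
WEIGHTS = {"acceptance": 0.4, "review": 0.3, "suite": 0.3}


def _rate(passed: int, total: int) -> float:
    # Assumption: a run with no tests contributes nothing to the score.
    return passed / total if total > 0 else 0.0


def composite_score(ev: Evidence) -> float:
    """Combine the three signals into a score between 0.0 and 1.0."""
    review = 1.0 if ev.review_verdict == "APPROVE" else 0.0
    score = (
        WEIGHTS["acceptance"] * _rate(ev.acceptance_passed, ev.acceptance_total)
        + WEIGHTS["review"] * review
        + WEIGHTS["suite"] * _rate(ev.suite_passed, ev.suite_total)
    )
    # Clamp and round so float noise cannot push the score out of range.
    return round(min(1.0, max(0.0, score)), 4)
```

Because the function is pure arithmetic over recorded results, the same evidence always yields the same score, which is what makes the gate deterministic.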
Creates PR, pushes branch. Routes based on score: 0.90 and above auto-merges, 0.75 up to (but not including) 0.90 goes to human review, below 0.50 is rejected.
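The routing thresholds translate directly into a small lookup. In this sketch the three thresholds are taken from the text; the `route` function and its return labels are hypothetical, and scores from 0.50 up to 0.75 return `"unspecified"` because the text does not say how that band is handled.

```python
AUTO_MERGE_MIN = 0.90
HUMAN_REVIEW_MIN = 0.75
REJECT_BELOW = 0.50


def route(score: float) -> str:
    """Map a confidence score to a routing decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= AUTO_MERGE_MIN:
        return "auto_merge"
    if score >= HUMAN_REVIEW_MIN:
        return "human_review"
    if score < REJECT_BELOW:
        return "reject"
    # 0.50 <= score < 0.75: not covered by the stated routing rules.
    return "unspecified"
```

A score of exactly 0.90 auto-merges here, since the top band is written as "0.90+".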
DESIGN
Behavioral contracts are separated from implementation guidance. The acceptance test agent never sees implementation details -- it tests behavior, not structure.
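The separation can be enforced by filtering what each agent is handed before it runs. The file names spec.md and spec-behavioral.md come from the pipeline description; the role names, the `VISIBILITY` table, and `context_for` are hypothetical, and the rule that the builder sees every file except the behavioral contracts is an assumption.

```python
from typing import Callable, Dict

BEHAVIORAL_SPEC = "spec-behavioral.md"

# Hypothetical per-role visibility rules, keyed by role name.
VISIBILITY: Dict[str, Callable[[str], bool]] = {
    # The acceptance test writer sees the behavioral contracts and nothing else.
    "acceptance_tests": lambda path: path == BEHAVIORAL_SPEC,
    # The builder sees the plan and the codebase, but never the contracts.
    "builder": lambda path: path != BEHAVIORAL_SPEC,
}


def context_for(role: str, files: Dict[str, str]) -> Dict[str, str]:
    """Return only the files the given role is allowed to read."""
    allowed = VISIBILITY.get(role)
    if allowed is None:
        # Deny by default: an unknown role gets no context at all.
        raise ValueError(f"unknown role: {role}")
    return {path: text for path, text in files.items() if allowed(path)}
```

Filtering at the point where context is assembled means isolation does not depend on prompting a model to ignore what it has been shown.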
Spec review catches vague specs before they waste tokens. A bad spec discovered at the code review phase costs 10x more than one caught at the spec phase.
Acceptance tests are written before implementation exists. The tests define what "done" means. The builder writes code to pass them, not the other way around.
Review uses Opus (highest tier). The reviewer is a more capable model than the builder (Sonnet). You want your critic to be sharper than your author.
Scoring is deterministic, not LLM-based. Combines acceptance results, review verdict, and test results into a 0.0-1.0 score. No model can talk its way past the gate.
RESULTS