CODING PIPELINE

From Issue to Merged PR

Takes a GitHub issue, produces a merged PR through 10 automated phases. No human writes code.

OVERVIEW

What it does

The coding pipeline takes a GitHub issue as input and produces a merged pull request as output. Ten automated phases handle everything from validating configuration to creating the PR. Each phase is gated -- nothing moves forward without passing its quality check. The builder never sees the behavioral contracts used to test its work. The reviewer is a higher-tier model than the builder. The scoring is deterministic, not LLM-based.
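
The gating rule can be pictured as a plain loop that stops at the first phase whose check fails. The following Python sketch is illustrative only, not the pipeline's actual code: `PhaseResult`, `Phase`, and `run_pipeline` are hypothetical names, and each phase is assumed to be a callable that reports whether it passed its gate.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class PhaseResult:
    passed: bool       # did the phase clear its quality gate?
    detail: str = ""   # human-readable reason, mainly useful on failure


# Hypothetical shape: a phase is a name plus a callable over shared state.
Phase = Tuple[str, Callable[[Dict], PhaseResult]]


def run_pipeline(phases: List[Phase], state: Dict) -> Tuple[bool, List[str]]:
    """Run phases in order; stop at the first one that fails its gate."""
    completed: List[str] = []
    for name, run in phases:
        result = run(state)
        if not result.passed:
            # Nothing after a failed gate is executed.
            return False, completed
        completed.append(name)
    return True, completed
```

Returning the list of completed phases alongside the outcome makes it easy to see exactly where a run stopped.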

ARCHITECTURE

Pipeline phases

01 Preflight

Validates configuration, checks git state, runs dedup detection to prevent redundant work.

02 Spec

Sonnet reads the issue and produces behavioral contracts (Section A) plus an implementation plan (Section B). Writes spec.md and spec-behavioral.md as structurally isolated files.

03 Spec Adversary

Sonnet adversarially reviews the spec. Checks specificity, trivial satisfaction, edge case coverage, implementation leakage, and alignment. Can REQUEST_CHANGES up to 4 times.

04 Acceptance Tests

Sonnet writes pytest tests from spec-behavioral.md ONLY. Never sees implementation details. Tests become immutable constraints that the implementation must satisfy.

05 Implement

Sonnet writes the code. Has access to the full spec and codebase context. The implementation is guided by the spec but tested against behavioral contracts it cannot see.

06 Acceptance Run

Deterministic pytest execution. No LLM involved. The acceptance tests written in phase 4 are run against the implementation. Pass or fail -- no negotiation.

07 Review

Opus reviews the implementation. Can APPROVE or REQUEST_CHANGES. If REQUEST_CHANGES, loops back to a fix phase. The reviewer is a higher-tier model than the builder.

08 Test

Full test suite run. Deterministic execution of the entire test suite, not just acceptance tests. No LLM involvement.

09 Scoring

Composite scorer combines acceptance results, review verdict, and test results. Rule-based, not LLM. Outputs a 0.0 to 1.0 confidence score. Deterministic quality gate.

10 Postflight

Pushes the branch and creates the PR. Routes based on score: 0.90 and above auto-merge, 0.75 up to (but not including) 0.90 human review, below 0.50 reject.
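
The score-based routing in the Postflight phase can be sketched as a small threshold function. This is an assumption-laden illustration: `route` and its return labels are hypothetical names, the shared 0.90 boundary is assigned to auto-merge, and scores from 0.50 up to 0.75 are not covered by the description above, so the sketch returns an explicit "unspecified" label rather than guessing.

```python
def route(score: float) -> str:
    """Map a 0.0-1.0 confidence score to a routing decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score}")
    if score >= 0.90:
        return "auto-merge"
    if score >= 0.75:
        return "human-review"
    if score < 0.50:
        return "reject"
    # 0.50 <= score < 0.75: behavior is not stated in the description.
    return "unspecified"
```

Checking the highest threshold first keeps each branch to a single comparison and makes the boundary ownership explicit.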

DESIGN

Key decisions

Two spec files, not one

Behavioral contracts are separated from implementation guidance. The acceptance test agent never sees implementation details -- it tests behavior, not structure.
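
One way to enforce this separation is to filter the files each agent receives before the model is called. The Python sketch below is hypothetical: `VISIBLE_FILES` and `build_context` are invented names, and the entry for the implement phase (spec.md only) is an assumption based on the statement that the builder never sees the behavioral contracts.

```python
from typing import Dict, Set

# Which spec files each phase may read. Only the acceptance-test entry is
# stated directly in the text; the "implement" entry is an assumption.
VISIBLE_FILES: Dict[str, Set[str]] = {
    "acceptance_tests": {"spec-behavioral.md"},
    "implement": {"spec.md"},
}


def build_context(phase: str, workspace: Dict[str, str]) -> Dict[str, str]:
    """Return only the files the given phase is allowed to see."""
    allowed = VISIBLE_FILES.get(phase, set())
    return {name: text for name, text in workspace.items() if name in allowed}
```

Because an unknown phase gets an empty set, a missing entry results in no spec access at all rather than full access.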

Adversarial review before building

Catches vague specs before they waste tokens. A bad spec discovered at the review phase costs 10x more than one caught at the spec phase.
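
The adversarial loop can be sketched as a bounded review-and-revise cycle. The limit of 4 change requests comes from the Spec Adversary phase description; everything else here is assumed for illustration, including the `refine_spec` name, the `review` and `revise` callables, the use of an "APPROVE" verdict for the adversary, and the choice to fail once the budget is exhausted.

```python
from typing import Callable, Tuple

MAX_CHANGE_REQUESTS = 4  # "Can REQUEST_CHANGES up to 4 times"


def refine_spec(
    spec: str,
    review: Callable[[str], Tuple[str, str]],   # returns (verdict, feedback)
    revise: Callable[[str, str], str],          # returns a revised spec
) -> Tuple[str, bool]:
    """Revise the spec until it is approved or the request budget runs out."""
    requests = 0
    while True:
        verdict, feedback = review(spec)
        if verdict == "APPROVE":
            return spec, True
        if requests == MAX_CHANGE_REQUESTS:
            # Assumption: an exhausted budget fails the gate.
            return spec, False
        requests += 1
        spec = revise(spec, feedback)
```

With this shape, a spec is reviewed at most five times and revised at most four, so a persistently vague spec cannot loop forever.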

Tests before code

Acceptance tests are written before implementation exists. The tests define what "done" means. The builder writes code to pass them, not the other way around.
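
Treating the tests as the definition of "done" means the acceptance run can reduce to pytest's exit code, where 0 means every collected test passed. The sketch below is a minimal illustration under stated assumptions: `acceptance_passed` is a hypothetical name, the test path is supplied by the caller, and the `run` parameter exists only so the function can be exercised without invoking pytest.

```python
import subprocess
import sys
from typing import Callable


def acceptance_passed(
    test_path: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Run pytest on test_path and report pass/fail from the exit code."""
    cmd = [sys.executable, "-m", "pytest", test_path, "-q"]
    completed = run(cmd, capture_output=True, text=True)
    # pytest exits with 0 only when all collected tests pass.
    return completed.returncode == 0
```

No model output is consulted at any point, which is what makes the result non-negotiable.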

Reviewer smarter than builder

Review uses Opus (highest tier). The reviewer is a more capable model than the builder (Sonnet). You want your critic to be sharper than your author.

Rule-based scoring

Scoring is deterministic, not LLM-based. Combines acceptance results, review verdict, and test results into a 0.0-1.0 score. No model can talk its way past the gate.
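
A rule-based composite score could look like the following Python sketch. The three inputs match the ones named above, but the weights, the `Signals` fields, and the `composite_score` function are all hypothetical; the actual scoring rules are not given in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    acceptance_passed: int
    acceptance_total: int
    review_approved: bool
    tests_passed: int
    tests_total: int


# Hypothetical weights chosen for illustration; they sum to 1.0.
WEIGHTS = {"acceptance": 0.4, "review": 0.3, "tests": 0.3}


def _ratio(passed: int, total: int) -> float:
    """Pass ratio, treating an empty run as 0.0 instead of dividing by zero."""
    return passed / total if total > 0 else 0.0


def composite_score(s: Signals) -> float:
    """Combine the three signals into a 0.0-1.0 score using fixed rules."""
    score = (
        WEIGHTS["acceptance"] * _ratio(s.acceptance_passed, s.acceptance_total)
        + WEIGHTS["review"] * (1.0 if s.review_approved else 0.0)
        + WEIGHTS["tests"] * _ratio(s.tests_passed, s.tests_total)
    )
    # Clamp and round so identical inputs always yield identical output.
    return round(min(1.0, max(0.0, score)), 4)
```

Since the function is pure arithmetic over counted results, the same inputs always produce the same score, and there is no prompt for a model to argue with.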

RESULTS

Production numbers

Metrics shown: pipeline runs, pipeline phases, tests in suite, tool types (read/write/bash/grep/edit/glob).