Subplan 08 - Orchestration, Checkpoints, And QA

Goal

Keep agent work resumable, auditable, and constrained.

Required Files

  • .agents/ORCHESTRATION.md
  • .agents/CHECKPOINT.md
  • .agents/checkpoint.json
  • .agents/subplans/README.md
  • .agents/runs/<run-id>/...

Run Creation

For each substantial task:

  1. Create .agents/runs/<run-id>/.
  2. Copy relevant acceptance criteria into brief.md.
  3. Record the active branch.
  4. Record allowed write paths.
  5. Update the checkpoint files before invoking Opus.

Manager planning sessions count as substantial tasks when the user expects observability. Create a run record for planning work too, with the Codex manager recorded as the active agent.
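
The run creation steps above can be sketched as a pure function that builds the records before anything is written to disk. This is a minimal sketch, not the project's actual tooling: createRunRecord and the snake_case field names (active_run_id, allowed_write_paths, active_agent) are assumptions made for illustration.

```javascript
// Hypothetical helper: builds the records a manager would write when opening a run.
// All field names below are assumptions; the plan only names the concepts.
function createRunRecord({ runId, branch, allowedWritePaths, acceptanceCriteria, activeAgent, now = new Date() }) {
  if (!runId) throw new Error("runId is required");
  return {
    // Step 1: the run directory.
    runDir: `.agents/runs/${runId}/`,
    // Step 2: contents for brief.md, with the acceptance criteria copied verbatim.
    brief: ["Acceptance criteria", ...acceptanceCriteria.map((c) => `- ${c}`)].join("\n"),
    // Steps 3 to 5: branch and write paths recorded in the checkpoint
    // that is saved before Opus is invoked.
    checkpoint: {
      timestamp: now.toISOString(),
      active_run_id: runId,
      current_branch: branch,
      allowed_write_paths: [...allowedWritePaths],
      active_agent: activeAgent,
    },
  };
}
```

A planning session would use the same function with activeAgent set to the Codex manager, so planning work shows up in status output like any other run.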

Agent Liveness

Each active run should include:

.agents/runs/<run-id>/
  state.json
  agents.json

state.json records:

  • run id
  • current phase
  • current branch
  • input artifact
  • input hash
  • expected output artifact
  • retry count
  • next action
  • blocker, if any
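
As an illustration of the fields listed above, a new state.json could be produced like this. The key names are assumptions (the plan names the fields but not their exact keys), and makeInitialState is a hypothetical helper.

```javascript
// Hypothetical helper: returns the initial contents of state.json for a run.
// Key names are assumed; adjust them to whatever status.mjs actually reads.
function makeInitialState({ runId, phase, branch, inputArtifact, inputHash, expectedOutput, nextAction }) {
  return {
    run_id: runId,
    current_phase: phase,
    current_branch: branch,
    input_artifact: inputArtifact,
    input_hash: inputHash,
    expected_output_artifact: expectedOutput,
    retry_count: 0, // no retries yet on a fresh run
    next_action: nextAction,
    blocker: null, // "if any": null means the run is not blocked
  };
}

// Serialize with stable two-space indentation so diffs stay readable.
function serializeState(state) {
  return JSON.stringify(state, null, 2) + "\n";
}
```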

agents.json records:

  • role
  • status: queued, running, completed, failed, dead, restarted
  • started_at
  • heartbeat_at
  • lease_expires_at
  • thread id, process id, or tool call id when available
  • last_error
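
The heartbeat and lease fields make liveness checkable from files alone. The sketch below assumes ISO 8601 timestamps and treats "live", "expired", and "unknown" as display labels rather than stored statuses; leaseStatus and renewLease are hypothetical names.

```javascript
const THIRTY_MINUTES_MS = 30 * 60 * 1000;

// Hypothetical: classifies one agents.json entry at time `now`.
// Only running agents hold a lease worth checking; other statuses pass through.
function leaseStatus(agent, now = new Date()) {
  if (agent.status !== "running") return agent.status;
  const expires = Date.parse(agent.lease_expires_at);
  if (Number.isNaN(expires)) return "unknown";
  return expires <= now.getTime() ? "expired" : "live";
}

// Hypothetical: returns a copy of the entry with a fresh heartbeat and lease.
function renewLease(agent, now = new Date(), leaseMs = THIRTY_MINUTES_MS) {
  return {
    ...agent,
    heartbeat_at: now.toISOString(),
    lease_expires_at: new Date(now.getTime() + leaseMs).toISOString(),
  };
}
```

A running agent whose lease has expired is a candidate for the dead status and the restart policy below.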

Status And Progress Tail

Use the local status command from the workspace root:

node .agents/scripts/status.mjs --tail 40

For a browser dashboard:

node .agents/scripts/observe.mjs --port 4317

Then open http://127.0.0.1:4317.

The status command reads:

  • .agents/checkpoint.json
  • every .agents/runs/<run-id>/state.json
  • every .agents/runs/<run-id>/agents.json

It prints:

  • all known runs
  • current phase and branch
  • all recorded agents and their lease status
  • expired leases
  • blockers
  • the next action
  • the tail of the active run's progress file

Progress files are checked in this order:

  • progress.log
  • implementation-log.md
  • validation.md
  • opus-proposal.md
  • snarky-review.md
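
One plausible reading of this order is "first file that exists wins". The sketch below implements that reading; it is an assumption about status.mjs, not a copy of it, and pickProgressFile and tailLines are hypothetical names.

```javascript
// Order taken from the list above.
const PROGRESS_FILE_ORDER = [
  "progress.log",
  "implementation-log.md",
  "validation.md",
  "opus-proposal.md",
  "snarky-review.md",
];

// Returns the first progress file present in the run directory, or null.
function pickProgressFile(filesInRunDir) {
  const present = new Set(filesInRunDir);
  return PROGRESS_FILE_ORDER.find((name) => present.has(name)) ?? null;
}

// Returns the last `count` lines of a file's text, ignoring one trailing newline.
function tailLines(text, count) {
  if (count <= 0) return [];
  const lines = text.split("\n");
  if (lines[lines.length - 1] === "") lines.pop();
  return lines.slice(-count);
}
```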

Managers should append progress events to progress.log whenever possible. Human-readable phase artifacts still stay in their named markdown files.

Agent Restart Policy

If an agent dies, the manager restarts the role from durable files, not memory.

Snarky restart:

  • Read PLAN.md, .agents/ORCHESTRATION.md, checkpoint files, and active run artifacts.
  • Re-run the current Snarky phase using the same input artifact.
  • Replace only the expected Snarky output for that phase.

Opus proposal/review restart:

  • Re-run the same opus-runner planning tool with the same input artifact.
  • Keep previous failed output, if any, as diagnostic context.
  • Do not advance until the expected output validates.
  • Use a lease and CLI/tool timeout of at least 30 minutes.

Opus implementation restart:

  • Check current branch.
  • Check dirty state.
  • If the branch is clean, retry the same implementation instruction.
  • If the branch is dirty, the manager must inspect the diff and decide whether to continue, ask Opus to repair, or ask the user.
  • Never auto-reset or discard partial Opus edits.
  • Use a lease and CLI/tool timeout of at least 30 minutes.
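
The implementation restart rules reduce to a small decision function. This sketch is illustrative only: the action names are invented, and treating a branch mismatch as a manager review is an assumption, since the policy says to check the branch but does not say what a mismatch means.

```javascript
// Hypothetical decision helper for restarting a dead Opus implementation.
// It never proposes a reset or discard, matching the policy above.
function decideImplementationRestart({ currentBranch, expectedBranch, isDirty }) {
  if (currentBranch !== expectedBranch) {
    // Assumption: an unexpected branch needs a human-in-the-loop decision.
    return { action: "manager_review", reason: "branch_mismatch", options: ["ask_user"] };
  }
  if (isDirty) {
    return {
      action: "manager_review",
      reason: "dirty_branch",
      options: ["continue", "ask_opus_to_repair", "ask_user"],
    };
  }
  return { action: "retry_same_instruction", reason: "clean_branch", options: [] };
}
```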

Retry limits:

  • Retry a dead planning phase up to 2 times.
  • Retry an implementation phase up to 1 time without user input.
  • After the retry cap, record a blocker in checkpoint files.
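
A sketch of the retry caps, assuming retry_count in state.json counts retries already attempted for the current phase. nextStepAfterDeath and the returned action names are hypothetical.

```javascript
// Caps from the list above: 2 retries for planning, 1 for implementation.
const RETRY_CAPS = { planning: 2, implementation: 1 };

// Decides what the manager does after an agent dies in a given kind of phase.
function nextStepAfterDeath(phaseKind, retryCount) {
  const cap = RETRY_CAPS[phaseKind];
  if (cap === undefined) throw new Error(`unknown phase kind: ${phaseKind}`);
  if (retryCount < cap) {
    return { action: "retry", retry_count: retryCount + 1 };
  }
  return {
    action: "record_blocker",
    retry_count: retryCount,
    blocker: `${phaseKind} phase reached its retry cap (${cap})`,
  };
}
```

For example, a planning phase that has already been retried twice yields record_blocker, while an implementation phase with zero retries yields one more retry.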

Checkpoint Policy

Update checkpoints:

  • at the start of a run
  • after Snarky brief
  • after Opus proposal
  • after concern resolution
  • before Opus implementation
  • after implementation
  • after validation
  • before ending the turn

Each checkpoint must include:

  • timestamp
  • current phase
  • current branch
  • active run id
  • completed artifacts
  • next action
  • blockers
  • commands/checks already run
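
A manager could refuse to save an incomplete checkpoint. The check below is a sketch that assumes snake_case keys in checkpoint.json; in particular checks_run is an invented key for "commands/checks already run".

```javascript
// Assumed key names for the required checkpoint fields listed above.
const REQUIRED_CHECKPOINT_FIELDS = [
  "timestamp",
  "current_phase",
  "current_branch",
  "active_run_id",
  "completed_artifacts",
  "next_action",
  "blockers",
  "checks_run",
];

// Returns the required keys that are absent, null, or undefined.
// Empty arrays are accepted: "no blockers" is a valid, explicit value.
function missingCheckpointFields(checkpoint) {
  return REQUIRED_CHECKPOINT_FIELDS.filter(
    (key) => checkpoint[key] === undefined || checkpoint[key] === null
  );
}
```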

Validation Layers

Manager validation:

  • git branch and diff check
  • out-of-scope file check
  • syntax checks
  • unit/integration tests when available
  • browser/screenshot checks for UI work when available
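
The out-of-scope file check can be sketched as a comparison between changed files and the allowed write paths recorded at run creation. This assumes the manager already has a list of changed paths (for example from git diff --name-only) and that allowed paths are either exact files or directory prefixes; findOutOfScopeFiles is a hypothetical name.

```javascript
// Returns the changed files that fall outside every allowed write path.
// An allowed path matches a file exactly, or matches as a directory prefix.
function findOutOfScopeFiles(changedFiles, allowedWritePaths) {
  const isAllowed = (file) =>
    allowedWritePaths.some((allowed) => {
      const dir = allowed.endsWith("/") ? allowed : `${allowed}/`;
      return file === allowed || file.startsWith(dir);
    });
  return changedFiles.filter((file) => !isAllowed(file));
}
```

Appending the slash before the prefix test keeps an allowed path of src from accidentally covering srcx/file.js.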

Snarky validation:

  • product requirements
  • acceptance criteria
  • install flow accuracy
  • scope discipline

Opus validation:

  • layout/design quality
  • interaction clarity
  • technical design concerns

Resume Procedure

After power loss or interruption:

  1. Read .agents/CHECKPOINT.md.
  2. Read .agents/checkpoint.json.
  3. Check git branch and dirty state.
  4. Read the active run directory if present.
  5. Continue from next_action.
  6. Do not assume an Opus implementation completed unless validation is recorded.
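
The resume procedure can be sketched as a function over the checkpoint and the observed git state. The key names and the rule that validation counts as recorded when completed_artifacts contains validation.md are assumptions for illustration.

```javascript
// Hypothetical: turns a loaded checkpoint plus observed git state into a resume plan.
function planResume(checkpoint, git) {
  const warnings = [];
  if (git.branch !== checkpoint.current_branch) {
    warnings.push(`on ${git.branch}, checkpoint expects ${checkpoint.current_branch}`);
  }
  if (git.isDirty) {
    warnings.push("working tree is dirty; inspect the diff before continuing");
  }
  // Assumption: a recorded validation artifact is the only proof of completion.
  const validationRecorded = (checkpoint.completed_artifacts || []).includes("validation.md");
  return {
    next_action: checkpoint.next_action,
    implementation_confirmed: validationRecorded,
    warnings,
  };
}
```

A plan with implementation_confirmed set to false means the manager should treat any Opus edits found on the branch as partial work, not as a finished phase.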

Current Known Setup Issue

The distribution workspace is not currently a git repository. Branch-based Opus implementation requires initializing git or moving these files into a repo with a clean baseline commit.