Subplan 08 - Orchestration, Checkpoints, And QA
Goal
Keep agent work resumable, auditable, and constrained.
Required Files
- .agents/ORCHESTRATION.md
- .agents/CHECKPOINT.md
- .agents/checkpoint.json
- .agents/subplans/README.md
- .agents/runs/<run-id>/...
Run Creation
For each substantial task:
- Create .agents/runs/<run-id>/.
- Copy relevant acceptance criteria into brief.md.
- Record the active branch.
- Record allowed write paths.
- Update checkpoint before invoking Opus.
Manager planning sessions count as substantial tasks when the user expects observability. Create a run record for planning work too, with Codex manager as the active agent.
Agent Liveness
Each active run should include:
.agents/runs/<run-id>/
  state.json
  agents.json
state.json records:
- run id
- current phase
- current branch
- input artifact
- input hash
- expected output artifact
- retry count
- next action
- blocker, if any
agents.json records:
- role
- status: queued, running, completed, failed, dead, restarted
- started_at
- heartbeat_at
- lease_expires_at
- thread id, process id, or tool call id when available
- last_error
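As an illustration of how the lease fields might be used, here is a minimal sketch that classifies one agents.json entry as live or expired. The snake_case key names for status and lease_expires_at follow the list above; treating completed, failed, and dead as states that no longer hold a lease is an assumption, as is the function name leaseStatus.

```javascript
// Sketch only: assumes ISO 8601 timestamps and that completed, failed,
// and dead agents no longer hold a lease (not stated in this plan).
const NO_LEASE_STATUSES = new Set(["completed", "failed", "dead"]);

/**
 * Classify an agent record's lease relative to `now`.
 * Returns "n/a", "unknown", "expired", or "live".
 */
function leaseStatus(agent, now = new Date()) {
  if (NO_LEASE_STATUSES.has(agent.status)) return "n/a";
  const expiresAt = Date.parse(agent.lease_expires_at);
  // A missing or unparseable timestamp cannot be judged either way.
  if (Number.isNaN(expiresAt)) return "unknown";
  return expiresAt <= now.getTime() ? "expired" : "live";
}

// Hypothetical record for illustration; the role name is an assumption.
const exampleAgent = {
  role: "opus",
  status: "running",
  started_at: "2025-01-01T10:00:00Z",
  heartbeat_at: "2025-01-01T10:05:00Z",
  lease_expires_at: "2025-01-01T10:35:00Z",
  last_error: null,
};

console.log(leaseStatus(exampleAgent, new Date("2025-01-01T10:40:00Z")));
```

A manager could run a check like this over every agent in a run and treat an expired lease on a running agent as a signal to mark it dead and apply the restart policy.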
Status And Progress Tail
Use the local status command from the workspace root:
node .agents/scripts/status.mjs --tail 40
For a browser dashboard:
node .agents/scripts/observe.mjs --port 4317
Then open http://127.0.0.1:4317.
The command reads:
- .agents/checkpoint.json
- every .agents/runs/<run-id>/state.json
- every .agents/runs/<run-id>/agents.json
It prints:
- all known runs
- current phase and branch
- all recorded agents and their lease status
- expired leases
- blockers
- the next action
- the tail of the active run's progress file
Progress files are checked in this order:
- progress.log
- implementation-log.md
- validation.md
- opus-proposal.md
- snarky-review.md
Managers should append progress events to progress.log whenever possible.
Human-readable phase artifacts still stay in their named markdown files.
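The progress-file lookup could be sketched as follows. This is not the actual status.mjs code; it assumes that "checked in this order" means the first file that exists in the run directory is the one tailed. The function names pickProgressFile and tailLines are hypothetical.

```javascript
// Priority order for progress files, as listed in this section.
const PROGRESS_FILE_ORDER = [
  "progress.log",
  "implementation-log.md",
  "validation.md",
  "opus-proposal.md",
  "snarky-review.md",
];

/**
 * Return the highest-priority progress file present in a run directory,
 * or null when none exists. `fileNames` is the directory listing.
 */
function pickProgressFile(fileNames) {
  const present = new Set(fileNames);
  return PROGRESS_FILE_ORDER.find((name) => present.has(name)) ?? null;
}

/** Return the last `count` lines of `text`, ignoring one trailing newline. */
function tailLines(text, count) {
  if (count <= 0) return [];
  const lines = text.split(/\r?\n/);
  if (lines[lines.length - 1] === "") lines.pop();
  return lines.slice(-count);
}

console.log(pickProgressFile(["brief.md", "validation.md", "opus-proposal.md"]));
console.log(tailLines("phase: brief\nphase: proposal\nphase: validation\n", 2));
```

Keeping the selection and the tailing as pure functions means the file reads themselves can stay in the script's entry point.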
Agent Restart Policy
If an agent dies, the manager restarts the role from durable files, not memory.
Snarky restart:
- Read PLAN.md, .agents/ORCHESTRATION.md, checkpoint files, and active run artifacts.
- Re-run the current Snarky phase using the same input artifact.
- Replace only the expected Snarky output for that phase.
Opus proposal/review restart:
- Re-run the same opus-runner planning tool with the same input artifact.
- Keep previous failed output, if any, as diagnostic context.
- Do not advance until the expected output validates.
- Use a lease and CLI/tool timeout of at least 30 minutes.
Opus implementation restart:
- Check current branch.
- Check dirty state.
- If the branch is clean, retry the same implementation instruction.
- If the branch is dirty, the manager must inspect the diff and decide whether to continue, ask Opus to repair, or ask the user.
- Never auto-reset or discard partial Opus edits.
- Use a lease and CLI/tool timeout of at least 30 minutes.
Retry limits:
- Retry a dead planning phase up to 2 times.
- Retry an implementation phase up to 1 time without user input.
- After the retry cap, record a blocker in checkpoint files.
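The retry caps above can be expressed as a small decision function. This is a sketch under assumptions: retryCount is taken to mean the number of retries already attempted (matching the retry count recorded in state.json), and the phase kind labels "planning" and "implementation" and the returned shape are hypothetical.

```javascript
// Retry caps from this section: 2 for planning, 1 for implementation.
const RETRY_CAPS = new Map([
  ["planning", 2],
  ["implementation", 1],
]);

/**
 * Decide whether a dead phase may be retried without user input.
 * `retryCount` is the number of retries already made for this phase.
 */
function retryDecision(phaseKind, retryCount) {
  if (!RETRY_CAPS.has(phaseKind)) {
    throw new Error(`unknown phase kind: ${phaseKind}`);
  }
  const cap = RETRY_CAPS.get(phaseKind);
  if (retryCount < cap) {
    return { action: "retry", retryCount: retryCount + 1 };
  }
  // Cap reached: the caller should record this blocker in checkpoint files.
  return {
    action: "block",
    blocker: `retry cap (${cap}) reached for ${phaseKind} phase`,
  };
}

console.log(retryDecision("planning", 1));
console.log(retryDecision("implementation", 1));
```

For an implementation phase, a "retry" result still only applies when the branch is clean; a dirty branch goes through the manager's diff inspection first.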
Checkpoint Policy
Update checkpoints:
- at the start of a run
- after Snarky brief
- after Opus proposal
- after concern resolution
- before Opus implementation
- after implementation
- after validation
- before ending the turn
Checkpoint must include:
- timestamp
- current phase
- current branch
- active run id
- completed artifacts
- next action
- blockers
- commands/checks already run
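A checkpoint writer could verify the required contents before saving. The sketch below assumes snake_case key names for .agents/checkpoint.json; only next_action appears elsewhere in this plan, so the other key names, and the function name missingCheckpointFields, are hypothetical.

```javascript
// Assumed key names for the required checkpoint contents listed above.
const REQUIRED_CHECKPOINT_FIELDS = [
  "timestamp",
  "current_phase",
  "current_branch",
  "active_run_id",
  "completed_artifacts",
  "next_action",
  "blockers",
  "checks_run",
];

/**
 * Return the required fields that are absent (undefined or null).
 * Empty arrays count as present, so "no blockers" can be recorded as [].
 */
function missingCheckpointFields(checkpoint) {
  return REQUIRED_CHECKPOINT_FIELDS.filter(
    (field) => checkpoint[field] === undefined || checkpoint[field] === null
  );
}

// Hypothetical checkpoint that is missing its list of checks already run.
const draftCheckpoint = {
  timestamp: "2025-01-01T10:00:00Z",
  current_phase: "opus-proposal",
  current_branch: "main",
  active_run_id: "run-001",
  completed_artifacts: ["brief.md"],
  next_action: "resolve concerns",
  blockers: [],
};

console.log(missingCheckpointFields(draftCheckpoint));
```

Refusing to write a checkpoint with missing fields keeps the resume procedure from starting with incomplete state.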
Validation Layers
Manager validation:
- git branch and diff check
- out-of-scope file check
- syntax checks
- unit/integration tests when available
- browser/screenshot checks for UI work when available
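The out-of-scope file check can be sketched as a comparison between the changed files and the allowed write paths recorded at run creation. This is an illustrative helper, not an existing script: the name outOfScopeFiles is hypothetical, and it assumes the changed-file list comes from something like `git diff --name-only` and that allowed paths are exact files or directory prefixes.

```javascript
/**
 * Return the changed files that fall outside every allowed write path.
 * An allowed path matches itself exactly or any file beneath it.
 */
function outOfScopeFiles(changedFiles, allowedPaths) {
  // Normalize separators, drop a leading "./", and drop trailing slashes.
  const normalize = (p) =>
    p.replace(/\\/g, "/").replace(/^\.\//, "").replace(/\/+$/, "");
  const allowed = allowedPaths.map(normalize);
  return changedFiles
    .map(normalize)
    .filter(
      (file) =>
        !allowed.some(
          (prefix) => file === prefix || file.startsWith(prefix + "/")
        )
    );
}

// Hypothetical inputs for illustration.
const changed = ["src/app.js", "README.md", "src-old/legacy.js", "./docs/guide.md"];
console.log(outOfScopeFiles(changed, ["src/", "docs"]));
```

Matching on `prefix + "/"` rather than a bare prefix avoids treating src-old/ as being inside src/.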
Snarky validation:
- product requirements
- acceptance criteria
- install flow accuracy
- scope discipline
Opus validation:
- layout/design quality
- interaction clarity
- technical design concerns
Resume Procedure
After power loss or interruption:
- Read .agents/CHECKPOINT.md.
- Read .agents/checkpoint.json.
- Check git branch and dirty state.
- Read the active run directory if present.
- Continue from next_action.
- Do not assume an Opus implementation completed unless validation is recorded.
Current Known Setup Issue
The distribution workspace is not currently a git repository. Branch-based Opus implementation requires initializing git or moving these files into a repo with a clean baseline commit.