# Subplan 08 - Orchestration, Checkpoints, And QA

## Goal

Keep agent work resumable, auditable, and constrained.

## Required Files

- `.agents/ORCHESTRATION.md`
- `.agents/CHECKPOINT.md`
- `.agents/checkpoint.json`
- `.agents/subplans/README.md`
- `.agents/runs/<run-id>/...`

## Run Creation

For each substantial task:

1. Create `.agents/runs/<run-id>/`.
2. Copy relevant acceptance criteria into `brief.md`.
3. Record the active branch.
4. Record allowed write paths.
5. Update the checkpoint before invoking Opus.

Manager planning sessions count as substantial tasks when the user expects
observability. Create a run record for planning work too, with the Codex
manager as the active agent.

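The run-creation steps above could be captured in a small helper. This is a minimal sketch, not the project's actual tooling: `createRunRecord`, every field name, and the `codex-manager` default are assumptions made for illustration.

```javascript
// Hypothetical helper: builds the in-memory record for a new run.
// Field names are illustrative assumptions, not a documented schema.
function createRunRecord({
  runId,
  branch,
  allowedWritePaths,
  acceptanceCriteria,
  activeAgent = "codex-manager", // assumed label for the Codex manager
}) {
  if (!runId) throw new Error("runId is required");
  if (!branch) throw new Error("branch is required");

  // Step 2: the acceptance criteria that would be written to brief.md.
  const briefMd = [
    "# Brief",
    "",
    "## Acceptance Criteria",
    ...acceptanceCriteria.map((criterion) => `- ${criterion}`),
  ].join("\n");

  return {
    run_dir: `.agents/runs/${runId}/`,            // step 1
    brief_md: briefMd,                            // step 2
    branch,                                       // step 3
    allowed_write_paths: [...allowedWritePaths],  // step 4
    active_agent: activeAgent,
  };
}

const record = createRunRecord({
  runId: "example-run",
  branch: "feature/example",
  allowedWritePaths: ["src/", "docs/"],
  acceptanceCriteria: ["Status command lists all runs"],
});
console.log(record.run_dir);
```

The caller would still need to write these values to disk and update the checkpoint (step 5) before invoking Opus; the sketch only assembles the data.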
## Agent Liveness

Each active run should include:

```txt
.agents/runs/<run-id>/
  state.json
  agents.json
```

`state.json` records:

- run id
- current phase
- current branch
- input artifact
- input hash
- expected output artifact
- retry count
- next action
- blocker, if any

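As an illustration of the fields listed above, a `state.json` payload and a completeness check might look like the following. The snake_case key names, the phase name, and the `sha256:` hash prefix are assumptions; the list above names the fields only in prose.

```javascript
// Assumed snake_case keys mirroring the bullet list; the real schema may differ.
const REQUIRED_STATE_KEYS = [
  "run_id",
  "current_phase",
  "current_branch",
  "input_artifact",
  "input_hash",
  "expected_output_artifact",
  "retry_count",
  "next_action",
];

// Returns the required keys that are absent from a state object.
// `blocker` is not required because it is recorded only "if any".
function missingStateKeys(state) {
  return REQUIRED_STATE_KEYS.filter(
    (key) => state[key] === undefined || state[key] === null
  );
}

// Hypothetical example values.
const exampleState = {
  run_id: "example-run",
  current_phase: "opus-proposal",
  current_branch: "feature/example",
  input_artifact: "brief.md",
  input_hash: "sha256:<hex digest of brief.md>",
  expected_output_artifact: "opus-proposal.md",
  retry_count: 0,
  next_action: "Review the Opus proposal",
  blocker: null,
};

console.log(missingStateKeys(exampleState).length);              // 0
console.log(missingStateKeys({ run_id: "example-run" }).length); // 7
```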
`agents.json` records:

- role
- status: queued, running, completed, failed, dead, restarted
- started_at
- heartbeat_at
- lease_expires_at
- thread id, process id, or tool call id when available
- last_error

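The lease fields above suggest a simple liveness check. The sketch below assumes ISO 8601 timestamps, assumes that only a `running` agent can hold an expired lease, and assumes that a heartbeat extends the lease; `isLeaseExpired` and `heartbeat` are hypothetical names, and none of this behavior is confirmed by the plan.

```javascript
// Status values taken from the list above.
const AGENT_STATUSES = ["queued", "running", "completed", "failed", "dead", "restarted"];

// Assumption: only a running agent can have an expired lease.
function isLeaseExpired(agent, now = new Date()) {
  if (!AGENT_STATUSES.includes(agent.status)) {
    throw new Error(`unknown status: ${agent.status}`);
  }
  if (agent.status !== "running") return false;
  return new Date(agent.lease_expires_at).getTime() <= now.getTime();
}

// Assumption: each heartbeat pushes the lease forward by `leaseMinutes`.
function heartbeat(agent, now = new Date(), leaseMinutes = 30) {
  return {
    ...agent,
    heartbeat_at: now.toISOString(),
    lease_expires_at: new Date(now.getTime() + leaseMinutes * 60000).toISOString(),
  };
}

// Hypothetical agent entry.
const agent = {
  role: "opus-implementation",
  status: "running",
  started_at: "2025-01-01T10:00:00Z",
  heartbeat_at: "2025-01-01T10:05:00Z",
  lease_expires_at: "2025-01-01T10:30:00Z",
  last_error: null,
};

console.log(isLeaseExpired(agent, new Date("2025-01-01T10:15:00Z"))); // false
console.log(isLeaseExpired(agent, new Date("2025-01-01T10:45:00Z"))); // true
```

A manager could mark an agent whose lease has expired as `dead` and then apply the restart policy.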
## Status And Progress Tail

Use the local status command from the workspace root:

```sh
node .agents/scripts/status.mjs --tail 40
```

For a browser dashboard:

```sh
node .agents/scripts/observe.mjs --port 4317
```

Then open `http://127.0.0.1:4317`.

The status command reads:

- `.agents/checkpoint.json`
- every `.agents/runs/<run-id>/state.json`
- every `.agents/runs/<run-id>/agents.json`

It prints:

- all known runs
- current phase and branch
- all recorded agents and their lease status
- expired leases
- blockers
- the next action
- the tail of the active run's progress file

Progress files are checked in this order:

- `progress.log`
- `implementation-log.md`
- `validation.md`
- `opus-proposal.md`
- `snarky-review.md`

Managers should append progress events to `progress.log` whenever possible.
Human-readable phase artifacts still stay in their named markdown files.

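The progress-file lookup described above can be sketched as two pure functions. This is not the code of `status.mjs`; it assumes that the first file in the list that exists in the run directory is the one that gets tailed, and `pickProgressFile` and `tailLines` are hypothetical names.

```javascript
// Lookup order taken from the list above.
const PROGRESS_FILE_ORDER = [
  "progress.log",
  "implementation-log.md",
  "validation.md",
  "opus-proposal.md",
  "snarky-review.md",
];

// Returns the first progress file present in a run directory listing, or null.
function pickProgressFile(fileNames) {
  const present = new Set(fileNames);
  return PROGRESS_FILE_ORDER.find((name) => present.has(name)) ?? null;
}

// Returns the last `count` lines of a text blob, ignoring one trailing newline.
function tailLines(text, count) {
  if (count <= 0 || text === "") return [];
  return text.replace(/\r?\n$/, "").split(/\r?\n/).slice(-count);
}

// Hypothetical directory listing and log content.
console.log(pickProgressFile(["validation.md", "brief.md", "implementation-log.md"]));
// implementation-log.md
console.log(tailLines("phase started\nproposal written\nvalidation passed\n", 2));
// [ 'proposal written', 'validation passed' ]
```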
## Agent Restart Policy

If an agent dies, the manager restarts the role from durable files, not memory.

Snarky restart:

- Read `PLAN.md`, `.agents/ORCHESTRATION.md`, checkpoint files, and active run
  artifacts.
- Re-run the current Snarky phase using the same input artifact.
- Replace only the expected Snarky output for that phase.

Opus proposal/review restart:

- Re-run the same `opus-runner` planning tool with the same input artifact.
- Keep any previous failed output as diagnostic context.
- Do not advance until the expected output validates.
- Use a lease and CLI/tool timeout of at least 30 minutes.

Opus implementation restart:

- Check the current branch.
- Check the dirty state.
- If the branch is clean, retry the same implementation instruction.
- If the branch is dirty, the manager must inspect the diff and decide whether
  to continue, ask Opus to repair, or ask the user.
- Never auto-reset or discard partial Opus edits.
- Use a lease and CLI/tool timeout of at least 30 minutes.

Retry limits:

- Retry a dead planning phase up to 2 times.
- Retry an implementation phase up to 1 time without user input.
- After the retry cap, record a blocker in checkpoint files.

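The retry limits and the dirty-branch rule above can be read as a small decision function. The sketch below is one possible interpretation: `decideRestart`, the action names, and the choice to check the retry cap before the dirty-branch rule are all assumptions.

```javascript
// Retry caps taken from the "Retry limits" list above.
const RETRY_CAPS = { planning: 2, implementation: 1 };

// Decides what the manager does after an agent dies.
// `phaseKind` is "planning" or "implementation" (assumed labels).
function decideRestart({ phaseKind, retryCount, branchDirty = false }) {
  const cap = RETRY_CAPS[phaseKind];
  if (cap === undefined) throw new Error(`unknown phase kind: ${phaseKind}`);

  // After the retry cap, record a blocker instead of retrying.
  if (retryCount >= cap) {
    return { action: "record-blocker", reason: `retry cap of ${cap} reached` };
  }
  // A dirty branch is never auto-reset; the manager inspects the diff first.
  if (phaseKind === "implementation" && branchDirty) {
    return { action: "manager-inspect-diff", reason: "partial edits present" };
  }
  return { action: "retry", nextRetryCount: retryCount + 1 };
}

console.log(decideRestart({ phaseKind: "planning", retryCount: 1 }).action);
// retry
console.log(decideRestart({ phaseKind: "planning", retryCount: 2 }).action);
// record-blocker
console.log(decideRestart({ phaseKind: "implementation", retryCount: 0, branchDirty: true }).action);
// manager-inspect-diff
```

Returning a decision rather than acting on it keeps the destructive choices (continue, repair, or ask the user) with the manager.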
## Checkpoint Policy

Update checkpoints:

- at the start of a run
- after Snarky brief
- after Opus proposal
- after concern resolution
- before Opus implementation
- after implementation
- after validation
- before ending the turn

Each checkpoint must include:

- timestamp
- current phase
- current branch
- active run id
- completed artifacts
- next action
- blockers
- commands/checks already run

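One way to keep every checkpoint complete is to route all updates through a single function that stamps the time and rejects missing fields. This is a sketch under assumed key names (`active_run_id`, `commands_run`, and so on) and assumed phase names; the actual layout of `.agents/checkpoint.json` is not specified here.

```javascript
// Assumed snake_case keys for the required checkpoint fields listed above.
const CHECKPOINT_KEYS = [
  "timestamp",
  "current_phase",
  "current_branch",
  "active_run_id",
  "completed_artifacts",
  "next_action",
  "blockers",
  "commands_run",
];

// Merges changes into the previous checkpoint and stamps the time.
// Throws if a required field would be missing from the result.
function updateCheckpoint(previous, changes, now = new Date()) {
  const next = {
    completed_artifacts: [],
    blockers: [],
    commands_run: [],
    ...previous,
    ...changes,
    timestamp: now.toISOString(),
  };
  const missing = CHECKPOINT_KEYS.filter((key) => next[key] === undefined);
  if (missing.length > 0) {
    throw new Error(`checkpoint missing: ${missing.join(", ")}`);
  }
  return next;
}

// Hypothetical sequence: start of run, then after the Snarky brief.
const started = updateCheckpoint(
  {},
  {
    current_phase: "run-start",
    current_branch: "feature/example",
    active_run_id: "example-run",
    next_action: "Write the Snarky brief",
  },
  new Date("2025-01-01T10:00:00Z")
);
const afterBrief = updateCheckpoint(
  started,
  {
    current_phase: "snarky-brief",
    completed_artifacts: [...started.completed_artifacts, "brief.md"],
    next_action: "Request the Opus proposal",
  },
  new Date("2025-01-01T10:20:00Z")
);
console.log(afterBrief.current_phase, afterBrief.completed_artifacts.length);
```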
## Validation Layers

Manager validation:

- git branch and diff check
- out-of-scope file check
- syntax checks
- unit/integration tests when available
- browser/screenshot checks for UI work when available

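The out-of-scope file check in the list above could compare changed files against the run's allowed write paths. The matching rule in this sketch is an assumption: an allowed path ending in `/` is treated as a directory prefix, and anything else must match exactly. `findOutOfScopeFiles` is a hypothetical name.

```javascript
// Returns the changed files that fall outside the allowed write paths.
// The changed-file list might come from `git diff --name-only`; here it is
// passed in so the function stays pure.
function findOutOfScopeFiles(changedFiles, allowedWritePaths) {
  const isAllowed = (file) =>
    allowedWritePaths.some((allowed) =>
      allowed.endsWith("/") ? file.startsWith(allowed) : file === allowed
    );
  return changedFiles.filter((file) => !isAllowed(file));
}

// Hypothetical diff and allow-list.
const changed = ["src/app.js", "docs/guide.md", "package.json", "src-old/legacy.js"];
console.log(findOutOfScopeFiles(changed, ["src/", "package.json"]));
// [ 'docs/guide.md', 'src-old/legacy.js' ]
```

Requiring the trailing `/` on directory entries avoids accidentally allowing sibling directories such as `src-old/` when only `src/` is permitted.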
Snarky validation:

- product requirements
- acceptance criteria
- install flow accuracy
- scope discipline

Opus validation:

- layout/design quality
- interaction clarity
- technical design concerns

## Resume Procedure

After power loss or interruption:

1. Read `.agents/CHECKPOINT.md`.
2. Read `.agents/checkpoint.json`.
3. Check git branch and dirty state.
4. Read the active run directory if present.
5. Continue from `next_action`.
6. Do not assume an Opus implementation completed unless validation is recorded.

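The resume steps above amount to comparing the checkpoint with the live git state before acting on `next_action`. The sketch below makes several assumptions: the checkpoint key names, the `opus-implementation` phase name, and the idea that "validation is recorded" means `validation.md` appears among the completed artifacts. `planResume` is a hypothetical name.

```javascript
// Combines checkpoint data with the observed git state into a resume plan.
function planResume({ checkpoint, gitBranch, gitDirty }) {
  const warnings = [];
  if (checkpoint.current_branch !== gitBranch) {
    warnings.push(
      `branch mismatch: checkpoint has ${checkpoint.current_branch}, git has ${gitBranch}`
    );
  }
  if (gitDirty) {
    warnings.push("working tree is dirty; inspect the diff before continuing");
  }
  // Assumption: validation counts as recorded when validation.md is listed.
  const validated = (checkpoint.completed_artifacts ?? []).includes("validation.md");
  const implementationUnverified =
    checkpoint.current_phase === "opus-implementation" && !validated;
  if (implementationUnverified) {
    warnings.push("implementation has no recorded validation; treat it as incomplete");
  }
  return { nextAction: checkpoint.next_action, implementationUnverified, warnings };
}

// Hypothetical interrupted run.
const plan = planResume({
  checkpoint: {
    current_phase: "opus-implementation",
    current_branch: "feature/example",
    completed_artifacts: ["brief.md", "opus-proposal.md"],
    next_action: "Validate the implementation",
  },
  gitBranch: "main",
  gitDirty: true,
});
console.log(plan.nextAction);
console.log(plan.warnings.length); // 3
```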
## Current Known Setup Issue

The distribution workspace is not currently a git repository. Branch-based Opus
implementation requires initializing git or moving these files into a repo with
a clean baseline commit.