A cancelled async worker now abandons its body at its next suspend instead
of running to completion.
- Cancel-flag back-ref (D4): SpawnOpts.cancel_flag (core.sx) + Fiber.cancel_flag
(sched.sx), set from opts.cancel_flag in Scheduler.spawn_raw; async passes
xx @f.canceled (the Future.canceled Atomic(bool) erased to *void).
- Delivery: Scheduler.suspend_raw consults fiber_canceled(self.current) PRE-park
(raise without parking — no deadlock if cancel landed before the worker ran)
and POST-resume (cancel landed while parked), raising error.Canceled.
cancel(f) flips the sticky flag, marks .canceled, and wakes the worker.
- async worker is failable Closure() -> ($R, !); the completion closure
f.value = worker() catch {…} marks .canceled/.failed and wakes the awaiter,
so post-suspend side effects never run. New failable io.sleep(ms) is the
cancellation point.
- Compiler: a -> ! fn whose only error source is try-ing a protocol method
(io.suspend_raw) was wrongly flagged 'declared ! but never errors';
collectErrorSites now marks a try of a non-identifier callee as a dynamic
(opaque) error source, suppressing the warning.
- Two UAFs found by adversarial review and fixed: (1) cancel-before-park
orphaned io.sleep's armed timer — suspend_raw's pre-park raise now evicts the
current fiber's timer/waiter first; (2) cancel(f) could wake a reaped worker —
now only wakes when was_pending.
Migrated 1805/1806/1824 to failable workers. Lock: example 1825 (seq: 1 -99,
post-suspend line never runs); byte-identical on aarch64-macOS + aarch64-linux.
.ir churn is the SpawnOpts layout change (type-table string renumbering).
200 lines
15 KiB
Markdown
200 lines
15 KiB
Markdown
# PLAN-IO-UNIFY — fold the fiber scheduler behind `context.io`, re-home `race`
|
||
|
||
## Why
|
||
Today there are **two parallel async stacks**:
|
||
|
||
| stack | behind `context.io`? | real suspension? | cancellation channel |
|
||
|---|---|---|---|
|
||
| io.sx `async`/`await`/`cancel`/`Future` | yes (`impl Io for CBlockingIo`) | **no** — runs the worker inline to completion | `suspend_raw -> !` / `IoErr.Canceled` (designed, unused) |
|
||
| sched.sx `go`/`wait`/`cancel`/`race` (just landed) | **no** | yes (`swap_context` fibers) | none — `suspend_self -> void` |
|
||
|
||
`context.io` is structurally Zig's `std.Io` (an `Io` protocol carried *implicitly* in `Context` — better
|
||
ergonomics than Zig's explicit `io:` param), and the roadmap (§A5, §4.6) already says the fiber
|
||
scheduler should be **one of its `Io` vtables** and that `race` is **`context.io.race(..)` over Futures**.
|
||
The just-landed `race` on `sched.Scheduler` over `*Task` is the proven LOGIC at the wrong LAYER.
|
||
|
||
**Goal:** make the fiber `Scheduler` an `impl Io`, lift `async`/`await`/`cancel`/`race` onto the `Io`
|
||
protocol so they run colorblind under either impl, and let cancellation fall out of the existing
|
||
`suspend_raw -> !` contract (the "true cancellation, model A" the user picked — already the interface's
|
||
design). One async stack, behind `context.io`.
|
||
|
||
## The fiber → `Io` mapping (the crux)
|
||
`Io :: protocol { spawn_raw, suspend_raw -> !, ready, poll, now_ms, arm_timer }` (core.sx). Map each onto
|
||
the existing fiber primitives in sched.sx (`spawn`/`suspend_self`/`wake`/`sleep`/`block_on_fd`/`run`):
|
||
|
||
| `Io` method | fiber realization |
|
||
|---|---|
|
||
| `spawn_raw(entry, arg, opts) -> *void` | `spawn` a fiber whose body invokes `entry(arg)` (raw C-ABI thunk, not a closure — see Bridge below). Returns the `*Fiber` as the opaque handle. |
|
||
| `suspend_raw(park) -> !` | `suspend_self()`, then on resume CHECK the current task's cancel flag and `raise IoErr.Canceled` if set. `park.handle` = the `*Fiber` to re-ready. **This is the cancellation delivery point.** |
|
||
| `ready(park)` | `wake(park.handle as *Fiber)` (already guarded on `.suspended`). |
|
||
| `arm_timer(deadline_ms, park) -> *void` | arm a `Timer{deadline, fiber=park.handle}` (today's `sleep` minus the self-suspend); return the timer handle so a cancel can evict it. |
|
||
| `poll(deadline_ms) -> i64` | ONE iteration of the `run` loop: drain ready, then fire the earliest timer / block on fds up to `deadline_ms`. Returns the next pending deadline (or sentinel when idle). |
|
||
| `now_ms() -> i64` | the virtual `clock_ms` (deterministic), NOT a wall clock — keeps 1817/1821-style tests reproducible. |
|
||
|
||
`Scheduler.run()` stays as the explicit DRIVER (the top-level loop that calls `poll` to quiescence),
|
||
installed via `push Context { io = xx scheduler } { … s.run(); }` — exactly the existing sched examples,
|
||
just with the scheduler now reachable as `context.io`.
|
||
|
||
## Status (2026-06-28)
|
||
- **Phase 3 — TRUE cancellation via `suspend_raw -> !`. DONE.** A cancelled async
|
||
worker now abandons its body at its next suspend instead of running to
|
||
completion. Pieces:
|
||
- **Cancel-flag back-ref (D4 — back-ref pointer, chosen):** `SpawnOpts.cancel_flag:
|
||
*void` (core.sx) + `Fiber.cancel_flag: *void` (sched.sx), set from
|
||
`opts.cancel_flag` in `Scheduler.spawn_raw`. `async` passes `xx @f.canceled`
|
||
(the `Future.canceled` `Atomic(bool)` erased to `*void`).
|
||
- **Delivery:** `Scheduler.suspend_raw` checks `fiber_canceled(self.current)` (a
|
||
`*Atomic(bool)` load) PRE-park (raise without parking — no deadlock if cancel
|
||
landed before the worker ran) and POST-resume (cancel landed while parked),
|
||
raising `error.Canceled` (a bare `-> !`; set inferred). `cancel(f)` flips the
|
||
sticky flag, marks `.canceled`, and `ready(.{handle=f.task})`s the worker.
|
||
- **Worker is failable** `Closure() -> ($R, !)`: the `async` completion closure
|
||
`f.value = worker() catch { … }` (the captured-failable-closure-call the
|
||
Phase-3-prereq fix enabled) marks `.canceled`/`.failed` and wakes the awaiter;
|
||
the worker's post-suspend side effects never run. New failable `io.sleep(ms)`
|
||
(arm_timer + `try suspend_raw`) is the cancellation point.
|
||
- **Compiler gap fixed:** a `-> !` fn whose only error source is `try`-ing a
|
||
protocol method (`io.suspend_raw`) was wrongly flagged "declared `!` but never
|
||
errors". `collectErrorSites` (error_analysis.zig) now sets a `dyn` flag for a
|
||
`try` of a non-identifier callee (opaque error channel), suppressing the
|
||
warning.
|
||
- **Two UAFs found by adversarial review and FIXED:** (1) cancel-before-park
|
||
orphaned `io.sleep`'s armed timer → `suspend_raw`'s pre-park raise now evicts
|
||
the current fiber's timer/waiter first. (2) `cancel(f)` woke a possibly-reaped
|
||
worker → now only wakes when `was_pending` (`.pending` before the store).
|
||
- Migrated 1805/1806/1824 to failable workers. Lock:
|
||
`examples/concurrency/1825-concurrency-fiber-cancel-suspend.sx` (`seq: 1 -99`
|
||
— post-suspend line never runs). **Validated byte-identical on aarch64-macOS
|
||
host AND aarch64-linux container** (1824 + 1825). Suite 853/0. Expected `.ir`
|
||
churn (SpawnOpts layout) regenerated; no non-`.ir` snapshot changed.
|
||
|
||
|
||
- **Phase 3 PREREQUISITE — captured-failable-closure call typing. DONE.** The
|
||
async completion closure (`b.run = () => { f.value = worker() catch {…} }`)
|
||
captures a failable `worker` and consumes its error channel; the free-variable
|
||
capture analysis (`collectCaptures` in `src/ir/lower/closure.zig`) did not
|
||
descend into the error-handling / context / asm / multi-assign nodes, so
|
||
`worker` was never captured — inside the lambda it resolved against an empty
|
||
scope and typed as `.unresolved` (`catch`/`try` then rejected it). Fixed: added
|
||
`try_expr`, `catch_expr`, `onfail_stmt`, `raise_stmt`, `multi_assign`,
|
||
`push_stmt`, `comptime_expr`, `insert_expr`, `spread_expr`, `asm_expr` arms to
|
||
`collectCaptures`. Adversarially reviewed (captures resolve, locals correctly
|
||
excluded, no false-positive captures, 851/0). Lock: example
|
||
`examples/closures/0314-closures-capture-failable-call.sx` (catch + try over a
|
||
captured failable closure; pure language feature, host-only). The `push_stmt`
|
||
arm also fixes the previously-noted "free-var analysis doesn't descend into a
|
||
nested `push Context {…}`" gap. **Phase 3 is now unblocked.**
|
||
- Two PRE-EXISTING, orthogonal bugs surfaced during review (neither blocked
|
||
Phase 3): (1) calling a closure stored in a **struct data field** typed as
|
||
`unresolved` (value → garbage; failable → can't `catch`) — **RESOLVED**
|
||
(`issues/0201`): `CallResolver.plan` gained a closure/fn-pointer field arm and
|
||
the lowering closure-field arm now also handles bare `.function` fields;
|
||
regression `examples/closures/0315-closures-struct-field-call.sx`. (2) asm
|
||
write-through place through a deref (`asm { … "+r" -> @(p.*) }`) fails LLVM
|
||
verification — repros with NO closure (independent of capture analysis);
|
||
possibly an unsupported deref-place form rather than a confirmed bug, not
|
||
filed.
|
||
|
||
## Status (2026-06-27)
|
||
- **Phase 0 — fibers inherit the spawn-time context. DONE** (`2f2d7f1d`). Discovered during Phase 1: a
|
||
fiber body ran under `__sx_default_context` (the `abi(.c)` `fib_dispatch` dropped the implicit
|
||
context), so a scheduler installed as `context.io` was invisible inside a worker. Fixed:
|
||
`Scheduler.spawn` snapshots `context` → `Fiber.dctx`; `fib_dispatch` re-pushes it. Behavior-preserving
|
||
(suite 828/0), no cross-fiber leak (context is parameter-threaded per stack). Lock: example 1822.
|
||
- **Phase 1 — `impl Io for Scheduler`. DONE** (`5c30bfe0`, hardened `da7dd1f1`). Six methods over the
|
||
fiber primitives; `spawn_raw` bridges the erased `(*void)->void` worker thunk via an fn-ptr round-trip.
|
||
Lock: example 1823 (spawn→arm→suspend→ready→resume entirely through `context.io`, deterministic).
|
||
Adversarial review fixed: `arm_timer`/`spawn_raw` null guards, `poll` fd-pending abort + `deadline_ms`
|
||
doc, stale `fib_dispatch` comment.
|
||
- **Resolved design decisions:** D1 = direct `impl Io for Scheduler` (chosen). D2 = `now_ms` returns the
|
||
virtual `clock_ms` (deterministic) — a real-clock variant is later. D4 = deferred to Phase 3.
|
||
- **Phase 2 — `async`/`await` colorblind over the fiber Io. DONE** (`967aed67`, hardened `ada8d162`).
|
||
`async` heap-allocs a `*Future`, boxes a completion closure in a monomorphic `ThunkBox`, and submits
|
||
via `io.spawn_raw` (inline under `CBlockingIo`, a fiber under the scheduler); `await` parks via
|
||
`suspend_raw` until ready. Protocol changed to `suspend_raw(park: *ParkToken)` (write-back of the
|
||
awaiter). Workers are nullary (call-site capture). Migrated 1805/1806; adopted `push .{ … }`. Lock:
|
||
example 1824 (deferral visible: `1 2 10 20 123`). Review fixed: one-awaiter `await` guard; documented
|
||
the Future allocator-lifetime contract + that `cancel` doesn't stop an already-spawned worker (Phase 3).
|
||
- **Resolved D2 (ParkToken):** `suspend_raw(*ParkToken)` write-back (chosen over a registry). **ready()
|
||
liveness (CONCERN 6):** safe for single async/await (awaiter is suspended, not reaped, when readied);
|
||
`race` fan-in must still deregister (Phase 4).
|
||
- **Carried to convergence:** `async` should capture the scheduler's long-lived allocator (like
|
||
`sched.go`'s `own_allocator`) instead of the call-site `context.allocator` — needs a protocol
|
||
affordance; documented as a contract for now.
|
||
- **Open for later phases:**
|
||
- **ParkToken↔fiber binding.** `ready(park)` needs `park.handle` = the awaiter `*Fiber`. The scheduler
|
||
knows `self.current` at suspend; the cleanest is `suspend_raw(park: *ParkToken)` writing
|
||
`park.handle = self.current` before parking (a small protocol change: the materializer installs
|
||
thunks by name/order, signature-agnostic — verified low-risk). Decide vs a token→fiber registry.
|
||
- **`ready()` liveness (review CONCERN 6).** Casting a stale/reaped `*Fiber` handle and `wake`-ing it is
|
||
a latent UAF once real `await` runs — `wake`'s `.suspended` value-check on freed bytes is luck, not
|
||
safety. Phase 2 must guarantee single-ready / deregistration (mirror the bespoke-race deregister).
|
||
- **Out-of-scope compiler bug found by review (not filed yet):** closure free-var analysis does not
|
||
descend into a nested `push Context {…}` block inside a closure body — a var used only there reports
|
||
`unresolved`. Phase 0 sidesteps it (capture is at the `Fiber` level, not via closure), so it does NOT
|
||
block the unification; worth an `issues/` entry in a separate session.
|
||
|
||
## Phases (each: implement → lock with an example → `zig build test` green → both platforms)
|
||
|
||
1. **`impl Io for Scheduler` (the vehicle).** Implement the six methods over the fiber primitives. Add
|
||
a `Fiber.canceled`/task back-ref so `suspend_raw` can raise on resume. Keep `CBlockingIo` intact.
|
||
Lock: install the fiber Io into `context.io`, run a root fiber that `suspend_raw`s and is `ready()`'d —
|
||
asserts real park/resume through the protocol (not inline). **Bridge** (the one fiddly bit): `async`'s
|
||
generic `Closure(..$args) -> $R` worker → `spawn_raw`'s raw `entry/arg`. Box the worker thunk on the
|
||
heap; `entry` is a C-ABI `(env: *void) -> void` invoke-thunk (mirrors `fib_dispatch`), `arg` is the env.
|
||
|
||
2. **`async`/`await` over the fiber Io (real interleaving).** Under a suspending Io, `async` calls
|
||
`spawn_raw` and returns a PENDING `Future($R)` (no longer born `.ready`); the spawned body fills
|
||
`f.value`/`f.state` and `ready(f.park)`s the awaiter. `await(f)` checks `.ready` else `suspend_raw(f.park)`
|
||
then returns/raises — the suspending sibling of today's immediate `await`. `CBlockingIo` keeps the
|
||
run-inline path (degenerate, still correct). Lock: two `context.io.async` tasks interleave under the
|
||
fiber Io (the io.sx layer, replacing the bespoke `sched.go`).
|
||
|
||
3. **True cancellation via `suspend_raw -> !`.** `cancel(f)` flips `f.canceled` AND `ready(f.park)`s /
|
||
wakes the worker fiber so its NEXT `suspend_raw` raises `IoErr.Canceled`. The worker's suspends
|
||
(`await`, a future `io.sleep`) propagate via `try`/`!`; the worker body unwinds, the future ends
|
||
`.canceled`, its post-cancel side-effects DON'T run. This is the model-A "true cancellation" — now
|
||
delivered through the protocol, not bespoke. Lock: a cancelled task's work stops at its next suspend
|
||
(assert via a shared log: the post-suspend line never prints).
|
||
|
||
4. **`race` over Futures — `context.io.race((a: fa, b: fb))`.** Re-home the proven race logic (winner
|
||
scan, deregister-all-on-wake, structured cancel+join of losers) from `sched.race(*Task tuple)` onto
|
||
`*Future` handles + the `Io` protocol. The type-level machinery ports UNCHANGED — `RaceResult($T)`,
|
||
`make_variant`, the tuple reflection (GAP 1/2, all landed) — only the runtime swaps `*Task`→`*Future`
|
||
and `suspend_self`→`suspend_raw`/`ready`. Cancellation of losers now uses Phase 3 (their next suspend
|
||
raises), so `race` returns at WINNER-time, not slowest-loser-time. Lock: re-point 1821 at
|
||
`context.io.race`; assert winner value + losers' work stopped (not merely flagged).
|
||
|
||
5. **Converge — retire the bespoke fiber async API.** Fold `sched.go`/`wait`/`cancel`/`race` into the
|
||
io.sx layer; `Scheduler` stays as the fiber Io's engine + driver. Migrate 1811–1821 to the
|
||
`context.io` API. One async stack, all behind the protocol. Update the roadmap/checkpoints.
|
||
|
||
## Open decisions (need a call before/within the phase noted)
|
||
- **D1 (Phase 1) — `impl Io for Scheduler` vs a `FiberIo` wrapper.** Direct impl makes `context.io` BE the
|
||
scheduler (`xx scheduler` as the Io value, stateful receiver — mirrors the allocator `xx local` rule).
|
||
A wrapper adds a level but decouples the public Io vtable from the scheduler internals. *Lean: direct
|
||
impl* (simplest, matches the allocator convention).
|
||
- **D2 (Phase 1) — virtual vs real clock under the fiber Io.** Tests need the deterministic virtual clock
|
||
(`clock_ms`); a real deployment wants `time.mono_ms`. Thread it as a Scheduler mode, or two Io impls
|
||
(`FiberIo` virtual-clock for tests, real-clock for prod). *Lean: a `clock: enum { virtual; real }` field
|
||
so one impl serves both; tests pin `.virtual`.*
|
||
- **D3 (Phase 2) — `Future(void)` (issue 0150 SIGTRAP).** A `void`-result task can't build `Future(void)`
|
||
today. Defer (race/async target non-void), or fix the `void` struct-field path. *Lean: defer, gate with
|
||
a diagnostic.*
|
||
- **D4 (Phase 3) — where the cancel flag lives.** The `Future` already has `canceled: Atomic(bool)`; the
|
||
fiber needs to reach it from `suspend_raw`. Give `Fiber` a `*Atomic(bool)` back-ref to its future's flag
|
||
(set at `spawn_raw`), so `suspend_raw` consults it with no per-suspend lookup. *Lean: back-ref pointer.*
|
||
|
||
## Validation (every phase)
|
||
- `zig build && zig build test` green (full corpus).
|
||
- New/changed `18xx` examples byte-identical on aarch64-macOS host AND aarch64-linux container
|
||
(deterministic virtual clock).
|
||
- Adversarial review of each phase (worker + read-only reviewer), per the session workflow.
|
||
|
||
## What this supersedes
|
||
- `sched.sx`'s bespoke `go`/`wait`/`cancel`/`race` (Phase 5 retires them; the proven logic moves onto the
|
||
protocol). The just-landed `race` (commit `9099735e`) is the reference logic for Phase 4, not the final
|
||
home.
|
||
- PLAN-RACE.md's "race on `sched.Scheduler`" framing — this plan moves it onto `context.io` per the
|
||
roadmap's §A5 / §4.6 design-of-record.
|