# PLAN-IO-UNIFY — fold the fiber scheduler behind `context.io`, re-home `race` ## Why Today there are **two parallel async stacks**: | stack | behind `context.io`? | real suspension? | cancellation channel | |---|---|---|---| | io.sx `async`/`await`/`cancel`/`Future` | yes (`impl Io for CBlockingIo`) | **no** — runs the worker inline to completion | `suspend_raw -> !` / `IoErr.Canceled` (designed, unused) | | sched.sx `go`/`wait`/`cancel`/`race` (just landed) | **no** | yes (`swap_context` fibers) | none — `suspend_self -> void` | `context.io` is structurally Zig's `std.Io` (an `Io` protocol carried *implicitly* in `Context` — better ergonomics than Zig's explicit `io:` param), and the roadmap (§A5, §4.6) already says the fiber scheduler should be **one of its `Io` vtables** and that `race` is **`context.io.race(..)` over Futures**. The just-landed `race` on `sched.Scheduler` over `*Task` is the proven LOGIC at the wrong LAYER. **Goal:** make the fiber `Scheduler` an `impl Io`, lift `async`/`await`/`cancel`/`race` onto the `Io` protocol so they run colorblind under either impl, and let cancellation fall out of the existing `suspend_raw -> !` contract (the "true cancellation, model A" the user picked — already the interface's design). One async stack, behind `context.io`. ## The fiber → `Io` mapping (the crux) `Io :: protocol { spawn_raw, suspend_raw -> !, ready, poll, now_ms, arm_timer }` (core.sx). Map each onto the existing fiber primitives in sched.sx (`spawn`/`suspend_self`/`wake`/`sleep`/`block_on_fd`/`run`): | `Io` method | fiber realization | |---|---| | `spawn_raw(entry, arg, opts) -> *void` | `spawn` a fiber whose body invokes `entry(arg)` (raw C-ABI thunk, not a closure — see Bridge below). Returns the `*Fiber` as the opaque handle. | | `suspend_raw(park) -> !` | `suspend_self()`, then on resume CHECK the current task's cancel flag and `raise IoErr.Canceled` if set. `park.handle` = the `*Fiber` to re-ready. **This is the cancellation delivery point.** | | `ready(park)` | `wake(park.handle as *Fiber)` (already guarded on `.suspended`). | | `arm_timer(deadline_ms, park) -> *void` | arm a `Timer{deadline, fiber=park.handle}` (today's `sleep` minus the self-suspend); return the timer handle so a cancel can evict it. | | `poll(deadline_ms) -> i64` | ONE iteration of the `run` loop: drain ready, then fire the earliest timer / block on fds up to `deadline_ms`. Returns the next pending deadline (or sentinel when idle). | | `now_ms() -> i64` | the virtual `clock_ms` (deterministic), NOT a wall clock — keeps 1817/1821-style tests reproducible. | `Scheduler.run()` stays as the explicit DRIVER (the top-level loop that calls `poll` to quiescence), installed via `push Context { io = xx scheduler } { … s.run(); }` — exactly the existing sched examples, just with the scheduler now reachable as `context.io`. ## Status (2026-06-28) - **Phase 3 — TRUE cancellation via `suspend_raw -> !`. DONE.** A cancelled async worker now abandons its body at its next suspend instead of running to completion. Pieces: - **Cancel-flag back-ref (D4 — back-ref pointer, chosen):** `SpawnOpts.cancel_flag: *void` (core.sx) + `Fiber.cancel_flag: *void` (sched.sx), set from `opts.cancel_flag` in `Scheduler.spawn_raw`. `async` passes `xx @f.canceled` (the `Future.canceled` `Atomic(bool)` erased to `*void`). - **Delivery:** `Scheduler.suspend_raw` checks `fiber_canceled(self.current)` (a `*Atomic(bool)` load) PRE-park (raise without parking — no deadlock if cancel landed before the worker ran) and POST-resume (cancel landed while parked), raising `error.Canceled` (a bare `-> !`; set inferred). `cancel(f)` flips the sticky flag, marks `.canceled`, and `ready(.{handle=f.task})`s the worker. - **Worker is failable** `Closure() -> ($R, !)`: the `async` completion closure `f.value = worker() catch { … }` (the captured-failable-closure-call the Phase-3-prereq fix enabled) marks `.canceled`/`.failed` and wakes the awaiter; the worker's post-suspend side effects never run. New failable `io.sleep(ms)` (arm_timer + `try suspend_raw`) is the cancellation point. - **Compiler gap fixed:** a `-> !` fn whose only error source is `try`-ing a protocol method (`io.suspend_raw`) was wrongly flagged "declared `!` but never errors". `collectErrorSites` (error_analysis.zig) now sets a `dyn` flag for a `try` of a non-identifier callee (opaque error channel), suppressing the warning. - **Two UAFs found by adversarial review and FIXED:** (1) cancel-before-park orphaned `io.sleep`'s armed timer → `suspend_raw`'s pre-park raise now evicts the current fiber's timer/waiter first. (2) `cancel(f)` woke a possibly-reaped worker → now only wakes when `was_pending` (`.pending` before the store). - Migrated 1805/1806/1824 to failable workers. Lock: `examples/concurrency/1825-concurrency-fiber-cancel-suspend.sx` (`seq: 1 -99` — post-suspend line never runs). **Validated byte-identical on aarch64-macOS host AND aarch64-linux container** (1824 + 1825). Suite 853/0. Expected `.ir` churn (SpawnOpts layout) regenerated; no non-`.ir` snapshot changed. - **Phase 3 PREREQUISITE — captured-failable-closure call typing. DONE.** The async completion closure (`b.run = () => { f.value = worker() catch {…} }`) captures a failable `worker` and consumes its error channel; the free-variable capture analysis (`collectCaptures` in `src/ir/lower/closure.zig`) did not descend into the error-handling / context / asm / multi-assign nodes, so `worker` was never captured — inside the lambda it resolved against an empty scope and typed as `.unresolved` (`catch`/`try` then rejected it). Fixed: added `try_expr`, `catch_expr`, `onfail_stmt`, `raise_stmt`, `multi_assign`, `push_stmt`, `comptime_expr`, `insert_expr`, `spread_expr`, `asm_expr` arms to `collectCaptures`. Adversarially reviewed (captures resolve, locals correctly excluded, no false-positive captures, 851/0). Lock: example `examples/closures/0314-closures-capture-failable-call.sx` (catch + try over a captured failable closure; pure language feature, host-only). The `push_stmt` arm also fixes the previously-noted "free-var analysis doesn't descend into a nested `push Context {…}`" gap. **Phase 3 is now unblocked.** - Two PRE-EXISTING, orthogonal bugs surfaced during review (neither blocked Phase 3): (1) calling a closure stored in a **struct data field** typed as `unresolved` (value → garbage; failable → can't `catch`) — **RESOLVED** (`issues/0201`): `CallResolver.plan` gained a closure/fn-pointer field arm and the lowering closure-field arm now also handles bare `.function` fields; regression `examples/closures/0315-closures-struct-field-call.sx`. (2) asm write-through place through a deref (`asm { … "+r" -> @(p.*) }`) fails LLVM verification — repros with NO closure (independent of capture analysis); possibly an unsupported deref-place form rather than a confirmed bug, not filed. ## Status (2026-06-27) - **Phase 0 — fibers inherit the spawn-time context. DONE** (`2f2d7f1d`). Discovered during Phase 1: a fiber body ran under `__sx_default_context` (the `abi(.c)` `fib_dispatch` dropped the implicit context), so a scheduler installed as `context.io` was invisible inside a worker. Fixed: `Scheduler.spawn` snapshots `context` → `Fiber.dctx`; `fib_dispatch` re-pushes it. Behavior-preserving (suite 828/0), no cross-fiber leak (context is parameter-threaded per stack). Lock: example 1822. - **Phase 1 — `impl Io for Scheduler`. DONE** (`5c30bfe0`, hardened `da7dd1f1`). Six methods over the fiber primitives; `spawn_raw` bridges the erased `(*void)->void` worker thunk via an fn-ptr round-trip. Lock: example 1823 (spawn→arm→suspend→ready→resume entirely through `context.io`, deterministic). Adversarial review fixed: `arm_timer`/`spawn_raw` null guards, `poll` fd-pending abort + `deadline_ms` doc, stale `fib_dispatch` comment. - **Resolved design decisions:** D1 = direct `impl Io for Scheduler` (chosen). D2 = `now_ms` returns the virtual `clock_ms` (deterministic) — a real-clock variant is later. D4 = deferred to Phase 3. - **Phase 2 — `async`/`await` colorblind over the fiber Io. DONE** (`967aed67`, hardened `ada8d162`). `async` heap-allocs a `*Future`, boxes a completion closure in a monomorphic `ThunkBox`, and submits via `io.spawn_raw` (inline under `CBlockingIo`, a fiber under the scheduler); `await` parks via `suspend_raw` until ready. Protocol changed to `suspend_raw(park: *ParkToken)` (write-back of the awaiter). Workers are nullary (call-site capture). Migrated 1805/1806; adopted `push .{ … }`. Lock: example 1824 (deferral visible: `1 2 10 20 123`). Review fixed: one-awaiter `await` guard; documented the Future allocator-lifetime contract + that `cancel` doesn't stop an already-spawned worker (Phase 3). - **Resolved D2 (ParkToken):** `suspend_raw(*ParkToken)` write-back (chosen over a registry). **ready() liveness (CONCERN 6):** safe for single async/await (awaiter is suspended, not reaped, when readied); `race` fan-in must still deregister (Phase 4). - **Carried to convergence:** `async` should capture the scheduler's long-lived allocator (like `sched.go`'s `own_allocator`) instead of the call-site `context.allocator` — needs a protocol affordance; documented as a contract for now. - **Open for later phases:** - **ParkToken↔fiber binding.** `ready(park)` needs `park.handle` = the awaiter `*Fiber`. The scheduler knows `self.current` at suspend; the cleanest is `suspend_raw(park: *ParkToken)` writing `park.handle = self.current` before parking (a small protocol change: the materializer installs thunks by name/order, signature-agnostic — verified low-risk). Decide vs a token→fiber registry. - **`ready()` liveness (review CONCERN 6).** Casting a stale/reaped `*Fiber` handle and `wake`-ing it is a latent UAF once real `await` runs — `wake`'s `.suspended` value-check on freed bytes is luck, not safety. Phase 2 must guarantee single-ready / deregistration (mirror the bespoke-race deregister). - **Out-of-scope compiler bug found by review (not filed yet):** closure free-var analysis does not descend into a nested `push Context {…}` block inside a closure body — a var used only there reports `unresolved`. Phase 0 sidesteps it (capture is at the `Fiber` level, not via closure), so it does NOT block the unification; worth an `issues/` entry in a separate session. ## Phases (each: implement → lock with an example → `zig build test` green → both platforms) 1. **`impl Io for Scheduler` (the vehicle).** Implement the six methods over the fiber primitives. Add a `Fiber.canceled`/task back-ref so `suspend_raw` can raise on resume. Keep `CBlockingIo` intact. Lock: install the fiber Io into `context.io`, run a root fiber that `suspend_raw`s and is `ready()`'d — asserts real park/resume through the protocol (not inline). **Bridge** (the one fiddly bit): `async`'s generic `Closure(..$args) -> $R` worker → `spawn_raw`'s raw `entry/arg`. Box the worker thunk on the heap; `entry` is a C-ABI `(env: *void) -> void` invoke-thunk (mirrors `fib_dispatch`), `arg` is the env. 2. **`async`/`await` over the fiber Io (real interleaving).** Under a suspending Io, `async` calls `spawn_raw` and returns a PENDING `Future($R)` (no longer born `.ready`); the spawned body fills `f.value`/`f.state` and `ready(f.park)`s the awaiter. `await(f)` checks `.ready` else `suspend_raw(f.park)` then returns/raises — the suspending sibling of today's immediate `await`. `CBlockingIo` keeps the run-inline path (degenerate, still correct). Lock: two `context.io.async` tasks interleave under the fiber Io (the io.sx layer, replacing the bespoke `sched.go`). 3. **True cancellation via `suspend_raw -> !`.** `cancel(f)` flips `f.canceled` AND `ready(f.park)`s / wakes the worker fiber so its NEXT `suspend_raw` raises `IoErr.Canceled`. The worker's suspends (`await`, a future `io.sleep`) propagate via `try`/`!`; the worker body unwinds, the future ends `.canceled`, its post-cancel side-effects DON'T run. This is the model-A "true cancellation" — now delivered through the protocol, not bespoke. Lock: a cancelled task's work stops at its next suspend (assert via a shared log: the post-suspend line never prints). 4. **`race` over Futures — `context.io.race((a: fa, b: fb))`.** Re-home the proven race logic (winner scan, deregister-all-on-wake, structured cancel+join of losers) from `sched.race(*Task tuple)` onto `*Future` handles + the `Io` protocol. The type-level machinery ports UNCHANGED — `RaceResult($T)`, `make_variant`, the tuple reflection (GAP 1/2, all landed) — only the runtime swaps `*Task`→`*Future` and `suspend_self`→`suspend_raw`/`ready`. Cancellation of losers now uses Phase 3 (their next suspend raises), so `race` returns at WINNER-time, not slowest-loser-time. Lock: re-point 1821 at `context.io.race`; assert winner value + losers' work stopped (not merely flagged). 5. **Converge — retire the bespoke fiber async API.** Fold `sched.go`/`wait`/`cancel`/`race` into the io.sx layer; `Scheduler` stays as the fiber Io's engine + driver. Migrate 1811–1821 to the `context.io` API. One async stack, all behind the protocol. Update the roadmap/checkpoints. ## Open decisions (need a call before/within the phase noted) - **D1 (Phase 1) — `impl Io for Scheduler` vs a `FiberIo` wrapper.** Direct impl makes `context.io` BE the scheduler (`xx scheduler` as the Io value, stateful receiver — mirrors the allocator `xx local` rule). A wrapper adds a level but decouples the public Io vtable from the scheduler internals. *Lean: direct impl* (simplest, matches the allocator convention). - **D2 (Phase 1) — virtual vs real clock under the fiber Io.** Tests need the deterministic virtual clock (`clock_ms`); a real deployment wants `time.mono_ms`. Thread it as a Scheduler mode, or two Io impls (`FiberIo` virtual-clock for tests, real-clock for prod). *Lean: a `clock: enum { virtual; real }` field so one impl serves both; tests pin `.virtual`.* - **D3 (Phase 2) — `Future(void)` (issue 0150 SIGTRAP).** A `void`-result task can't build `Future(void)` today. Defer (race/async target non-void), or fix the `void` struct-field path. *Lean: defer, gate with a diagnostic.* - **D4 (Phase 3) — where the cancel flag lives.** The `Future` already has `canceled: Atomic(bool)`; the fiber needs to reach it from `suspend_raw`. Give `Fiber` a `*Atomic(bool)` back-ref to its future's flag (set at `spawn_raw`), so `suspend_raw` consults it with no per-suspend lookup. *Lean: back-ref pointer.* ## Validation (every phase) - `zig build && zig build test` green (full corpus). - New/changed `18xx` examples byte-identical on aarch64-macOS host AND aarch64-linux container (deterministic virtual clock). - Adversarial review of each phase (worker + read-only reviewer), per the session workflow. ## What this supersedes - `sched.sx`'s bespoke `go`/`wait`/`cancel`/`race` (Phase 5 retires them; the proven logic moves onto the protocol). The just-landed `race` (commit `9099735e`) is the reference logic for Phase 4, not the final home. - PLAN-RACE.md's "race on `sched.Scheduler`" framing — this plan moves it onto `context.io` per the roadmap's §A5 / §4.6 design-of-record.