PLAN-IO-UNIFY — fold the fiber scheduler behind context.io, re-home race

Why

Today there are two parallel async stacks:

| stack | behind context.io? | real suspension? | cancellation channel |
|---|---|---|---|
| io.sx async/await/cancel/Future | yes (impl Io for CBlockingIo) | no: runs the worker inline to completion | suspend_raw -> ! / IoErr.Canceled (designed, unused) |
| sched.sx go/wait/cancel/race (just landed) | no | yes (swap_context fibers) | none: suspend_self -> void |

context.io is structurally Zig's std.Io: an Io protocol, but carried implicitly in Context instead of Zig's explicit io: parameter, which gives better ergonomics. The roadmap (§A5, §4.6) already says the fiber scheduler should be one of its Io vtables and that race is context.io.race(..) over Futures, so the just-landed race on sched.Scheduler over *Task is the proven LOGIC at the wrong LAYER.

Goal: make the fiber Scheduler implement Io (impl Io for Scheduler), lift async/await/cancel/race onto the Io protocol so they run colorblind (the same call sites work whichever impl is installed), and let cancellation fall out of the existing suspend_raw -> ! contract. That is the "true cancellation, model A" the user picked, and it is already the interface's design. One async stack, behind context.io.
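A minimal sketch of why cancellation "falls out" of suspend_raw -> !, written in Python rather than sx: a generator stands in for a swap_context fiber, `yield` stands in for suspend_self(), and a `Canceled` exception stands in for IoErr.Canceled. All names here (`CancelFlag`, `Fiber`, `suspend_raw`, `step`) are hypothetical stand-ins. The point it shows is that the flag is only consulted when a parked fiber resumes, so once it is set, nothing after the suspend runs.

```python
class Canceled(Exception):
    """Stand-in for IoErr.Canceled."""


class CancelFlag:
    """Stand-in for the Future's canceled: Atomic(bool)."""
    def __init__(self):
        self.canceled = False


class Fiber:
    def __init__(self, body, flag):
        self.flag = flag        # back-ref to the owning future's flag
        self.gen = body(self)   # the fiber body is a generator
        self.done = False


def suspend_raw(fiber):
    """Park; on resume, raise Canceled if the flag was set while parked."""
    yield                       # stand-in for suspend_self()
    if fiber.flag.canceled:
        raise Canceled()


def step(fiber):
    """Resume a fiber until its next suspend or until it finishes."""
    try:
        next(fiber.gen)
    except StopIteration:
        fiber.done = True       # body ran to completion
    except Canceled:
        fiber.done = True       # body unwound; its future would end .canceled


log = []


def worker(fiber):
    log.append("before suspend")
    yield from suspend_raw(fiber)
    log.append("after suspend")  # skipped once canceled


flag = CancelFlag()
f = Fiber(worker, flag)
step(f)                 # runs up to the first suspend
flag.canceled = True    # cancel(f): flip the flag, then ready the fiber
step(f)                 # resumed: suspend_raw raises, the tail never runs
print(log)              # ['before suspend']
```

Without the flag flip, the second `step` would print both lines; the cancel path costs nothing until the next suspend.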

The fiber → Io mapping (the crux)

Io :: protocol { spawn_raw, suspend_raw -> !, ready, poll, now_ms, arm_timer } (core.sx). Map each onto the existing fiber primitives in sched.sx (spawn/suspend_self/wake/sleep/block_on_fd/run):

| Io method | fiber realization |
|---|---|
| spawn_raw(entry, arg, opts) -> *void | Spawn a fiber whose body invokes entry(arg) (a raw C-ABI thunk, not a closure; see Bridge below). Returns the *Fiber as the opaque handle. |
| suspend_raw(park) -> ! | suspend_self(), then on resume CHECK the current task's cancel flag and raise IoErr.Canceled if set. park.handle = the *Fiber to re-ready. This is the cancellation delivery point. |
| ready(park) | wake(park.handle as *Fiber) (already guarded on .suspended). |
| arm_timer(deadline_ms, park) -> *void | Arm a Timer{deadline, fiber=park.handle} (today's sleep minus the self-suspend); return the timer handle so a cancel can evict it. |
| poll(deadline_ms) -> i64 | ONE iteration of the run loop: drain ready, then fire the earliest timer / block on fds up to deadline_ms. Returns the next pending deadline (or a sentinel when idle). |
| now_ms() -> i64 | The virtual clock_ms (deterministic), NOT a wall clock; keeps 1817/1821-style tests reproducible. |
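To make the mapping concrete, here is a Python sketch of a scheduler exposing the six Io methods over generator "fibers". It is an illustration under stated assumptions, not the sx implementation: `FiberIo`, `ParkToken`, `Fiber`, the `IDLE` sentinel and the `(deadline, seq, park)` timer tuple are hypothetical; fd blocking is omitted; and suspend_raw uses the write-back shape (the scheduler records the current fiber in `park.handle`).

```python
import heapq
from collections import deque

IDLE = -1  # hypothetical "nothing pending" sentinel for poll()


class Canceled(Exception):
    """Stand-in for IoErr.Canceled."""


class ParkToken:
    def __init__(self):
        self.handle = None  # written by suspend_raw: the fiber to re-ready


class Fiber:
    def __init__(self, gen):
        self.gen, self.state, self.canceled = gen, "ready", False


class FiberIo:
    def __init__(self):
        self.clock_ms = 0   # virtual clock, never wall time
        self.ready_q = deque()
        self.timers = []    # heap of (deadline_ms, seq, park)
        self.seq = 0
        self.current = None

    def spawn_raw(self, entry, arg):
        fiber = Fiber(entry(self, arg))  # entry(io, arg) builds the fiber body
        self.ready_q.append(fiber)
        return fiber                     # opaque handle

    def suspend_raw(self, park):
        park.handle = self.current       # write back the awaiter
        self.current.state = "suspended"
        yield                            # park until ready() re-queues us
        if self.current.canceled:        # the cancellation delivery point
            raise Canceled()

    def ready(self, park):
        fiber = park.handle
        if fiber is not None and fiber.state == "suspended":
            fiber.state = "ready"
            self.ready_q.append(fiber)

    def arm_timer(self, deadline_ms, park):
        self.seq += 1
        entry = (deadline_ms, self.seq, park)
        heapq.heappush(self.timers, entry)
        return entry                     # handle a cancel could evict

    def now_ms(self):
        return self.clock_ms

    def poll(self, deadline_ms):
        while self.ready_q:              # 1. drain the ready queue
            fiber = self.current = self.ready_q.popleft()
            try:
                next(fiber.gen)
            except (StopIteration, Canceled):
                fiber.state = "done"
            self.current = None
        if self.timers and self.timers[0][0] <= deadline_ms:
            due, _, park = heapq.heappop(self.timers)  # 2. fire earliest timer
            self.clock_ms = max(self.clock_ms, due)     # virtual time jumps
            self.ready(park)
        if self.ready_q:
            return self.clock_ms
        return self.timers[0][0] if self.timers else IDLE


trace = []


def sleeper(io, ms):
    park = ParkToken()
    io.arm_timer(io.now_ms() + ms, park)
    yield from io.suspend_raw(park)
    trace.append((ms, io.now_ms()))


sched = FiberIo()
sched.spawn_raw(sleeper, 30)
sched.spawn_raw(sleeper, 10)
while sched.poll(deadline_ms=10**9) != IDLE:  # a driver: poll to quiescence
    pass
print(trace)  # [(10, 10), (30, 30)]
```

Because the clock only moves when a timer fires, the trace is the same on every run, which is the property the virtual-clock now_ms row is after.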

Scheduler.run() stays as the explicit DRIVER (the top-level loop that calls poll until quiescence). The scheduler is installed via push Context { io = xx scheduler } { … s.run(); }, exactly as in the existing sched examples, except that the scheduler is now reachable as context.io.

Status (2026-06-27)

  • Phase 0: fibers inherit the spawn-time context. DONE (2f2d7f1d). Discovered during Phase 1: a fiber body ran under __sx_default_context (the abi(.c) fib_dispatch dropped the implicit context), so a scheduler installed as context.io was invisible inside a worker. Fixed: Scheduler.spawn snapshots context → Fiber.dctx, and fib_dispatch re-pushes it. Behavior-preserving (suite 828/0), with no cross-fiber leak (context is parameter-threaded per stack). Lock: example 1822.
  • Phase 1: impl Io for Scheduler. DONE (5c30bfe0, hardened da7dd1f1). Six methods over the fiber primitives; spawn_raw bridges the erased (*void)->void worker thunk via an fn-ptr round-trip. Lock: example 1823 (spawn→arm→suspend→ready→resume entirely through context.io, deterministic). Adversarial review fixes: arm_timer/spawn_raw null guards, poll fd-pending abort plus deadline_ms doc, stale fib_dispatch comment.
  • Resolved design decisions: D1 = direct impl Io for Scheduler (chosen). D2 = now_ms returns the virtual clock_ms (deterministic); a real-clock variant comes later. D4 = deferred to Phase 3.
  • Phase 2: async/await colorblind over the fiber Io. DONE (967aed67, hardened ada8d162). async heap-allocates a *Future, boxes a completion closure in a monomorphic ThunkBox, and submits it via io.spawn_raw (run inline under CBlockingIo, as a fiber under the scheduler); await parks via suspend_raw until the future is ready. The protocol changed to suspend_raw(park: *ParkToken) (write-back of the awaiter). Workers are nullary (call-site capture). Migrated 1805/1806; adopted push .{ … }. Lock: example 1824 (deferral visible: 1 2 10 20 123). Review fixes: a one-awaiter guard in await; documented the Future allocator-lifetime contract and the fact that cancel doesn't stop an already-spawned worker (Phase 3).
    • Resolved D2 (ParkToken): suspend_raw(*ParkToken) write-back (chosen over a registry). ready() liveness (CONCERN 6): safe for single async/await (awaiter is suspended, not reaped, when readied); race fan-in must still deregister (Phase 4).
    • Deferred to convergence (Phase 5): async should capture the scheduler's long-lived allocator (like sched.go's own_allocator) instead of the call-site context.allocator. That needs a protocol affordance; documented as a contract for now.
  • Open for later phases:
    • ParkToken↔fiber binding. ready(park) needs park.handle = the awaiter *Fiber. The scheduler knows self.current at suspend; the cleanest option is suspend_raw(park: *ParkToken) writing park.handle = self.current before parking (a small protocol change: the materializer installs thunks by name/order and is signature-agnostic; verified low-risk). Decide vs a token→fiber registry (since resolved in Phase 2: write-back chosen over a registry).
    • ready() liveness (review CONCERN 6). Casting a stale/reaped *Fiber handle and wake-ing it is a latent UAF once real await runs; wake's .suspended value-check on freed bytes is luck, not safety. Phase 2 must guarantee single-ready / deregistration (mirror the bespoke-race deregister). Phase 2 covers single async/await; race fan-in deregistration remains for Phase 4.
  • Out-of-scope compiler bug found by review (not filed yet): closure free-variable analysis does not descend into a nested push Context {…} block inside a closure body, so a variable used only there is reported as unresolved. Phase 0 sidesteps it (capture happens at the Fiber level, not via a closure), so it does NOT block the unification; it is worth an issues/ entry in a separate session.
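The Phase 2 shape (async returns a PENDING future, await parks through a ParkToken write-back, at most one awaiter) can be sketched in Python with generators as fibers. `FiberIo`, `Future`, `async_` and `await_` are hypothetical stand-ins (trailing underscores only because `async`/`await` are Python keywords), and the one-awaiter guard is modeled as a `RuntimeError`. The log makes the deferral visible: the worker runs only after the caller has moved on and parked.

```python
from collections import deque


class ParkToken:
    def __init__(self):
        self.handle = None  # suspend_raw writes the awaiter here


class Future:
    def __init__(self):
        self.state = "pending"  # born pending, not .ready
        self.value = None
        self.park = ParkToken()
        self.awaited = False    # one-awaiter guard


class FiberIo:
    def __init__(self):
        self.ready_q = deque()
        self.current = None

    def spawn_raw(self, gen):
        self.ready_q.append(gen)

    def suspend_raw(self, park):
        park.handle = self.current  # write-back: who to wake later
        yield

    def ready(self, park):
        if park.handle is not None:
            self.ready_q.append(park.handle)
            park.handle = None      # a second ready() is a no-op

    def run(self):
        while self.ready_q:
            self.current = self.ready_q.popleft()
            try:
                next(self.current)
            except StopIteration:
                pass


def async_(io, worker):
    """Return a PENDING future; the nullary worker runs later on its own fiber."""
    f = Future()

    def body():
        f.value = worker()
        f.state = "ready"
        io.ready(f.park)        # wake the awaiter, if one is parked
        yield from ()           # marks body as a generator; never suspends

    io.spawn_raw(body())
    return f


def await_(io, f):
    if f.awaited:
        raise RuntimeError("a Future supports a single awaiter")
    f.awaited = True
    if f.state != "ready":
        yield from io.suspend_raw(f.park)
    return f.value


log = []


def work():
    log.append("worker ran")
    return 42


def main(io):
    f = async_(io, work)
    log.append("after async")   # the worker has not run yet
    v = yield from await_(io, f)
    log.append(f"awaited {v}")


sched = FiberIo()
sched.spawn_raw(main(sched))
sched.run()
print(log)  # ['after async', 'worker ran', 'awaited 42']
```

If the future is already ready when awaited, await_ returns without parking, and the worker's ready() finds no handle and does nothing; that is the single-ready property the liveness concern asks for in the simple async/await case.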

Phases (each: implement → lock with an example → zig build test green → both platforms)

  1. impl Io for Scheduler (the vehicle). Implement the six methods over the fiber primitives. Add a Fiber.canceled/task back-ref so suspend_raw can raise on resume. Keep CBlockingIo intact. Lock: install the fiber Io into context.io and run a root fiber that calls suspend_raw and is then ready()'d; this asserts real park/resume through the protocol (not inline execution). Bridge (the one fiddly bit): map async's generic Closure(..$args) -> $R worker onto spawn_raw's raw entry/arg pair. Box the worker thunk on the heap; entry is a C-ABI (env: *void) -> void invoke-thunk (mirrors fib_dispatch), and arg is the env.

  2. async/await over the fiber Io (real interleaving). Under a suspending Io, async calls spawn_raw and returns a PENDING Future($R) (no longer born .ready); the spawned body fills f.value/f.state and ready(f.park)s the awaiter. await(f) returns at once if f is .ready; otherwise it calls suspend_raw(f.park) and, on resume, returns the value or raises. This is the suspending sibling of today's immediate await. CBlockingIo keeps the run-inline path (degenerate, still correct). Lock: two context.io.async tasks interleave under the fiber Io (the io.sx layer, replacing the bespoke sched.go).

  3. True cancellation via suspend_raw -> !. cancel(f) flips f.canceled AND ready(f.park)s / wakes the worker fiber so its NEXT suspend_raw raises IoErr.Canceled. The worker's suspends (await, a future io.sleep) propagate via try/!; the worker body unwinds, the future ends .canceled, its post-cancel side-effects DON'T run. This is the model-A "true cancellation" — now delivered through the protocol, not bespoke. Lock: a cancelled task's work stops at its next suspend (assert via a shared log: the post-suspend line never prints).

  4. race over Futures: context.io.race((a: fa, b: fb)). Re-home the proven race logic (winner scan, deregister-all-on-wake, structured cancel+join of losers) from sched.race(*Task tuple) onto *Future handles + the Io protocol. The type-level machinery ports UNCHANGED (RaceResult($T), make_variant, the tuple reflection; GAP 1/2, all landed); only the runtime swaps *Task → *Future and suspend_self → suspend_raw/ready. Cancellation of losers now uses Phase 3 (their next suspend raises), so race returns at WINNER-time, not slowest-loser-time. Lock: re-point 1821 at context.io.race; assert winner value + losers' work stopped (not merely flagged).

  5. Converge: retire the bespoke fiber async API. Fold sched.go/wait/cancel/race into the io.sx layer; Scheduler stays as the fiber Io's engine + driver. Migrate 1811–1821 to the context.io API. One async stack, all behind the protocol. Update the roadmap/checkpoints.
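The Phase 4 shape, race over futures with losers stopped at their next suspend, can be sketched in Python under the same generator-as-fiber assumption. Everything here (`FiberIo`, `Future`, `async_`, `race`, the `waiters` list used for fan-in) is hypothetical and much simpler than the sx version: there is no RaceResult variant (race returns a `(key, value)` pair), and the structured join of losers is omitted. It keeps the parts the plan calls out: winner scan, deregister from every future, flag-and-wake the losers, and return at winner time on the virtual clock.

```python
import heapq
from collections import deque

class Canceled(Exception):
    """Stand-in for IoErr.Canceled."""

class Fiber:
    def __init__(self, gen):
        self.gen, self.canceled, self.parked = gen, False, False

class Future:
    def __init__(self):
        self.state, self.value = "pending", None  # pending | ready | canceled
        self.waiters, self.worker = [], None  # parked racers; computing fiber

class FiberIo:
    def __init__(self):
        self.clock_ms, self.seq, self.current = 0, 0, None  # virtual clock
        self.ready_q, self.timers = deque(), []  # timers: (deadline, seq, fiber)

    def spawn(self, gen):
        fiber = Fiber(gen)
        self.ready_q.append(fiber)
        return fiber

    def suspend(self):
        me = self.current
        me.parked = True
        yield
        if me.canceled:  # cancellation is delivered at the suspend point
            raise Canceled()

    def ready(self, fiber):
        if fiber.parked:  # only wake a fiber that is actually parked
            fiber.parked = False
            self.ready_q.append(fiber)

    def sleep(self, ms):
        self.seq += 1
        heapq.heappush(self.timers, (self.clock_ms + ms, self.seq, self.current))
        yield from self.suspend()

    def run(self):
        while self.ready_q or self.timers:
            while self.ready_q:
                self.current = self.ready_q.popleft()
                try:
                    next(self.current.gen)
                except (StopIteration, Canceled):
                    pass
            if self.timers:  # virtual time jumps to the earliest deadline
                self.clock_ms, _, fiber = heapq.heappop(self.timers)
                self.ready(fiber)

def async_(io, work, *args):
    f = Future()
    def body():
        try:
            f.value = yield from work(io, *args)
        except Canceled:
            f.state = "canceled"
            raise
        f.state = "ready"
        for w in list(f.waiters):  # wake every parked racer
            io.ready(w)
    f.worker = io.spawn(body())
    return f

def race(io, futures):
    """Return (key, value) of the first future to finish; cancel the rest."""
    me = io.current
    if not any(f.state == "ready" for f in futures.values()):
        for f in futures.values():  # fan-in: park on every future
            f.waiters.append(me)
        yield from io.suspend()  # woken by the first future to finish
    key, won = next((k, f) for k, f in futures.items() if f.state == "ready")
    for f in futures.values():
        f.waiters = [w for w in f.waiters if w is not me]  # deregister everywhere
        if f.state == "pending":  # loser: flag it and wake it
            f.worker.canceled = True
            io.ready(f.worker)
    return key, won.value

log, out = [], {}

def work(io, name, ms, value):
    yield from io.sleep(ms)
    log.append(f"{name} finished")  # must not run for a canceled loser
    return value

def main(io):
    fa, fb = async_(io, work, "a", 10, "A"), async_(io, work, "b", 50, "B")
    out["winner"] = yield from race(io, {"a": fa, "b": fb})
    out["at_ms"], out["fb"] = io.clock_ms, fb  # race returned at winner time

sched = FiberIo()
sched.spawn(main(sched))
sched.run()
print(out["winner"], out["at_ms"], log)  # ('a', 'A') 10 ['a finished']
```

The loser's "b finished" line never appears and its future ends "canceled": it was woken at t=10 and raised at its pending sleep, rather than merely being flagged and left to finish at t=50.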

Open decisions (need a call before/within the phase noted)

  • D1 (Phase 1) — impl Io for Scheduler vs a FiberIo wrapper. Direct impl makes context.io BE the scheduler (xx scheduler as the Io value, stateful receiver — mirrors the allocator xx local rule). A wrapper adds a level but decouples the public Io vtable from the scheduler internals. Lean: direct impl (simplest, matches the allocator convention).
  • D2 (Phase 1) — virtual vs real clock under the fiber Io. Tests need the deterministic virtual clock (clock_ms); a real deployment wants time.mono_ms. Thread it as a Scheduler mode, or two Io impls (FiberIo virtual-clock for tests, real-clock for prod). Lean: a clock: enum { virtual; real } field so one impl serves both; tests pin .virtual.
  • D3 (Phase 2) — Future(void) (issue 0150 SIGTRAP). A void-result task can't build Future(void) today. Defer (race/async target non-void), or fix the void struct-field path. Lean: defer, gate with a diagnostic.
  • D4 (Phase 3) — where the cancel flag lives. The Future already has canceled: Atomic(bool); the fiber needs to reach it from suspend_raw. Give Fiber a *Atomic(bool) back-ref to its future's flag (set at spawn_raw), so suspend_raw consults it with no per-suspend lookup. Lean: back-ref pointer.
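D2's lean (one impl with a clock mode field, tests pinned to virtual) can be sketched in Python. `Clock`, `Scheduler` and `wait_until` are hypothetical names; `time.monotonic_ns` stands in for the sx-side time.mono_ms. The virtual mode jumps the clock straight to the next deadline without waiting, which is what keeps timer-heavy tests fast and byte-identical across hosts; the real mode actually sleeps.

```python
import time
from enum import Enum


class Clock(Enum):
    VIRTUAL = "virtual"  # deterministic: moved only by the scheduler
    REAL = "real"        # monotonic real time


class Scheduler:
    def __init__(self, clock=Clock.VIRTUAL):
        self.clock = clock
        self.clock_ms = 0

    def now_ms(self):
        if self.clock is Clock.VIRTUAL:
            return self.clock_ms
        return time.monotonic_ns() // 1_000_000

    def wait_until(self, deadline_ms):
        """Called by poll when the earliest timer is the next thing to do."""
        if self.clock is Clock.VIRTUAL:
            self.clock_ms = max(self.clock_ms, deadline_ms)  # jump, no waiting
        else:
            delay_ms = deadline_ms - self.now_ms()
            if delay_ms > 0:
                time.sleep(delay_ms / 1000)                  # really wait


tests = Scheduler()              # tests pin the virtual clock
start = time.monotonic()
tests.wait_until(60_000)         # "sleeps" a minute instantly
print(tests.now_ms(), time.monotonic() - start < 1.0)  # 60000 True
```

Keeping the mode as a field rather than a second Io impl means the run loop, timers and cancellation paths are shared; only now_ms and the idle wait branch on it.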

Validation (every phase)

  • zig build && zig build test green (full corpus).
  • New/changed 18xx examples byte-identical on aarch64-macOS host AND aarch64-linux container (deterministic virtual clock).
  • Adversarial review of each phase (worker + read-only reviewer), per the session workflow.

What this supersedes

  • sched.sx's bespoke go/wait/cancel/race (Phase 5 retires them; the proven logic moves onto the protocol). The just-landed race (commit 9099735e) is the reference logic for Phase 4, not the final home.
  • PLAN-RACE.md's "race on sched.Scheduler" framing — this plan moves it onto context.io per the roadmap's §A5 / §4.6 design-of-record.