sx/current/PLAN-IO-UNIFY.md
agra aae7d72a66 refactor: retire bespoke Task async; one stack behind context.io (Phase 5)
Converge the Io unification (PLAN-IO-UNIFY Phase 5). The bespoke fiber-task layer
in sched.sx — Task / TaskState / TaskErr / go / wait / cancel(Task), plus
Scheduler.task_allocs and its deinit bookkeeping (~130 lines) — is removed. There
is now ONE async stack: context.io.async / await / cancel / race / sleep over the
Io protocol, with the Scheduler as the fiber Io's engine + driver (spawn /
yield_now / suspend_self / wake / run / block_on_fd remain as the raw primitives;
race stays in sched.sx because it needs meta.sx's make_enum/make_variant).

Migrated the four go/wait users to context.io:
- 1813 — interleave + cancel (sequence 1 2 3 42 100 -99)
- 1817 — m1 end-to-end (completion in deadline order, sum 123)
- 1819 — double-AWAIT loud-abort via the Future one-awaiter guard
- 1820 — deinit: dropped the go/task_allocs tasks; now exercises timers/io_waiters/
  kq cleanup (freed=2, live=3 = the documented per-spawn closure-env residual)

Updated readme.md (the user-facing async section documents context.io.async /
await / race / sleep) and the stale sched.go/sched.Task comments in io.sx.

Suite 854/0; no .ir churn (Task removal touched no snapshotted IR); migrated
examples byte-identical on aarch64-macOS + aarch64-linux. PLAN-IO-UNIFY Phases 0-5
all complete — the two parallel async stacks are now one, behind context.io.
2026-06-28 10:14:17 +03:00


PLAN-IO-UNIFY — fold the fiber scheduler behind context.io, re-home race

Why

Today there are two parallel async stacks:

| stack | behind context.io? | real suspension? | cancellation channel |
|---|---|---|---|
| io.sx async/await/cancel/Future | yes (impl Io for CBlockingIo) | no — runs the worker inline to completion | suspend_raw -> ! / IoErr.Canceled (designed, unused) |
| sched.sx go/wait/cancel/race (just landed) | no | yes (swap_context fibers) | none — suspend_self -> void |

context.io is structurally Zig's std.Io (an Io protocol carried implicitly in Context — better ergonomics than Zig's explicit io: param), and the roadmap (§A5, §4.6) already says the fiber scheduler should be one of its Io vtables and that race is context.io.race(..) over Futures. The just-landed race on sched.Scheduler over *Task is the proven LOGIC at the wrong LAYER.

Goal: make the fiber Scheduler an impl Io, lift async/await/cancel/race onto the Io protocol so they run colorblind under either impl, and let cancellation fall out of the existing suspend_raw -> ! contract (the "true cancellation, model A" the user picked — already the interface's design). One async stack, behind context.io.

The fiber → Io mapping (the crux)

Io :: protocol { spawn_raw, suspend_raw -> !, ready, poll, now_ms, arm_timer } (core.sx). Map each onto the existing fiber primitives in sched.sx (spawn/suspend_self/wake/sleep/block_on_fd/run):

| Io method | fiber realization |
|---|---|
| spawn_raw(entry, arg, opts) -> *void | spawn a fiber whose body invokes entry(arg) (raw C-ABI thunk, not a closure — see Bridge below). Returns the *Fiber as the opaque handle. |
| suspend_raw(park) -> ! | suspend_self(), then on resume CHECK the current task's cancel flag and raise IoErr.Canceled if set. park.handle = the *Fiber to re-ready. This is the cancellation delivery point. |
| ready(park) | wake(park.handle as *Fiber) (already guarded on .suspended). |
| arm_timer(deadline_ms, park) -> *void | arm a Timer{deadline, fiber=park.handle} (today's sleep minus the self-suspend); return the timer handle so a cancel can evict it. |
| poll(deadline_ms) -> i64 | ONE iteration of the run loop: drain ready, then fire the earliest timer / block on fds up to deadline_ms. Returns the next pending deadline (or sentinel when idle). |
| now_ms() -> i64 | the virtual clock_ms (deterministic), NOT a wall clock — keeps 1817/1821-style tests reproducible. |

Scheduler.run() stays as the explicit DRIVER (the top-level loop that calls poll to quiescence), installed via push Context { io = xx scheduler } { … s.run(); } — exactly the existing sched examples, just with the scheduler now reachable as context.io.
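The mapping above can be modelled in a few lines. The following is a minimal Python sketch, not the sx implementation: generators stand in for fibers, `yield` stands in for a park, and every name (`ToyScheduler`, `sleep`, `worker`) is hypothetical. It omits opts, fd blocking, and the deadline argument of poll, and poll returns a "work remains" flag rather than the next deadline.

```python
import heapq
from collections import deque

class ToyScheduler:
    """Toy fiber scheduler: generators stand in for fibers, `yield` for a park."""

    def __init__(self):
        self.clock_ms = 0        # virtual clock; only poll() advances it
        self.ready_q = deque()   # fibers that can run now
        self.timers = []         # min-heap of (deadline_ms, seq, fiber)
        self.seq = 0             # tie-breaker so equal deadlines stay ordered
        self.current = None      # fiber being run, used as the park handle

    def spawn_raw(self, entry, arg):
        fiber = entry(arg)       # builds the generator; nothing runs yet
        self.ready_q.append(fiber)
        return fiber             # opaque handle

    def ready(self, fiber):
        self.ready_q.append(fiber)

    def arm_timer(self, deadline_ms, fiber):
        self.seq += 1
        heapq.heappush(self.timers, (deadline_ms, self.seq, fiber))

    def now_ms(self):
        return self.clock_ms

    def poll(self):
        """One loop iteration: drain ready fibers, then fire the earliest timer.

        Returns True while work remains (a simplification of returning the
        next pending deadline).
        """
        while self.ready_q:
            self.current = self.ready_q.popleft()
            try:
                next(self.current)   # run until the next park (yield)
            except StopIteration:
                pass                 # fiber body finished
        self.current = None
        if self.timers:
            deadline, _, fiber = heapq.heappop(self.timers)
            self.clock_ms = max(self.clock_ms, deadline)
            self.ready(fiber)
        return bool(self.ready_q or self.timers)

    def run(self):
        """The explicit driver: call poll() until nothing is left to do."""
        while self.poll():
            pass

def sleep(sched, ms):
    """arm_timer + park: resume once the virtual clock reaches the deadline."""
    sched.arm_timer(sched.now_ms() + ms, sched.current)
    yield

sched = ToyScheduler()
log = []

def worker(arg):
    name, delay_ms = arg
    yield from sleep(sched, delay_ms)
    log.append((name, sched.now_ms()))

for spec in [("a", 30), ("b", 10), ("c", 20)]:
    sched.spawn_raw(worker, spec)
sched.run()
print(log)   # [('b', 10), ('c', 20), ('a', 30)]
```

Because the clock only moves when a timer fires, the completion order depends on the deadlines alone, which is the property that keeps clock-based examples reproducible across platforms.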

Status (2026-06-28)

  • Phase 5 — CONVERGE: retire the bespoke fiber async API. DONE. Io unification COMPLETE. The bespoke Task layer (Task/TaskState/TaskErr/go/wait/cancel(Task) + Scheduler.task_allocs and its deinit handling, ~130 lines) is removed from sched.sx. There is now ONE async stack: context.io.async/await/cancel/race/sleep over the Io protocol, with the Scheduler as the fiber Io's engine + driver (spawn/yield_now/suspend_self/wake/run/block_on_fd stay as the raw primitives). Migrated the four go/wait users to context.io: 1813 (interleave + cancel), 1817 (m1 end-to-end sum=123), 1819 (double-AWAIT loud-abort via the Future one-awaiter guard), 1820 (deinit — the go/task_allocs tasks dropped; it now exercises timers/io_waiters/kq cleanup, freed=2/live=3). race stays in sched.sx (needs meta.sx). Updated readme.md (the user-facing async section now documents context.io.async/await/race/sleep) and the stale sched.go/sched.Task comments in io.sx. Suite 854/0; no .ir churn (the Task removal touched no snapshotted IR); migrated examples byte-identical on aarch64-macOS + aarch64-linux. PLAN-IO-UNIFY Phases 0-5 all complete — the two parallel async stacks are now one, behind context.io.

  • Phase 4 — race over Futures via context.io.race. DONE. Re-homed the proven first-wins race from sched.race(*Task) onto *Future handles + the Io protocol; the old Task-based race is REPLACED (ufcs overload-by-receiver is rejected — "duplicate top-level decl" — and only 1821 used it).

    • Protocol affordance: added Io.current_park() -> ParkToken (the running fiber as a token, captured WITHOUT parking) so race can register the SAME coordinator across N futures' park slots, then park once via suspend_raw; any completion readys it. Scheduler returns {self.current} (bails outside a fiber); CBlockingIo returns {null} (race never parks there — futures born .ready). The await comment already anticipated this fan-in.
    • race (ufcs (io: Io, futures: $T) -> RaceResult(T), in sched.sx — it needs meta.sx's make_enum/make_variant, and pulling that into the io.sx prelude part-file would cycle): winner scan → register+park → deregister → make_variant the winner → Phase-3 cancel each loser (NO join). RaceResult reused unchanged (*Future(R) projects field 0 value → R).
    • Winner-time return: with true cancellation the parked losers stop at their next suspend (their timers evicted by cancel's wake), so race returns at the winner's virtual time, not the slowest loser's. 1821 re-pointed to context.io.async + context.io.race: winner a=111, losers .canceled, completion log ONLY task 1 @ 10ms, final clock 10ms (was 30 under the old cooperative join). Byte-identical on aarch64-macOS + aarch64-linux. Suite 853/0; .ir churn (current_park vtable method) regenerated, only 1821 stdout changed otherwise.
  • Phase 3 — TRUE cancellation via suspend_raw -> !. DONE. A cancelled async worker now abandons its body at its next suspend instead of running to completion. Pieces:

    • Cancel-flag back-ref (D4 — back-ref pointer, chosen): SpawnOpts.cancel_flag: *void (core.sx) + Fiber.cancel_flag: *void (sched.sx), set from opts.cancel_flag in Scheduler.spawn_raw. async passes xx @f.canceled (the Future.canceled Atomic(bool) erased to *void).
    • Delivery: Scheduler.suspend_raw checks fiber_canceled(self.current) (a *Atomic(bool) load) PRE-park (raise without parking — no deadlock if cancel landed before the worker ran) and POST-resume (cancel landed while parked), raising error.Canceled (a bare -> !; set inferred). cancel(f) flips the sticky flag, marks .canceled, and ready(.{handle=f.task})s the worker.
    • Worker is failable Closure() -> ($R, !): the async completion closure f.value = worker() catch { … } (the captured-failable-closure-call the Phase-3-prereq fix enabled) marks .canceled/.failed and wakes the awaiter; the worker's post-suspend side effects never run. New failable io.sleep(ms) (arm_timer + try suspend_raw) is the cancellation point.
    • Compiler gap fixed: a -> ! fn whose only error source is try-ing a protocol method (io.suspend_raw) was wrongly flagged "declared ! but never errors". collectErrorSites (error_analysis.zig) now sets a dyn flag for a try of a non-identifier callee (opaque error channel), suppressing the warning.
    • Two UAFs found by adversarial review and FIXED: (1) cancel-before-park orphaned io.sleep's armed timer → suspend_raw's pre-park raise now evicts the current fiber's timer/waiter first. (2) cancel(f) woke a possibly-reaped worker → now only wakes when was_pending (.pending before the store).
    • Migrated 1805/1806/1824 to failable workers. Lock: examples/concurrency/1825-concurrency-fiber-cancel-suspend.sx (seq: 1 -99 — post-suspend line never runs). Validated byte-identical on aarch64-macOS host AND aarch64-linux container (1824 + 1825). Suite 853/0. Expected .ir churn (SpawnOpts layout) regenerated; no non-.ir snapshot changed.
  • Phase 3 PREREQUISITE — captured-failable-closure call typing. DONE. The async completion closure (b.run = () => { f.value = worker() catch {…} }) captures a failable worker and consumes its error channel; the free-variable capture analysis (collectCaptures in src/ir/lower/closure.zig) did not descend into the error-handling / context / asm / multi-assign nodes, so worker was never captured — inside the lambda it resolved against an empty scope and typed as .unresolved (catch/try then rejected it). Fixed: added try_expr, catch_expr, onfail_stmt, raise_stmt, multi_assign, push_stmt, comptime_expr, insert_expr, spread_expr, asm_expr arms to collectCaptures. Adversarially reviewed (captures resolve, locals correctly excluded, no false-positive captures, 851/0). Lock: example examples/closures/0314-closures-capture-failable-call.sx (catch + try over a captured failable closure; pure language feature, host-only). The push_stmt arm also fixes the previously-noted "free-var analysis doesn't descend into a nested push Context {…}" gap. Phase 3 is now unblocked.

    • Two PRE-EXISTING, orthogonal bugs surfaced during review (neither blocked Phase 3): (1) calling a closure stored in a struct data field typed as unresolved (value → garbage; failable → can't catch) — RESOLVED (issues/0201): CallResolver.plan gained a closure/fn-pointer field arm and the lowering closure-field arm now also handles bare .function fields; regression examples/closures/0315-closures-struct-field-call.sx. (2) asm write-through place through a deref (asm { … "+r" -> @(p.*) }) fails LLVM verification — repros with NO closure (independent of capture analysis); possibly an unsupported deref-place form rather than a confirmed bug, not filed.
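The Phase 3 delivery rule (check the cancel flag pre-park and post-resume, and only wake a worker that was still pending) can be illustrated with a small Python model. This is a sketch under assumptions, not the sx code: generators play the role of fibers, and `Future`, `async_`, `cancel`, `run`, and `worker` are hypothetical stand-ins that ignore timers and awaiters.

```python
from collections import deque

class Canceled(Exception):
    """Raised at a suspend point once the future's cancel flag is set."""

class Future:
    def __init__(self):
        self.state = "pending"   # pending | ready | canceled
        self.value = None
        self.canceled = False    # sticky cancel flag
        self.fiber = None        # the worker generator

ready_q = deque()
log = []

def suspend_raw(fut):
    if fut.canceled:             # pre-park: cancel landed before we parked
        raise Canceled()
    yield                        # park
    if fut.canceled:             # post-resume: cancel landed while parked
        raise Canceled()

def async_(worker):
    fut = Future()
    def completion():
        try:
            fut.value = yield from worker(fut)
            fut.state = "ready"
        except Canceled:
            fut.state = "canceled"   # worker body was abandoned
    fut.fiber = completion()
    ready_q.append(fut.fiber)
    return fut

def cancel(fut):
    was_pending = fut.state == "pending"
    fut.canceled = True
    if was_pending:              # never wake a worker that already finished
        ready_q.append(fut.fiber)

def run():
    while ready_q:
        fiber = ready_q.popleft()
        try:
            next(fiber)
        except StopIteration:
            pass

def worker(fut):
    log.append(1)
    yield from suspend_raw(fut)
    log.append(2)                # post-suspend line: skipped when cancelled
    return 42

f = async_(worker)
run()        # worker logs 1, then parks
cancel(f)    # flips the flag and wakes the parked worker
run()        # worker resumes, sees the flag, unwinds
print(log, f.state)   # [1] canceled
```

The pre-park check matters when cancel lands before the worker has ever run: the worker then raises at its first suspend without parking, so nothing is left waiting for a wake that never comes.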

Status (2026-06-27)

  • Phase 0 — fibers inherit the spawn-time context. DONE (2f2d7f1d). Discovered during Phase 1: a fiber body ran under __sx_default_context (the abi(.c) fib_dispatch dropped the implicit context), so a scheduler installed as context.io was invisible inside a worker. Fixed: Scheduler.spawn snapshots context into Fiber.dctx; fib_dispatch re-pushes it. Behavior-preserving (suite 828/0), no cross-fiber leak (context is parameter-threaded per stack). Lock: example 1822.
  • Phase 1 — impl Io for Scheduler. DONE (5c30bfe0, hardened da7dd1f1). Six methods over the fiber primitives; spawn_raw bridges the erased (*void)->void worker thunk via an fn-ptr round-trip. Lock: example 1823 (spawn→arm→suspend→ready→resume entirely through context.io, deterministic). Adversarial review fixed: arm_timer/spawn_raw null guards, poll fd-pending abort + deadline_ms doc, stale fib_dispatch comment.
  • Resolved design decisions: D1 = direct impl Io for Scheduler (chosen). D2 = now_ms returns the virtual clock_ms (deterministic) — a real-clock variant is later. D4 = deferred to Phase 3.
  • Phase 2 — async/await colorblind over the fiber Io. DONE (967aed67, hardened ada8d162). async heap-allocs a *Future, boxes a completion closure in a monomorphic ThunkBox, and submits via io.spawn_raw (inline under CBlockingIo, a fiber under the scheduler); await parks via suspend_raw until ready. Protocol changed to suspend_raw(park: *ParkToken) (write-back of the awaiter). Workers are nullary (call-site capture). Migrated 1805/1806; adopted push .{ … }. Lock: example 1824 (deferral visible: 1 2 10 20 123). Review fixed: one-awaiter await guard; documented the Future allocator-lifetime contract + that cancel doesn't stop an already-spawned worker (Phase 3).
    • Resolved D2 (ParkToken): suspend_raw(*ParkToken) write-back (chosen over a registry). ready() liveness (CONCERN 6): safe for single async/await (awaiter is suspended, not reaped, when readied); race fan-in must still deregister (Phase 4).
    • Carried to convergence: async should capture the scheduler's long-lived allocator (like sched.go's own_allocator) instead of the call-site context.allocator — needs a protocol affordance; documented as a contract for now.
  • Open for later phases:
    • ParkToken↔fiber binding. ready(park) needs park.handle = the awaiter *Fiber. The scheduler knows self.current at suspend; the cleanest is suspend_raw(park: *ParkToken) writing park.handle = self.current before parking (a small protocol change: the materializer installs thunks by name/order, signature-agnostic — verified low-risk). Decide vs a token→fiber registry.
    • ready() liveness (review CONCERN 6). Casting a stale/reaped *Fiber handle and wake-ing it is a latent UAF once real await runs — wake's .suspended value-check on freed bytes is luck, not safety. Phase 2 must guarantee single-ready / deregistration (mirror the bespoke-race deregister).
  • Out-of-scope compiler bug found by review (not filed yet): closure free-var analysis does not descend into a nested push Context {…} block inside a closure body — a var used only there reports unresolved. Phase 0 sidesteps it (capture is at the Fiber level, not via closure), so it does NOT block the unification; worth an issues/ entry in a separate session.
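The Phase 2 shape described above (a pending Future, a ParkToken written back by suspend_raw, a single ready per park, and a one-awaiter guard) can be sketched as follows. This is an illustrative Python model under assumptions: all names are hypothetical, generators stand in for fibers, and the guard raises an exception where the plan describes a loud abort. The worker values were chosen to echo the sequence quoted for example 1824; that example's source is not reproduced here.

```python
from collections import deque

class ParkToken:
    def __init__(self):
        self.handle = None           # written by suspend_raw: the parked fiber

class Future:
    def __init__(self):
        self.state = "pending"       # pending | ready
        self.value = None
        self.park = ParkToken()
        self.awaited = False         # one-awaiter guard

class Sched:
    def __init__(self):
        self.ready_q = deque()
        self.current = None

    def spawn_raw(self, fiber):
        self.ready_q.append(fiber)
        return fiber

    def yield_now(self):
        self.ready_q.append(self.current)   # requeue self, then give way
        yield

    def suspend_raw(self, park):
        park.handle = self.current   # write-back of the awaiter
        yield                        # park

    def ready(self, park):
        if park.handle is not None:
            self.ready_q.append(park.handle)
            park.handle = None       # single ready: deregister after waking

    def run(self):
        while self.ready_q:
            self.current = self.ready_q.popleft()
            try:
                next(self.current)
            except StopIteration:
                pass

def async_(sched, worker):
    fut = Future()
    def completion():
        fut.value = yield from worker()
        fut.state = "ready"
        sched.ready(fut.park)        # wake the awaiter, if one is parked
    sched.spawn_raw(completion())
    return fut                       # returned while still pending

def await_(sched, fut):
    if fut.awaited:
        raise RuntimeError("future already awaited")
    fut.awaited = True
    if fut.state != "ready":
        yield from sched.suspend_raw(fut.park)
    return fut.value

sched, log, futs = Sched(), [], []

def make_worker(tag, value):
    def worker():
        log.append(tag)
        yield from sched.yield_now()
        log.append(tag * 10)
        return value
    return worker

def root():
    futs.append(async_(sched, make_worker(1, 100)))
    futs.append(async_(sched, make_worker(2, 23)))
    a = yield from await_(sched, futs[0])   # parks until worker 1 finishes
    b = yield from await_(sched, futs[1])   # already ready: no park
    log.append(a + b)

sched.spawn_raw(root())
sched.run()
print(log)   # [1, 2, 10, 20, 123]
```

Clearing `park.handle` inside `ready` is the sketch's version of the single-ready guarantee: a second completion or a stale wake finds no handle and does nothing.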

Phases (each: implement → lock with an example → zig build test green → both platforms)

  1. impl Io for Scheduler (the vehicle). Implement the six methods over the fiber primitives. Add a Fiber.canceled/task back-ref so suspend_raw can raise on resume. Keep CBlockingIo intact. Lock: install the fiber Io into context.io, run a root fiber that suspend_raws and is ready()'d — asserts real park/resume through the protocol (not inline). Bridge (the one fiddly bit): async's generic Closure(..$args) -> $R worker → spawn_raw's raw entry/arg. Box the worker thunk on the heap; entry is a C-ABI (env: *void) -> void invoke-thunk (mirrors fib_dispatch), arg is the env.

  2. async/await over the fiber Io (real interleaving). Under a suspending Io, async calls spawn_raw and returns a PENDING Future($R) (no longer born .ready); the spawned body fills f.value/f.state and ready(f.park)s the awaiter. await(f) checks .ready else suspend_raw(f.park) then returns/raises — the suspending sibling of today's immediate await. CBlockingIo keeps the run-inline path (degenerate, still correct). Lock: two context.io.async tasks interleave under the fiber Io (the io.sx layer, replacing the bespoke sched.go).

  3. True cancellation via suspend_raw -> !. cancel(f) flips f.canceled AND ready(f.park)s / wakes the worker fiber so its NEXT suspend_raw raises IoErr.Canceled. The worker's suspends (await, a future io.sleep) propagate via try/!; the worker body unwinds, the future ends .canceled, its post-cancel side-effects DON'T run. This is the model-A "true cancellation" — now delivered through the protocol, not bespoke. Lock: a cancelled task's work stops at its next suspend (assert via a shared log: the post-suspend line never prints).

  4. race over Futures — context.io.race((a: fa, b: fb)). Re-home the proven race logic (winner scan, deregister-all-on-wake, structured cancel+join of losers) from sched.race(*Task tuple) onto *Future handles + the Io protocol. The type-level machinery ports UNCHANGED — RaceResult($T), make_variant, the tuple reflection (GAP 1/2, all landed) — only the runtime swaps *Task → *Future and suspend_self → suspend_raw/ready. Cancellation of losers now uses Phase 3 (their next suspend raises), so race returns at WINNER-time, not slowest-loser-time. Lock: re-point 1821 at context.io.race; assert winner value + losers' work stopped (not merely flagged).

  5. Converge — retire the bespoke fiber async API. Fold sched.go/wait/cancel/race into the io.sx layer; Scheduler stays as the fiber Io's engine + driver. Migrate 1811-1821 to the context.io API. One async stack, all behind the protocol. Update the roadmap/checkpoints.
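The race flow in phase 4 (winner scan, register one coordinator on every future, park once, deregister, cancel the losers) and its winner-time return can be modelled in Python. This is a sketch under assumptions rather than the sx implementation: `Task`, `Sched`, `race`, and the worker helpers are hypothetical, the result is a plain tuple instead of a RaceResult variant, and stale timers are skipped lazily instead of being evicted. The winner value 111 and the 10 ms / 30 ms deadlines follow the figures quoted for example 1821; the other values are made up.

```python
import heapq
from collections import deque

class Canceled(Exception):
    """Raised inside a worker at its next suspend after cancel()."""

class Task:
    def __init__(self, sched, worker):
        self.state = "pending"   # pending | ready | canceled
        self.value = None
        self.canceled = False    # sticky cancel flag
        self.waiter = None       # coordinator registered on this task, if any
        self.done = False        # generator has run to its end
        self.gen = self._body(sched, worker)

    def _body(self, sched, worker):
        try:
            self.value = yield from worker(sched, self)
            self.state = "ready"
        except Canceled:
            self.state = "canceled"
        if self.waiter is not None:
            sched.ready_q.append(self.waiter)   # wake the parked coordinator

class Sched:
    def __init__(self):
        self.clock_ms, self.seq, self.current = 0, 0, None
        self.ready_q, self.timers = deque(), []

    def spawn(self, worker):
        task = Task(self, worker)
        self.ready_q.append(task)
        return task

    def sleep(self, task, ms):
        if task.canceled:
            raise Canceled()
        self.seq += 1
        heapq.heappush(self.timers, (self.clock_ms + ms, self.seq, task))
        yield                    # park until the timer or a cancel wakes us
        if task.canceled:
            raise Canceled()

    def cancel(self, task):
        was_pending = task.state == "pending"
        task.canceled = True
        if was_pending:
            self.ready_q.append(task)   # wake it so its suspend raises

    def run(self):
        while self.ready_q or self.timers:
            while self.ready_q:
                task = self.ready_q.popleft()
                if task.done:
                    continue
                self.current = task
                try:
                    next(task.gen)
                except StopIteration:
                    task.done = True
            while self.timers:
                deadline, _, task = heapq.heappop(self.timers)
                if task.done:
                    continue     # stale timer of a finished or cancelled task
                self.clock_ms = deadline
                self.ready_q.append(task)
                break

def race(sched, futures):
    me = sched.current
    while True:
        for name, task in futures.items():       # winner scan
            if task.state == "ready":
                for other in futures.values():
                    other.waiter = None           # deregister everywhere
                    if other is not task:
                        sched.cancel(other)       # cancel losers, no join
                return name, task.value
        for task in futures.values():
            task.waiter = me                      # same coordinator on each
        yield                                     # park once

sched, log, result, futures = Sched(), [], [], {}

def make_worker(name, delay_ms, value):
    def worker(s, task):
        yield from s.sleep(task, delay_ms)
        log.append((name, s.clock_ms))   # runs only if not cancelled
        return value
    return worker

def root(s, task):
    for name, ms, value in [("a", 10, 111), ("b", 20, 222), ("c", 30, 333)]:
        futures[name] = s.spawn(make_worker(name, ms, value))
    result.append((yield from race(s, futures)))

sched.spawn(root)
sched.run()
print(result, log, sched.clock_ms)   # [('a', 111)] [('a', 10)] 10
```

The final clock reads 10 rather than 30 because the cancelled losers unwind at their parked suspend and their timers are never fired, so nothing advances the virtual clock past the winner's deadline.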

Open decisions (need a call before/within the phase noted)

  • D1 (Phase 1) — impl Io for Scheduler vs a FiberIo wrapper. Direct impl makes context.io BE the scheduler (xx scheduler as the Io value, stateful receiver — mirrors the allocator xx local rule). A wrapper adds a level but decouples the public Io vtable from the scheduler internals. Lean: direct impl (simplest, matches the allocator convention).
  • D2 (Phase 1) — virtual vs real clock under the fiber Io. Tests need the deterministic virtual clock (clock_ms); a real deployment wants time.mono_ms. Thread it as a Scheduler mode, or two Io impls (FiberIo virtual-clock for tests, real-clock for prod). Lean: a clock: enum { virtual; real } field so one impl serves both; tests pin .virtual.
  • D3 (Phase 2) — Future(void) (issue 0150 SIGTRAP). A void-result task can't build Future(void) today. Defer (race/async target non-void), or fix the void struct-field path. Lean: defer, gate with a diagnostic.
  • D4 (Phase 3) — where the cancel flag lives. The Future already has canceled: Atomic(bool); the fiber needs to reach it from suspend_raw. Give Fiber a *Atomic(bool) back-ref to its future's flag (set at spawn_raw), so suspend_raw consults it with no per-suspend lookup. Lean: back-ref pointer.
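The D2 lean (one implementation with a clock mode field, tests pinned to the virtual mode) might look like the following Python sketch. The `ClockMode` and `Clock` names and the `advance_to` method are hypothetical illustrations of the option, not part of the plan's API; the real mode here uses Python's monotonic clock as a stand-in for time.mono_ms.

```python
import time
from enum import Enum

class ClockMode(Enum):
    VIRTUAL = "virtual"   # deterministic: time moves only when told to
    REAL = "real"         # monotonic wall time

class Clock:
    def __init__(self, mode):
        self.mode = mode
        self.virtual_ms = 0

    def now_ms(self):
        if self.mode is ClockMode.VIRTUAL:
            return self.virtual_ms
        return time.monotonic_ns() // 1_000_000

    def advance_to(self, deadline_ms):
        """Virtual mode jumps to the deadline; real mode sleeps until it."""
        if self.mode is ClockMode.VIRTUAL:
            self.virtual_ms = max(self.virtual_ms, deadline_ms)
        else:
            remaining_ms = deadline_ms - self.now_ms()
            if remaining_ms > 0:
                time.sleep(remaining_ms / 1000)

test_clock = Clock(ClockMode.VIRTUAL)
test_clock.advance_to(30)
print(test_clock.now_ms())   # 30
```

With this shape a scheduler would call `advance_to` wherever it currently jumps the virtual clock to the earliest timer, so the run loop stays identical in both modes and only the clock object differs.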

Validation (every phase)

  • zig build && zig build test green (full corpus).
  • New/changed 18xx examples byte-identical on aarch64-macOS host AND aarch64-linux container (deterministic virtual clock).
  • Adversarial review of each phase (worker + read-only reviewer), per the session workflow.

What this supersedes

  • sched.sx's bespoke go/wait/cancel/race (Phase 5 retires them; the proven logic moves onto the protocol). The just-landed race (commit 9099735e) is the reference logic for Phase 4, not the final home.
  • PLAN-RACE.md's "race on sched.Scheduler" framing — this plan moves it onto context.io per the roadmap's §A5 / §4.6 design-of-record.