sx/current/PLAN-IO-UNIFY.md
agra 8bacb2b01c feat: true cancellation for the fiber Io layer (PLAN-IO-UNIFY Phase 3)
A cancelled async worker now abandons its body at its next suspend instead
of running to completion.

- Cancel-flag back-ref (D4): SpawnOpts.cancel_flag (core.sx) + Fiber.cancel_flag
  (sched.sx), set from opts.cancel_flag in Scheduler.spawn_raw; async passes
  xx @f.canceled (the Future.canceled Atomic(bool) erased to *void).
- Delivery: Scheduler.suspend_raw consults fiber_canceled(self.current) PRE-park
  (raise without parking — no deadlock if cancel landed before the worker ran)
  and POST-resume (cancel landed while parked), raising error.Canceled.
  cancel(f) flips the sticky flag, marks .canceled, and wakes the worker.
- async worker is failable Closure() -> ($R, !); the completion closure
  f.value = worker() catch {…} marks .canceled/.failed and wakes the awaiter,
  so post-suspend side effects never run. New failable io.sleep(ms) is the
  cancellation point.
- Compiler: a -> ! fn whose only error source is try-ing a protocol method
  (io.suspend_raw) was wrongly flagged 'declared ! but never errors';
  collectErrorSites now marks a try of a non-identifier callee as a dynamic
  (opaque) error source, suppressing the warning.
- Two UAFs found by adversarial review and fixed: (1) cancel-before-park
  orphaned io.sleep's armed timer — suspend_raw's pre-park raise now evicts the
  current fiber's timer/waiter first; (2) cancel(f) could wake a reaped worker —
  now only wakes when was_pending.

Migrated 1805/1806/1824 to failable workers. Lock: example 1825 (seq: 1 -99,
post-suspend line never runs); byte-identical on aarch64-macOS + aarch64-linux.
.ir churn is the SpawnOpts layout change (type-table string renumbering).
2026-06-28 09:19:01 +03:00


PLAN-IO-UNIFY — fold the fiber scheduler behind context.io, re-home race

Why

Today there are two parallel async stacks:

| stack | behind context.io? | real suspension? | cancellation channel |
| --- | --- | --- | --- |
| io.sx async/await/cancel/Future | yes (impl Io for CBlockingIo) | no: runs the worker inline to completion | suspend_raw -> ! / IoErr.Canceled (designed, unused) |
| sched.sx go/wait/cancel/race (just landed) | no | yes (swap_context fibers) | none: suspend_self -> void |

context.io is structurally Zig's std.Io (an Io protocol carried implicitly in Context — better ergonomics than Zig's explicit io: param), and the roadmap (§A5, §4.6) already says the fiber scheduler should be one of its Io vtables and that race is context.io.race(..) over Futures. The just-landed race on sched.Scheduler over *Task is the proven LOGIC at the wrong LAYER.

Goal: make the fiber Scheduler an impl Io, lift async/await/cancel/race onto the Io protocol so they run colorblind under either impl, and let cancellation fall out of the existing suspend_raw -> ! contract (the "true cancellation, model A" the user picked — already the interface's design). One async stack, behind context.io.

The fiber → Io mapping (the crux)

Io :: protocol { spawn_raw, suspend_raw -> !, ready, poll, now_ms, arm_timer } (core.sx). Map each onto the existing fiber primitives in sched.sx (spawn/suspend_self/wake/sleep/block_on_fd/run):

| Io method | fiber realization |
| --- | --- |
| spawn_raw(entry, arg, opts) -> *void | spawn a fiber whose body invokes entry(arg) (raw C-ABI thunk, not a closure; see Bridge below). Returns the *Fiber as the opaque handle. |
| suspend_raw(park) -> ! | suspend_self(), then on resume CHECK the current task's cancel flag and raise IoErr.Canceled if set. park.handle = the *Fiber to re-ready. This is the cancellation delivery point. |
| ready(park) | wake(park.handle as *Fiber) (already guarded on .suspended). |
| arm_timer(deadline_ms, park) -> *void | arm a Timer{deadline, fiber=park.handle} (today's sleep minus the self-suspend); return the timer handle so a cancel can evict it. |
| poll(deadline_ms) -> i64 | ONE iteration of the run loop: drain ready, then fire the earliest timer / block on fds up to deadline_ms. Returns the next pending deadline (or sentinel when idle). |
| now_ms() -> i64 | the virtual clock_ms (deterministic), NOT a wall clock; this keeps 1817/1821-style tests reproducible. |

Scheduler.run() stays as the explicit DRIVER (the top-level loop that calls poll to quiescence), installed via push Context { io = xx scheduler } { … s.run(); } — exactly the existing sched examples, just with the scheduler now reachable as context.io.
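
To make the mapping concrete, here is a minimal Python model of the six methods. It is a sketch, not the sx implementation: Python generators stand in for swap_context fibers, so a worker suspends with `yield from io.suspend_raw(park)`. Assumptions of the sketch: the names ToyScheduler and IDLE are hypothetical, spawn_raw drops the opts parameter, fd blocking is omitted, and the timer stores the ParkToken and reads park.handle only when it fires.

```python
import heapq

IDLE = -1  # hypothetical sentinel returned by poll() when no timer is pending

class Fiber:
    def __init__(self, gen):
        self.gen = gen
        self.state = "ready"  # ready | running | suspended | done

class ParkToken:
    def __init__(self):
        self.handle = None  # written by suspend_raw: the fiber to re-ready

class ToyScheduler:
    def __init__(self):
        self.clock_ms = 0   # virtual clock, advanced only when a timer fires
        self.ready_q = []
        self.timers = []    # heap of (deadline_ms, seq, park)
        self.seq = 0        # tie-breaker so equal deadlines never compare tokens
        self.current = None

    def spawn_raw(self, entry, arg):
        fiber = Fiber(entry(self, arg))   # generator is created, not yet run
        self.ready_q.append(fiber)
        return fiber                      # opaque handle

    def suspend_raw(self, park):
        park.handle = self.current
        self.current.state = "suspended"
        yield                             # control returns to poll()

    def ready(self, park):
        fiber = park.handle
        if fiber is not None and fiber.state == "suspended":
            fiber.state = "ready"
            self.ready_q.append(fiber)

    def arm_timer(self, deadline_ms, park):
        self.seq += 1
        timer = (deadline_ms, self.seq, park)
        heapq.heappush(self.timers, timer)
        return timer

    def now_ms(self):
        return self.clock_ms

    def poll(self, deadline_ms):
        # One iteration: drain the ready queue, then fire the earliest timer.
        while self.ready_q:
            fiber = self.ready_q.pop(0)
            self.current, fiber.state = fiber, "running"
            try:
                next(fiber.gen)
            except StopIteration:
                fiber.state = "done"
            self.current = None
        if self.timers and self.timers[0][0] <= deadline_ms:
            deadline, _, park = heapq.heappop(self.timers)
            self.clock_ms = max(self.clock_ms, deadline)
            self.ready(park)
        return self.timers[0][0] if self.timers else IDLE

    def run(self):
        # The explicit driver: poll until nothing is ready and no timer is armed.
        while self.ready_q or self.timers:
            self.poll(float("inf"))

log = []

def worker(io, arg):
    name, delay_ms = arg
    log.append((name, "start", io.now_ms()))
    park = ParkToken()
    io.arm_timer(io.now_ms() + delay_ms, park)
    yield from io.suspend_raw(park)
    log.append((name, "woke", io.now_ms()))

sched = ToyScheduler()
sched.spawn_raw(worker, ("a", 10))
sched.spawn_raw(worker, ("b", 5))
sched.run()
```

Both workers start at virtual time 0, then "b" wakes at 5 and "a" at 10, and the run is reproducible because no wall clock is involved.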

Status (2026-06-28)

  • Phase 3 — TRUE cancellation via suspend_raw -> !. DONE. A cancelled async worker now abandons its body at its next suspend instead of running to completion. Pieces:

    • Cancel-flag back-ref (D4 — back-ref pointer, chosen): SpawnOpts.cancel_flag: *void (core.sx) + Fiber.cancel_flag: *void (sched.sx), set from opts.cancel_flag in Scheduler.spawn_raw. async passes xx @f.canceled (the Future.canceled Atomic(bool) erased to *void).
    • Delivery: Scheduler.suspend_raw checks fiber_canceled(self.current) (a *Atomic(bool) load) PRE-park (raise without parking — no deadlock if cancel landed before the worker ran) and POST-resume (cancel landed while parked), raising error.Canceled (a bare -> !; set inferred). cancel(f) flips the sticky flag, marks .canceled, and ready(.{handle=f.task})s the worker.
    • Worker is failable Closure() -> ($R, !): the async completion closure f.value = worker() catch { … } (the captured-failable-closure-call the Phase-3-prereq fix enabled) marks .canceled/.failed and wakes the awaiter; the worker's post-suspend side effects never run. New failable io.sleep(ms) (arm_timer + try suspend_raw) is the cancellation point.
    • Compiler gap fixed: a -> ! fn whose only error source is try-ing a protocol method (io.suspend_raw) was wrongly flagged "declared ! but never errors". collectErrorSites (error_analysis.zig) now sets a dyn flag for a try of a non-identifier callee (opaque error channel), suppressing the warning.
    • Two UAFs found by adversarial review and FIXED: (1) cancel-before-park orphaned io.sleep's armed timer → suspend_raw's pre-park raise now evicts the current fiber's timer/waiter first. (2) cancel(f) woke a possibly-reaped worker → now only wakes when was_pending (.pending before the store).
    • Migrated 1805/1806/1824 to failable workers. Lock: examples/concurrency/1825-concurrency-fiber-cancel-suspend.sx (seq: 1 -99 — post-suspend line never runs). Validated byte-identical on aarch64-macOS host AND aarch64-linux container (1824 + 1825). Suite 853/0. Expected .ir churn (SpawnOpts layout) regenerated; no non-.ir snapshot changed.
  • Phase 3 PREREQUISITE — captured-failable-closure call typing. DONE. The async completion closure (b.run = () => { f.value = worker() catch {…} }) captures a failable worker and consumes its error channel; the free-variable capture analysis (collectCaptures in src/ir/lower/closure.zig) did not descend into the error-handling / context / asm / multi-assign nodes, so worker was never captured — inside the lambda it resolved against an empty scope and typed as .unresolved (catch/try then rejected it). Fixed: added try_expr, catch_expr, onfail_stmt, raise_stmt, multi_assign, push_stmt, comptime_expr, insert_expr, spread_expr, asm_expr arms to collectCaptures. Adversarially reviewed (captures resolve, locals correctly excluded, no false-positive captures, 851/0). Lock: example examples/closures/0314-closures-capture-failable-call.sx (catch + try over a captured failable closure; pure language feature, host-only). The push_stmt arm also fixes the previously-noted "free-var analysis doesn't descend into a nested push Context {…}" gap. Phase 3 is now unblocked.

    • Two PRE-EXISTING, orthogonal bugs surfaced during review (neither blocked Phase 3): (1) calling a closure stored in a struct data field typed as unresolved (value → garbage; failable → can't catch) — RESOLVED (issues/0201): CallResolver.plan gained a closure/fn-pointer field arm and the lowering closure-field arm now also handles bare .function fields; regression examples/closures/0315-closures-struct-field-call.sx. (2) asm write-through place through a deref (asm { … "+r" -> @(p.*) }) fails LLVM verification — repros with NO closure (independent of capture analysis); possibly an unsupported deref-place form rather than a confirmed bug, not filed.
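
The Phase 3 delivery rules (flag back-ref, PRE-park and POST-resume checks, timer eviction, the was_pending guard) can be modelled in a few lines of Python. This is a hedged sketch, not the sx code: generators stand in for fibers, Flag stands in for the Atomic(bool), and Sched, drain and the worker are hypothetical names. One assumption goes beyond the text: the sketch evicts the timer on the POST-resume path as well, whereas the notes above only state the pre-park eviction.

```python
from collections import deque

class Canceled(Exception):
    pass  # stand-in for error.Canceled

class Flag:
    def __init__(self):
        self.value = False  # stand-in for the Future's Atomic(bool)

class Fiber:
    def __init__(self, gen, cancel_flag):
        self.gen, self.state = gen, "ready"
        self.cancel_flag = cancel_flag  # back-ref set at spawn (D4)

class Future:
    def __init__(self):
        self.state, self.value = "pending", None  # pending | ready | canceled
        self.canceled, self.task = Flag(), None

class Sched:
    def __init__(self):
        self.ready_q, self.timers = deque(), {}  # timers: fiber -> deadline_ms
        self.clock_ms, self.current = 0, None

    def spawn_raw(self, body, cancel_flag):
        fiber = Fiber(body(), cancel_flag)
        self.ready_q.append(fiber)
        return fiber

    def raise_if_canceled(self, me):
        if me.cancel_flag.value:
            self.timers.pop(me, None)  # evict, so no armed timer is orphaned
            raise Canceled()

    def suspend_raw(self):
        me = self.current
        self.raise_if_canceled(me)     # PRE-park: raise without parking
        me.state = "suspended"
        yield
        self.raise_if_canceled(me)     # POST-resume: cancel landed while parked

    def sleep(self, ms):               # arm a timer, then suspend
        self.timers[self.current] = self.clock_ms + ms
        yield from self.suspend_raw()

    def ready(self, fiber):
        if fiber.state == "suspended":
            fiber.state = "ready"
            self.ready_q.append(fiber)

    def drain(self):
        while self.ready_q:
            fiber = self.ready_q.popleft()
            self.current, fiber.state = fiber, "running"
            try:
                next(fiber.gen)
            except StopIteration:
                fiber.state = "done"

    def run(self):
        while self.ready_q or self.timers:
            self.drain()
            if self.timers:
                fiber = min(self.timers, key=self.timers.get)
                self.clock_ms = max(self.clock_ms, self.timers.pop(fiber))
                self.ready(fiber)

def async_(io, worker):
    fut = Future()
    def completion():
        try:
            fut.value = yield from worker()
            fut.state = "ready"
        except Canceled:
            fut.state = "canceled"
    fut.task = io.spawn_raw(completion, fut.canceled)
    return fut

def cancel(io, fut):
    was_pending = fut.state == "pending"  # read before the store
    fut.canceled.value = True             # sticky flag
    if was_pending:                       # never wake a finished worker
        fut.state = "canceled"
        io.ready(fut.task)

io, log = Sched(), []

def worker():
    log.append("before suspend")
    yield from io.sleep(10)
    log.append("after suspend")           # must never run once cancelled
    return 123

parked = async_(io, worker)
io.drain()                                # worker is now parked in sleep
cancel(io, parked)                        # delivered POST-resume
early = async_(io, worker)
cancel(io, early)                         # delivered PRE-park
io.run()
```

Both futures end canceled, neither "after suspend" line runs, no timer is left armed, and the virtual clock never advances because neither sleep is allowed to elapse.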

Status (2026-06-27)

  • Phase 0: fibers inherit the spawn-time context. DONE (2f2d7f1d). Discovered during Phase 1: a fiber body ran under __sx_default_context (the abi(.c) fib_dispatch dropped the implicit context), so a scheduler installed as context.io was invisible inside a worker. Fixed: Scheduler.spawn snapshots context → Fiber.dctx; fib_dispatch re-pushes it. Behavior-preserving (suite 828/0), no cross-fiber leak (context is parameter-threaded per stack). Lock: example 1822.
  • Phase 1 — impl Io for Scheduler. DONE (5c30bfe0, hardened da7dd1f1). Six methods over the fiber primitives; spawn_raw bridges the erased (*void)->void worker thunk via an fn-ptr round-trip. Lock: example 1823 (spawn→arm→suspend→ready→resume entirely through context.io, deterministic). Adversarial review fixed: arm_timer/spawn_raw null guards, poll fd-pending abort + deadline_ms doc, stale fib_dispatch comment.
  • Resolved design decisions: D1 = direct impl Io for Scheduler (chosen). D2 = now_ms returns the virtual clock_ms (deterministic) — a real-clock variant is later. D4 = deferred to Phase 3.
  • Phase 2 — async/await colorblind over the fiber Io. DONE (967aed67, hardened ada8d162). async heap-allocs a *Future, boxes a completion closure in a monomorphic ThunkBox, and submits via io.spawn_raw (inline under CBlockingIo, a fiber under the scheduler); await parks via suspend_raw until ready. Protocol changed to suspend_raw(park: *ParkToken) (write-back of the awaiter). Workers are nullary (call-site capture). Migrated 1805/1806; adopted push .{ … }. Lock: example 1824 (deferral visible: 1 2 10 20 123). Review fixed: one-awaiter await guard; documented the Future allocator-lifetime contract + that cancel doesn't stop an already-spawned worker (Phase 3).
    • Resolved D2 (ParkToken): suspend_raw(*ParkToken) write-back (chosen over a registry). ready() liveness (CONCERN 6): safe for single async/await (awaiter is suspended, not reaped, when readied); race fan-in must still deregister (Phase 4).
    • Carried to convergence: async should capture the scheduler's long-lived allocator (like sched.go's own_allocator) instead of the call-site context.allocator — needs a protocol affordance; documented as a contract for now.
  • Open for later phases:
    • ParkToken↔fiber binding. ready(park) needs park.handle = the awaiter *Fiber. The scheduler knows self.current at suspend; the cleanest is suspend_raw(park: *ParkToken) writing park.handle = self.current before parking (a small protocol change: the materializer installs thunks by name/order, signature-agnostic — verified low-risk). Decide vs a token→fiber registry.
    • ready() liveness (review CONCERN 6). Casting a stale/reaped *Fiber handle and wake-ing it is a latent UAF once real await runs — wake's .suspended value-check on freed bytes is luck, not safety. Phase 2 must guarantee single-ready / deregistration (mirror the bespoke-race deregister).
  • Out-of-scope compiler bug found by review (not filed yet): closure free-var analysis does not descend into a nested push Context {…} block inside a closure body — a var used only there reports unresolved. Phase 0 sidesteps it (capture is at the Fiber level, not via closure), so it does NOT block the unification; worth an issues/ entry in a separate session.
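
The Phase 2 shape described above (a Future born pending, a completion closure submitted through spawn_raw, await parking via a ParkToken that suspend_raw writes back) can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the sx code: generators stand in for fibers, FiberIo and yield_now are hypothetical names, and the ThunkBox boxing and allocator contract are left out.

```python
from collections import deque

class ParkToken:
    def __init__(self):
        self.handle = None  # written by suspend_raw: the awaiting fiber

class Fiber:
    def __init__(self, gen):
        self.gen, self.state = gen, "ready"

class Future:
    def __init__(self):
        self.state, self.value = "pending", None
        self.park, self.awaited = ParkToken(), False

class FiberIo:
    def __init__(self):
        self.ready_q, self.current = deque(), None

    def spawn_raw(self, body):
        fiber = Fiber(body())          # deferred: nothing runs until run()
        self.ready_q.append(fiber)
        return fiber

    def suspend_raw(self, park):
        park.handle = self.current     # write-back of the awaiter
        self.current.state = "suspended"
        yield

    def yield_now(self):               # hypothetical helper: requeue and yield
        self.current.state = "ready"
        self.ready_q.append(self.current)
        yield

    def ready(self, park):
        fiber = park.handle
        if fiber is not None and fiber.state == "suspended":
            fiber.state = "ready"
            self.ready_q.append(fiber)

    def run(self):
        while self.ready_q:
            fiber = self.ready_q.popleft()
            self.current, fiber.state = fiber, "running"
            try:
                next(fiber.gen)
            except StopIteration:
                fiber.state = "done"

def async_(io, worker):
    fut = Future()                     # born pending, not ready
    def completion():
        fut.value = yield from worker()
        fut.state = "ready"
        io.ready(fut.park)             # no-op if nobody has parked yet
    io.spawn_raw(completion)
    return fut

def await_(io, fut):
    if fut.awaited:
        raise RuntimeError("a Future supports a single awaiter")
    fut.awaited = True
    if fut.state != "ready":
        yield from io.suspend_raw(fut.park)
    return fut.value

io, log = FiberIo(), []

def make_worker(name, result):
    def worker():
        log.append(name + " runs")
        yield from io.yield_now()      # lets the other worker interleave
        return result
    return worker

def root():
    fa = async_(io, make_worker("a", 10))
    fb = async_(io, make_worker("b", 20))
    log.append("root spawned both")    # logged before either worker runs
    total = (yield from await_(io, fa)) + (yield from await_(io, fb))
    log.append(total)

io.spawn_raw(root)
io.run()
```

The root logs before either worker has run, which is the deferral that distinguishes this path from the inline CBlockingIo one. The second await finds its Future already ready and does not park, so the ready() call against an unwritten ParkToken is harmless.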

Phases (each: implement → lock with an example → zig build test green → both platforms)

  1. impl Io for Scheduler (the vehicle). Implement the six methods over the fiber primitives. Add a Fiber.canceled/task back-ref so suspend_raw can raise on resume. Keep CBlockingIo intact. Lock: install the fiber Io into context.io, run a root fiber that suspend_raws and is ready()'d — asserts real park/resume through the protocol (not inline). Bridge (the one fiddly bit): async's generic Closure(..$args) -> $R worker → spawn_raw's raw entry/arg. Box the worker thunk on the heap; entry is a C-ABI (env: *void) -> void invoke-thunk (mirrors fib_dispatch), arg is the env.

  2. async/await over the fiber Io (real interleaving). Under a suspending Io, async calls spawn_raw and returns a PENDING Future($R) (no longer born .ready); the spawned body fills f.value/f.state and ready(f.park)s the awaiter. await(f) checks .ready else suspend_raw(f.park) then returns/raises — the suspending sibling of today's immediate await. CBlockingIo keeps the run-inline path (degenerate, still correct). Lock: two context.io.async tasks interleave under the fiber Io (the io.sx layer, replacing the bespoke sched.go).

  3. True cancellation via suspend_raw -> !. cancel(f) flips f.canceled AND ready(f.park)s / wakes the worker fiber so its NEXT suspend_raw raises IoErr.Canceled. The worker's suspends (await, a future io.sleep) propagate via try/!; the worker body unwinds, the future ends .canceled, its post-cancel side-effects DON'T run. This is the model-A "true cancellation" — now delivered through the protocol, not bespoke. Lock: a cancelled task's work stops at its next suspend (assert via a shared log: the post-suspend line never prints).

  4. race over Futures: context.io.race((a: fa, b: fb)). Re-home the proven race logic (winner scan, deregister-all-on-wake, structured cancel+join of losers) from sched.race(*Task tuple) onto *Future handles + the Io protocol. The type-level machinery ports UNCHANGED (RaceResult($T), make_variant, the tuple reflection (GAP 1/2, all landed)); only the runtime swaps *Task → *Future and suspend_self → suspend_raw/ready. Cancellation of losers now uses Phase 3 (their next suspend raises), so race returns at WINNER-time, not slowest-loser-time. Lock: re-point 1821 at context.io.race; assert winner value + losers' work stopped (not merely flagged).

  5. Converge: retire the bespoke fiber async API. Fold sched.go/wait/cancel/race into the io.sx layer; Scheduler stays as the fiber Io's engine + driver. Migrate 1811–1821 to the context.io API. One async stack, all behind the protocol. Update the roadmap/checkpoints.
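
The Phase 4 race semantics (winner scan, cancel and join the losers, losers stop at their next suspend) have a close analogue in Python's asyncio, which is used here purely as an illustration. This is not the context.io.race API: the helper names race and sleeper are hypothetical, real sleeps replace the virtual clock, and asyncio.wait with FIRST_COMPLETED plays the role of the fan-in park.

```python
import asyncio

log = []

async def sleeper(name, seconds):
    log.append(name + " started")
    await asyncio.sleep(seconds)      # suspension point: cancellation lands here
    log.append(name + " finished")    # never runs for a cancelled loser
    return name

async def race(named):
    # Start every contender, remembering its label in declaration order.
    tasks = {asyncio.create_task(coro): label for label, coro in named.items()}
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Structured cancel + join: losers raise at their current suspend, and the
    # gather waits only until they have unwound.
    for loser in pending:
        loser.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    # Winner scan in declaration order, in case several finished together.
    winner = next(task for task in tasks if task in done)
    return tasks[winner], winner.result()

result = asyncio.run(race({"a": sleeper("slow", 0.5), "b": sleeper("fast", 0.01)}))
```

The race returns as soon as the fast contender completes. The slow one is cancelled inside its sleep, so its post-suspend log line never appears, which is the "work stopped, not merely flagged" property the Phase 4 lock asserts.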

Open decisions (need a call before/within the phase noted)

  • D1 (Phase 1) — impl Io for Scheduler vs a FiberIo wrapper. Direct impl makes context.io BE the scheduler (xx scheduler as the Io value, stateful receiver — mirrors the allocator xx local rule). A wrapper adds a level but decouples the public Io vtable from the scheduler internals. Lean: direct impl (simplest, matches the allocator convention).
  • D2 (Phase 1) — virtual vs real clock under the fiber Io. Tests need the deterministic virtual clock (clock_ms); a real deployment wants time.mono_ms. Thread it as a Scheduler mode, or two Io impls (FiberIo virtual-clock for tests, real-clock for prod). Lean: a clock: enum { virtual; real } field so one impl serves both; tests pin .virtual.
  • D3 (Phase 2) — Future(void) (issue 0150 SIGTRAP). A void-result task can't build Future(void) today. Defer (race/async target non-void), or fix the void struct-field path. Lean: defer, gate with a diagnostic.
  • D4 (Phase 3) — where the cancel flag lives. The Future already has canceled: Atomic(bool); the fiber needs to reach it from suspend_raw. Give Fiber a *Atomic(bool) back-ref to its future's flag (set at spawn_raw), so suspend_raw consults it with no per-suspend lookup. Lean: back-ref pointer.
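
The D2 lean (one implementation with a clock mode, tests pinned to the virtual clock) could look roughly like the Python sketch below. The names Clock, SchedulerClock and advance_to are hypothetical, and time.monotonic_ns stands in for time.mono_ms. Per the status notes, only the virtual clock is implemented so far, so the real branch is an assumption about the later variant.

```python
import enum
import time

class Clock(enum.Enum):
    VIRTUAL = "virtual"
    REAL = "real"

class SchedulerClock:
    def __init__(self, mode):
        self.mode = mode
        self.clock_ms = 0  # only meaningful in VIRTUAL mode

    def now_ms(self):
        if self.mode is Clock.VIRTUAL:
            return self.clock_ms
        return time.monotonic_ns() // 1_000_000

    def advance_to(self, deadline_ms):
        # Timer expiry moves virtual time forward; real time advances by itself.
        if self.mode is Clock.VIRTUAL:
            self.clock_ms = max(self.clock_ms, deadline_ms)

test_clock = SchedulerClock(Clock.VIRTUAL)
test_clock.advance_to(250)
prod_clock = SchedulerClock(Clock.REAL)
```

A test that pins Clock.VIRTUAL sees exactly the deadlines it armed (here 250), while the real mode returns a monotonic millisecond count that differs on every run.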

Validation (every phase)

  • zig build && zig build test green (full corpus).
  • New/changed 18xx examples byte-identical on aarch64-macOS host AND aarch64-linux container (deterministic virtual clock).
  • Adversarial review of each phase (worker + read-only reviewer), per the session workflow.

What this supersedes

  • sched.sx's bespoke go/wait/cancel/race (Phase 5 retires them; the proven logic moves onto the protocol). The just-landed race (commit 9099735e) is the reference logic for Phase 4, not the final home.
  • PLAN-RACE.md's "race on sched.Scheduler" framing — this plan moves it onto context.io per the roadmap's §A5 / §4.6 design-of-record.