A cancelled async worker now abandons its body at its next suspend instead
of running to completion.
- Cancel-flag back-ref (D4): SpawnOpts.cancel_flag (core.sx) + Fiber.cancel_flag
(sched.sx), set from opts.cancel_flag in Scheduler.spawn_raw; async passes
xx @f.canceled (the Future.canceled Atomic(bool) erased to *void).
- Delivery: Scheduler.suspend_raw consults fiber_canceled(self.current) PRE-park
(raise without parking — no deadlock if cancel landed before the worker ran)
and POST-resume (cancel landed while parked), raising error.Canceled.
cancel(f) flips the sticky flag, marks .canceled, and wakes the worker.
- async worker is failable Closure() -> ($R, !); the completion closure
f.value = worker() catch {…} marks .canceled/.failed and wakes the awaiter,
so post-suspend side effects never run. New failable io.sleep(ms) is the
cancellation point.
- Compiler: a -> ! fn whose only error source is try-ing a protocol method
(io.suspend_raw) was wrongly flagged 'declared ! but never errors';
collectErrorSites now marks a try of a non-identifier callee as a dynamic
(opaque) error source, suppressing the warning.
- Two UAFs found by adversarial review and fixed: (1) cancel-before-park
orphaned io.sleep's armed timer — suspend_raw's pre-park raise now evicts the
current fiber's timer/waiter first; (2) cancel(f) could wake a reaped worker —
now only wakes when was_pending.
Migrated 1805/1806/1824 to failable workers. Lock: example 1825 (seq: 1 -99,
post-suspend line never runs); byte-identical on aarch64-macOS + aarch64-linux.
.ir churn is the SpawnOpts layout change (type-table string renumbering).
PLAN-IO-UNIFY — fold the fiber scheduler behind context.io, re-home race
Why
Today there are two parallel async stacks:
| stack | behind context.io? | real suspension? | cancellation channel |
|---|---|---|---|
| io.sx async/await/cancel/Future | yes (impl Io for CBlockingIo) | no (runs the worker inline to completion) | suspend_raw -> ! / IoErr.Canceled (designed, unused) |
| sched.sx go/wait/cancel/race (just landed) | no | yes (swap_context fibers) | none (suspend_self -> void) |
context.io is structurally Zig's std.Io (an Io protocol carried implicitly in Context — better
ergonomics than Zig's explicit io: param), and the roadmap (§A5, §4.6) already says the fiber
scheduler should be one of its Io vtables and that race is context.io.race(..) over Futures.
The just-landed race on sched.Scheduler over *Task is the proven LOGIC at the wrong LAYER.
Goal: make the fiber Scheduler an impl Io, lift async/await/cancel/race onto the Io
protocol so they run colorblind under either impl, and let cancellation fall out of the existing
suspend_raw -> ! contract (the "true cancellation, model A" the user picked — already the interface's
design). One async stack, behind context.io.
The fiber → Io mapping (the crux)
Io :: protocol { spawn_raw, suspend_raw -> !, ready, poll, now_ms, arm_timer } (core.sx). Map each onto
the existing fiber primitives in sched.sx (spawn/suspend_self/wake/sleep/block_on_fd/run):
| Io method | fiber realization |
|---|---|
| spawn_raw(entry, arg, opts) -> *void | spawn a fiber whose body invokes entry(arg) (raw C-ABI thunk, not a closure; see Bridge below). Returns the *Fiber as the opaque handle. |
| suspend_raw(park) -> ! | suspend_self(), then on resume CHECK the current task's cancel flag and raise IoErr.Canceled if set. park.handle = the *Fiber to re-ready. This is the cancellation delivery point. |
| ready(park) | wake(park.handle as *Fiber) (already guarded on .suspended). |
| arm_timer(deadline_ms, park) -> *void | arm a Timer{deadline, fiber=park.handle} (today's sleep minus the self-suspend); return the timer handle so a cancel can evict it. |
| poll(deadline_ms) -> i64 | ONE iteration of the run loop: drain ready, then fire the earliest timer / block on fds up to deadline_ms. Returns the next pending deadline (or sentinel when idle). |
| now_ms() -> i64 | the virtual clock_ms (deterministic), NOT a wall clock; keeps 1817/1821-style tests reproducible. |
Scheduler.run() stays as the explicit DRIVER (the top-level loop that calls poll to quiescence),
installed via push Context { io = xx scheduler } { … s.run(); } — exactly the existing sched examples,
just with the scheduler now reachable as context.io.
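
The mapping above can be sketched outside sx. The following is a minimal Python model, not the sched.sx implementation: generators stand in for fibers, `yield` stands in for suspend_raw parking, and the names ToyScheduler, Fiber.body, ready_q and sleeper are hypothetical. It omits fds, cancellation and the deadline_ms argument to poll, and uses -1 as an assumed idle sentinel. What it does show is the shape of the contract: timers advance a virtual clock, ready only wakes a parked fiber, and run drives poll to quiescence.

```python
import heapq
from dataclasses import dataclass
from typing import Any, Callable, Generator, List, Optional, Tuple


@dataclass(eq=False)
class Fiber:
    body: Generator          # a generator stands in for a fiber stack
    suspended: bool = False  # True only while parked


class ToyScheduler:
    """Hypothetical stand-in for the fiber Scheduler acting as an Io."""

    def __init__(self) -> None:
        self.clock_ms = 0                                # virtual clock
        self.ready_q: List[Fiber] = []
        self.timers: List[Tuple[int, int, Fiber]] = []   # (deadline, seq, fiber)
        self.current: Optional[Fiber] = None
        self._seq = 0                                    # heap tie-breaker

    def spawn_raw(self, entry: Callable[[Any], Generator], arg: Any) -> Fiber:
        fiber = Fiber(entry(arg))       # creating a generator runs nothing yet
        self.ready_q.append(fiber)
        return fiber                    # opaque handle

    def ready(self, fiber: Fiber) -> None:
        if fiber.suspended:             # guard: only wake a parked fiber
            fiber.suspended = False
            self.ready_q.append(fiber)

    def arm_timer(self, deadline_ms: int, fiber: Fiber) -> Tuple[int, int, Fiber]:
        self._seq += 1
        handle = (deadline_ms, self._seq, fiber)
        heapq.heappush(self.timers, handle)
        return handle                   # a cancel could evict this entry

    def now_ms(self) -> int:
        return self.clock_ms            # never a wall clock

    def poll(self) -> int:
        """One loop iteration: drain ready, then fire the earliest timer."""
        while self.ready_q:
            fiber = self.ready_q.pop(0)
            self.current = fiber
            try:
                next(fiber.body)        # run until the body yields (parks)
                fiber.suspended = True
            except StopIteration:
                pass                    # body finished
            self.current = None
        if self.timers:
            deadline, _, fiber = heapq.heappop(self.timers)
            self.clock_ms = max(self.clock_ms, deadline)
            self.ready(fiber)
        return self.timers[0][0] if self.timers else -1

    def run(self) -> None:
        while self.ready_q or self.timers:
            self.poll()


def sleeper(args):
    sched, name, ms, log = args
    log.append((name, "start", sched.now_ms()))
    sched.arm_timer(sched.now_ms() + ms, sched.current)
    yield                               # park until the timer readies us
    log.append((name, "woke", sched.now_ms()))


def demo():
    sched = ToyScheduler()
    log = []
    sched.spawn_raw(sleeper, (sched, "a", 30, log))
    sched.spawn_raw(sleeper, (sched, "b", 10, log))
    sched.run()
    return log
```

Because the clock only moves when a timer fires, demo() yields the same log on every run and every host, which is the property the virtual clock_ms is meant to preserve.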
Status (2026-06-28)
- Phase 3: TRUE cancellation via suspend_raw -> !. DONE. A cancelled async worker now abandons its body at its next suspend instead of running to completion. Pieces:
  - Cancel-flag back-ref (D4: back-ref pointer, chosen): SpawnOpts.cancel_flag: *void (core.sx) + Fiber.cancel_flag: *void (sched.sx), set from opts.cancel_flag in Scheduler.spawn_raw. async passes xx @f.canceled (the Future.canceled Atomic(bool) erased to *void).
  - Delivery: Scheduler.suspend_raw checks fiber_canceled(self.current) (a *Atomic(bool) load) PRE-park (raise without parking, so no deadlock if cancel landed before the worker ran) and POST-resume (cancel landed while parked), raising error.Canceled (a bare -> !; set inferred). cancel(f) flips the sticky flag, marks .canceled, and ready(.{handle=f.task})s the worker.
  - Worker is failable Closure() -> ($R, !): the async completion closure f.value = worker() catch { … } (the captured-failable-closure-call the Phase-3-prereq fix enabled) marks .canceled/.failed and wakes the awaiter; the worker's post-suspend side effects never run. New failable io.sleep(ms) (arm_timer + try suspend_raw) is the cancellation point.
  - Compiler gap fixed: a -> ! fn whose only error source is try-ing a protocol method (io.suspend_raw) was wrongly flagged "declared ! but never errors". collectErrorSites (error_analysis.zig) now sets a dyn flag for a try of a non-identifier callee (opaque error channel), suppressing the warning.
  - Two UAFs found by adversarial review and FIXED: (1) cancel-before-park orphaned io.sleep's armed timer → suspend_raw's pre-park raise now evicts the current fiber's timer/waiter first. (2) cancel(f) woke a possibly-reaped worker → now only wakes when was_pending (.pending before the store).
  - Migrated 1805/1806/1824 to failable workers. Lock: examples/concurrency/1825-concurrency-fiber-cancel-suspend.sx (seq: 1 -99; post-suspend line never runs). Validated byte-identical on aarch64-macOS host AND aarch64-linux container (1824 + 1825). Suite 853/0. Expected .ir churn (SpawnOpts layout) regenerated; no non-.ir snapshot changed.
- Phase 3 PREREQUISITE: captured-failable-closure call typing. DONE. The async completion closure (b.run = () => { f.value = worker() catch {…} }) captures a failable worker and consumes its error channel; the free-variable capture analysis (collectCaptures in src/ir/lower/closure.zig) did not descend into the error-handling / context / asm / multi-assign nodes, so worker was never captured: inside the lambda it resolved against an empty scope and typed as .unresolved (catch/try then rejected it). Fixed: added try_expr, catch_expr, onfail_stmt, raise_stmt, multi_assign, push_stmt, comptime_expr, insert_expr, spread_expr, asm_expr arms to collectCaptures. Adversarially reviewed (captures resolve, locals correctly excluded, no false-positive captures, 851/0). Lock: example examples/closures/0314-closures-capture-failable-call.sx (catch + try over a captured failable closure; pure language feature, host-only). The push_stmt arm also fixes the previously-noted "free-var analysis doesn't descend into a nested push Context {…}" gap. Phase 3 is now unblocked.
  - Two PRE-EXISTING, orthogonal bugs surfaced during review (neither blocked Phase 3): (1) calling a closure stored in a struct data field typed as unresolved (value → garbage; failable → can't catch). RESOLVED (issues/0201): CallResolver.plan gained a closure/fn-pointer field arm and the lowering closure-field arm now also handles bare .function fields; regression examples/closures/0315-closures-struct-field-call.sx. (2) asm write-through place through a deref (asm { … "+r" -> @(p.*) }) fails LLVM verification; repros with NO closure (independent of capture analysis); possibly an unsupported deref-place form rather than a confirmed bug, not filed.
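
The Phase 3 delivery rule (check the flag before parking and again after resuming) can be illustrated with a small Python model. This is a sketch under assumptions, not the sx code: a generator plays the worker fiber, `yield` plays the park, and CancelFlag, Canceled, suspend_raw, worker and run_async are hypothetical stand-ins for Atomic(bool), error.Canceled, Scheduler.suspend_raw, an async worker and the completion closure respectively.

```python
class Canceled(Exception):
    """Stand-in for error.Canceled."""


class CancelFlag:
    """Stand-in for the Future's sticky Atomic(bool)."""

    def __init__(self) -> None:
        self.is_set = False


def suspend_raw(flag):
    """Park the calling generator; raise Canceled pre-park or post-resume."""
    if flag.is_set:        # PRE-park: cancel landed before the worker parked
        raise Canceled()
    yield                  # park; the driver resumes us with next()
    if flag.is_set:        # POST-resume: cancel landed while parked
        raise Canceled()


def worker(flag, log):
    log.append(1)
    yield from suspend_raw(flag)   # the cancellation point (like io.sleep)
    log.append(2)                  # post-suspend side effect
    return 42


def run_async(cancel_before_run, cancel_while_parked):
    """Plays the completion closure: run the worker, record how it ended."""
    flag, log = CancelFlag(), []
    state, value = "pending", None
    body = worker(flag, log)
    if cancel_before_run:
        flag.is_set = True
    try:
        next(body)                 # run to the first park
        if cancel_while_parked:
            flag.is_set = True     # cancel(f): flip the flag, then wake
        next(body)                 # resume the parked worker
    except StopIteration as done:
        state, value = "ready", done.value
    except Canceled:
        state = "canceled"
    return state, log, value
```

In both cancel cases the log stops at `[1]`: the line after the suspend never runs, and the pre-park check means a worker cancelled before it ever ran still terminates instead of parking forever.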
Status (2026-06-27)
- Phase 0: fibers inherit the spawn-time context. DONE (2f2d7f1d). Discovered during Phase 1: a fiber body ran under __sx_default_context (the abi(.c) fib_dispatch dropped the implicit context), so a scheduler installed as context.io was invisible inside a worker. Fixed: Scheduler.spawn snapshots context → Fiber.dctx; fib_dispatch re-pushes it. Behavior-preserving (suite 828/0), no cross-fiber leak (context is parameter-threaded per stack). Lock: example 1822.
- Phase 1: impl Io for Scheduler. DONE (5c30bfe0, hardened da7dd1f1). Six methods over the fiber primitives; spawn_raw bridges the erased (*void)->void worker thunk via an fn-ptr round-trip. Lock: example 1823 (spawn→arm→suspend→ready→resume entirely through context.io, deterministic). Adversarial review fixed: arm_timer/spawn_raw null guards, poll fd-pending abort + deadline_ms doc, stale fib_dispatch comment.
- Resolved design decisions: D1 = direct impl Io for Scheduler (chosen). D2 = now_ms returns the virtual clock_ms (deterministic); a real-clock variant is later. D4 = deferred to Phase 3.
- Phase 2: async/await colorblind over the fiber Io. DONE (967aed67, hardened ada8d162). async heap-allocs a *Future, boxes a completion closure in a monomorphic ThunkBox, and submits via io.spawn_raw (inline under CBlockingIo, a fiber under the scheduler); await parks via suspend_raw until ready. Protocol changed to suspend_raw(park: *ParkToken) (write-back of the awaiter). Workers are nullary (call-site capture). Migrated 1805/1806; adopted push .{ … }. Lock: example 1824 (deferral visible: 1 2 10 20 123). Review fixed: one-awaiter await guard; documented the Future allocator-lifetime contract + that cancel doesn't stop an already-spawned worker (Phase 3).
  - Resolved D2 (ParkToken): suspend_raw(*ParkToken) write-back (chosen over a registry). ready() liveness (CONCERN 6): safe for single async/await (awaiter is suspended, not reaped, when readied); race fan-in must still deregister (Phase 4).
  - Carried to convergence: async should capture the scheduler's long-lived allocator (like sched.go's own_allocator) instead of the call-site context.allocator; needs a protocol affordance; documented as a contract for now.
- Open for later phases:
  - ParkToken↔fiber binding. ready(park) needs park.handle = the awaiter *Fiber. The scheduler knows self.current at suspend; the cleanest is suspend_raw(park: *ParkToken) writing park.handle = self.current before parking (a small protocol change: the materializer installs thunks by name/order, signature-agnostic; verified low-risk). Decide vs a token→fiber registry.
  - ready() liveness (review CONCERN 6). Casting a stale/reaped *Fiber handle and wake-ing it is a latent UAF once real await runs: wake's .suspended value-check on freed bytes is luck, not safety. Phase 2 must guarantee single-ready / deregistration (mirror the bespoke-race deregister).
- Out-of-scope compiler bug found by review (not filed yet): closure free-var analysis does not descend into a nested push Context {…} block inside a closure body; a var used only there reports unresolved. Phase 0 sidesteps it (capture is at the Fiber level, not via closure), so it does NOT block the unification; worth an issues/ entry in a separate session.
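
To make "colorblind" concrete, here is a small Python sketch in which the same async_/await_ code runs under two Io stand-ins. Everything in it is hypothetical (BlockingIo, DeferredIo, wait_until, async_, await_, demo), and it simplifies one thing substantially: the deferred Io has no fibers, so its await drives the queue instead of parking the caller the way suspend_raw does. It only illustrates that the call site does not change while the observable ordering does.

```python
class Future:
    def __init__(self) -> None:
        self.state = "pending"
        self.value = None


class BlockingIo:
    """Runs the completion thunk inline, like the run-inline path."""

    def spawn_raw(self, thunk) -> None:
        thunk()

    def wait_until(self, is_ready) -> None:
        if not is_ready():
            raise RuntimeError("inline Io: future should already be ready")


class DeferredIo:
    """Queues thunks; stands in for a suspending Io (no real fibers here)."""

    def __init__(self) -> None:
        self.queue = []

    def spawn_raw(self, thunk) -> None:
        self.queue.append(thunk)

    def wait_until(self, is_ready) -> None:
        while not is_ready():
            self.queue.pop(0)()      # drive queued work until the awaiter is ready


def async_(io, worker):
    future = Future()

    def completion():                # fills the future, like the completion closure
        future.value = worker()
        future.state = "ready"

    io.spawn_raw(completion)
    return future


def await_(io, future):
    io.wait_until(lambda: future.state == "ready")
    return future.value


def demo(io):
    log = []

    def work(tag, result):
        def worker():
            log.append(tag)
            return result
        return worker

    fa = async_(io, work("a", 10))
    log.append("spawned")
    fb = async_(io, work("b", 20))
    return await_(io, fa) + await_(io, fb), log
```

Under BlockingIo the worker "a" has already run by the time "spawned" is logged; under DeferredIo it runs only once something awaits, which is the kind of visible deferral example 1824 locks.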
Phases (each: implement → lock with an example → zig build test green → both platforms)
- impl Io for Scheduler (the vehicle). Implement the six methods over the fiber primitives. Add a Fiber.canceled/task back-ref so suspend_raw can raise on resume. Keep CBlockingIo intact. Lock: install the fiber Io into context.io, run a root fiber that suspend_raws and is ready()'d; asserts real park/resume through the protocol (not inline). Bridge (the one fiddly bit): async's generic Closure(..$args) -> $R worker → spawn_raw's raw entry/arg. Box the worker thunk on the heap; entry is a C-ABI (env: *void) -> void invoke-thunk (mirrors fib_dispatch), arg is the env.
- async/await over the fiber Io (real interleaving). Under a suspending Io, async calls spawn_raw and returns a PENDING Future($R) (no longer born .ready); the spawned body fills f.value/f.state and ready(f.park)s the awaiter. await(f) checks .ready else suspend_raw(f.park) then returns/raises: the suspending sibling of today's immediate await. CBlockingIo keeps the run-inline path (degenerate, still correct). Lock: two context.io.async tasks interleave under the fiber Io (the io.sx layer, replacing the bespoke sched.go).
- True cancellation via suspend_raw -> !. cancel(f) flips f.canceled AND ready(f.park)s / wakes the worker fiber so its NEXT suspend_raw raises IoErr.Canceled. The worker's suspends (await, a future io.sleep) propagate via try/!; the worker body unwinds, the future ends .canceled, its post-cancel side-effects DON'T run. This is the model-A "true cancellation", now delivered through the protocol, not bespoke. Lock: a cancelled task's work stops at its next suspend (assert via a shared log: the post-suspend line never prints).
- race over Futures: context.io.race((a: fa, b: fb)). Re-home the proven race logic (winner scan, deregister-all-on-wake, structured cancel+join of losers) from sched.race(*Task tuple) onto *Future handles + the Io protocol. The type-level machinery ports UNCHANGED (RaceResult($T), make_variant, the tuple reflection (GAP 1/2, all landed)); only the runtime swaps *Task → *Future and suspend_self → suspend_raw/ready. Cancellation of losers now uses Phase 3 (their next suspend raises), so race returns at WINNER-time, not slowest-loser-time. Lock: re-point 1821 at context.io.race; assert winner value + losers' work stopped (not merely flagged).
- Converge: retire the bespoke fiber async API. Fold sched.go/wait/cancel/race into the io.sx layer; Scheduler stays as the fiber Io's engine + driver. Migrate 1811–1821 to the context.io API. One async stack, all behind the protocol. Update the roadmap/checkpoints.
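
The race phase's runtime behavior (winner scan, then structured cancel+join of the losers, returning at winner-time) has a close analogue in Python's asyncio, sketched below. This swaps in asyncio as a stand-in and is not context.io.race: the race, worker and demo functions are hypothetical, asyncio.sleep plays the cancellation point, and the typed RaceResult machinery is not modeled.

```python
import asyncio


async def race(**workers):
    """Return (winner_name, value); cancel and join the losers first."""
    names = {asyncio.create_task(coro): name for name, coro in workers.items()}
    done, pending = await asyncio.wait(names, return_when=asyncio.FIRST_COMPLETED)
    for loser in pending:
        loser.cancel()               # their next suspend raises CancelledError
    # Structured join: wait for every loser to finish unwinding.
    await asyncio.gather(*pending, return_exceptions=True)
    for task, name in names.items():  # winner scan in declaration order
        if task in done:
            return name, task.result()
    raise RuntimeError("unreachable: asyncio.wait returned no finished task")


async def worker(name, delay_s, log):
    log.append(f"{name}:start")
    await asyncio.sleep(delay_s)     # cancellation point
    log.append(f"{name}:done")       # never runs for a cancelled loser
    return name.upper()


def demo():
    log = []

    async def main():
        return await race(a=worker("a", 0.5, log), b=worker("b", 0.01, log))

    return asyncio.run(main()), log
```

Here demo() returns once "b" finishes rather than after the 0.5 s loser, and the log contains no "a:done" entry, which mirrors the lock's requirement that losers' work is stopped, not merely flagged.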
Open decisions (need a call before/within the phase noted)
- D1 (Phase 1): impl Io for Scheduler vs a FiberIo wrapper. Direct impl makes context.io BE the scheduler (xx scheduler as the Io value, stateful receiver; mirrors the allocator xx local rule). A wrapper adds a level but decouples the public Io vtable from the scheduler internals. Lean: direct impl (simplest, matches the allocator convention).
- D2 (Phase 1): virtual vs real clock under the fiber Io. Tests need the deterministic virtual clock (clock_ms); a real deployment wants time.mono_ms. Thread it as a Scheduler mode, or two Io impls (FiberIo virtual-clock for tests, real-clock for prod). Lean: a clock: enum { virtual; real } field so one impl serves both; tests pin .virtual.
- D3 (Phase 2): Future(void) (issue 0150 SIGTRAP). A void-result task can't build Future(void) today. Defer (race/async target non-void), or fix the void struct-field path. Lean: defer, gate with a diagnostic.
- D4 (Phase 3): where the cancel flag lives. The Future already has canceled: Atomic(bool); the fiber needs to reach it from suspend_raw. Give Fiber a *Atomic(bool) back-ref to its future's flag (set at spawn_raw), so suspend_raw consults it with no per-suspend lookup. Lean: back-ref pointer.
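
The D2 lean (one impl with a clock mode field) might look roughly like the Python sketch below. ClockMode, Clock and wait_until are hypothetical names, and Python's monotonic clock is an assumed stand-in for time.mono_ms; the point is only that a single type can serve deterministic tests and real deployments by switching one field.

```python
import enum
import time


class ClockMode(enum.Enum):
    VIRTUAL = "virtual"   # deterministic; advanced only by the scheduler
    REAL = "real"         # monotonic wall time


class Clock:
    """Hypothetical clock behind now_ms(); one type serves both modes."""

    def __init__(self, mode: ClockMode) -> None:
        self.mode = mode
        self.virtual_ms = 0

    def now_ms(self) -> int:
        if self.mode is ClockMode.VIRTUAL:
            return self.virtual_ms
        return time.monotonic_ns() // 1_000_000

    def wait_until(self, deadline_ms: int) -> None:
        """Reach a timer deadline: jump in virtual mode, sleep in real mode."""
        if self.mode is ClockMode.VIRTUAL:
            # Never move backwards if the deadline is already in the past.
            self.virtual_ms = max(self.virtual_ms, deadline_ms)
            return
        remaining_ms = deadline_ms - self.now_ms()
        if remaining_ms > 0:
            time.sleep(remaining_ms / 1000)
```

A test would construct Clock(ClockMode.VIRTUAL) and get identical timestamps on every run, while the same scheduler code under ClockMode.REAL would actually block until the deadline.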
Validation (every phase)
- zig build && zig build test green (full corpus).
- New/changed 18xx examples byte-identical on aarch64-macOS host AND aarch64-linux container (deterministic virtual clock).
- Adversarial review of each phase (worker + read-only reviewer), per the session workflow.
What this supersedes
- sched.sx's bespoke go/wait/cancel/race (Phase 5 retires them; the proven logic moves onto the protocol). The just-landed race (commit 9099735e) is the reference logic for Phase 4, not the final home.
- PLAN-RACE.md's "race on sched.Scheduler" framing: this plan moves it onto context.io per the roadmap's §A5 / §4.6 design-of-record.