Closes the documented per-spawn closure-env leak and most of the async leak,
using only the existing closure.env / closure.fn_ptr field accessors — no compiler
change. Also names the fat-pointer ABI in core.sx (ClosureRaw / SliceRaw) so the
underlying {fn_ptr, env} / {ptr, len} layout is discoverable in one place.
- Fiber body env: Scheduler.reap_fiber frees f.body.env via f.dctx.allocator (the
spawn-time allocator snapshotted in dctx) at all three reap sites (run/poll/
deinit). 1820's 'live after deinit' 3 -> 0.
- Async box + closure envs: sx_run_boxed_closure frees the ThunkBox, the
completion-closure env, and the worker's env (new ThunkBox.worker_env) the
instant the worker completes.
- Async Future: two-flag ownership — Future.worker_done (set at the end of the
completion closure) + consumed (set at the end of await); fut_release frees the
heap Future (via the captured Future.alloc) when BOTH are set, so the LAST of
{worker, await} reclaims it. await now CONSUMES the future (single-use; touching
it afterward is a use-after-free — documented). Residual for an AWAITED future
is 0 (lock: examples/concurrency/1827); a never-awaited future (fire-and-forget /
race loser) keeps only its Future struct — the structured-concurrency remainder.
Self-reviewed across orderings (await-after/before-complete, cancel-then-await,
cancel-while-parked, double-free via await+deinit, race residual, blocking impl,
cross-allocator reap) — all deterministic, no UAF/double-free. Suite 855/0;
byte-identical on aarch64-macOS + aarch64-linux; .ir churn is the core.sx +
Future/ThunkBox field additions.
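
A minimal sketch of the two-flag ownership rule described above, written in Python because it only illustrates the ordering logic. The `Future` class, `free_count`, and the helper names `worker_finished` / `await_finished` are hypothetical stand-ins, not the io.sx definitions; only the `worker_done` / `consumed` / `fut_release` names come from the text.

```python
class Future:
    """Toy stand-in for the heap Future; free_count models allocator frees."""

    def __init__(self):
        self.worker_done = False  # set at the end of the completion closure
        self.consumed = False     # set at the end of await
        self.free_count = 0       # hypothetical: counts reclaim calls


def fut_release(f):
    # Reclaim only when BOTH flags are set, so the later party frees.
    if f.worker_done and f.consumed:
        assert f.free_count == 0, "double free"
        f.free_count += 1


def worker_finished(f):
    f.worker_done = True
    fut_release(f)


def await_finished(f):
    f.consumed = True
    fut_release(f)


def demo():
    a = Future()          # worker side finishes first, await frees
    worker_finished(a)
    await_finished(a)

    b = Future()          # await side finishes first, worker frees
    await_finished(b)
    worker_finished(b)

    c = Future()          # never awaited: the struct stays as the residual
    worker_finished(c)
    return a.free_count, b.free_count, c.free_count
```

Under these assumptions each awaited future is reclaimed exactly once regardless of which flag is set last, and the never-awaited one is the only residual.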
PLAN-IO-UNIFY — fold the fiber scheduler behind context.io, re-home race
Why
Today there are two parallel async stacks:
| stack | behind context.io? | real suspension? | cancellation channel |
|---|---|---|---|
| io.sx async/await/cancel/Future | yes (impl Io for CBlockingIo) | no: runs the worker inline to completion | suspend_raw -> ! / IoErr.Canceled (designed, unused) |
| sched.sx go/wait/cancel/race (just landed) | no | yes (swap_context fibers) | none: suspend_self -> void |
context.io is structurally Zig's std.Io (an Io protocol carried implicitly in Context — better
ergonomics than Zig's explicit io: param), and the roadmap (§A5, §4.6) already says the fiber
scheduler should be one of its Io vtables and that race is context.io.race(..) over Futures.
The just-landed race on sched.Scheduler over *Task is the proven LOGIC at the wrong LAYER.
Goal: make the fiber Scheduler an impl Io, lift async/await/cancel/race onto the Io
protocol so they run colorblind under either impl, and let cancellation fall out of the existing
suspend_raw -> ! contract (the "true cancellation, model A" the user picked — already the interface's
design). One async stack, behind context.io.
The fiber → Io mapping (the crux)
Io :: protocol { spawn_raw, suspend_raw -> !, ready, poll, now_ms, arm_timer } (core.sx). Map each onto
the existing fiber primitives in sched.sx (spawn/suspend_self/wake/sleep/block_on_fd/run):
| Io method | fiber realization |
|---|---|
| spawn_raw(entry, arg, opts) -> *void | spawn a fiber whose body invokes entry(arg) (raw C-ABI thunk, not a closure; see Bridge below). Returns the *Fiber as the opaque handle. |
| suspend_raw(park) -> ! | suspend_self(), then on resume CHECK the current task's cancel flag and raise IoErr.Canceled if set. park.handle = the *Fiber to re-ready. This is the cancellation delivery point. |
| ready(park) | wake(park.handle as *Fiber) (already guarded on .suspended). |
| arm_timer(deadline_ms, park) -> *void | arm a Timer{deadline, fiber=park.handle} (today's sleep minus the self-suspend); return the timer handle so a cancel can evict it. |
| poll(deadline_ms) -> i64 | ONE iteration of the run loop: drain ready, then fire the earliest timer / block on fds up to deadline_ms. Returns the next pending deadline (or sentinel when idle). |
| now_ms() -> i64 | the virtual clock_ms (deterministic), NOT a wall clock; keeps 1817/1821-style tests reproducible. |
Scheduler.run() stays as the explicit DRIVER (the top-level loop that calls poll to quiescence),
installed via push Context { io = xx scheduler } { … s.run(); } — exactly the existing sched examples,
just with the scheduler now reachable as context.io.
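
The mapping above can be sketched with Python generators standing in for fibers. This is an illustration under stated assumptions, not the sched.sx code: `ToyScheduler`, `sleep`, and `worker` are hypothetical, `poll` takes no deadline and skips fd blocking, and the fiber handle is passed directly instead of through a ParkToken.

```python
import heapq


class ToyScheduler:
    """Generator-based stand-in for the fiber Io; every name is illustrative."""

    def __init__(self):
        self.clock_ms = 0    # virtual clock, advanced only by timers
        self.ready_q = []    # fibers runnable now
        self.timers = []     # min-heap of (deadline_ms, seq, fiber)
        self.seq = 0         # tie-breaker so fibers are never compared
        self.current = None  # the fiber being resumed

    def spawn_raw(self, entry, arg):
        fiber = entry(arg)   # the generator object is the opaque handle
        self.ready_q.append(fiber)
        return fiber

    def ready(self, fiber):
        self.ready_q.append(fiber)

    def arm_timer(self, deadline_ms, fiber):
        self.seq += 1
        heapq.heappush(self.timers, (deadline_ms, self.seq, fiber))

    def now_ms(self):
        return self.clock_ms

    def poll(self):
        # One loop iteration: drain the ready queue, then fire one timer.
        while self.ready_q:
            self.current = self.ready_q.pop(0)
            try:
                next(self.current)   # run to the next suspend (yield)
            except StopIteration:
                pass                 # fiber body returned
        if self.timers:
            deadline, _, fiber = heapq.heappop(self.timers)
            self.clock_ms = max(self.clock_ms, deadline)
            self.ready(fiber)

    def run(self):
        # Driver: call poll until nothing is ready and no timer is armed.
        while self.ready_q or self.timers:
            self.poll()


def sleep(sched, ms):
    sched.arm_timer(sched.now_ms() + ms, sched.current)
    yield  # plays the role of suspend_raw: park until re-readied


def worker(arg):
    sched, name, delay_ms, log = arg
    log.append((name, "start", sched.now_ms()))
    yield from sleep(sched, delay_ms)
    log.append((name, "done", sched.now_ms()))


def demo():
    sched, log = ToyScheduler(), []
    sched.spawn_raw(worker, (sched, "a", 20, log))
    sched.spawn_raw(worker, (sched, "b", 10, log))
    sched.run()
    return log, sched.now_ms()
```

Because the clock only moves when a timer fires, the log order and the final clock value are the same on every run, which is the property the virtual `now_ms` is meant to preserve.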
Status (2026-06-28)
- Follow-up: heap leak reclamation (fiber-env + async). DONE. Closed the documented per-spawn closure-env leak and most of the async leak, using only the existing closure.env / .fn_ptr field accessors (now also named by ClosureRaw / SliceRaw ABI-view structs in core.sx). NO compiler change.
  - Fiber body env: Scheduler.reap_fiber frees f.body.env via f.dctx.allocator (the spawn-time allocator snapshotted in dctx) at all 3 reap sites. 1820's live after deinit 3 → 0.
  - Async box + closure envs: sx_run_boxed_closure frees the ThunkBox, the completion-closure env, and the worker's env (new ThunkBox.worker_env) the instant the worker completes.
  - Async Future: two-flag ownership. Future.worker_done (set at the end of the completion closure) + consumed (set at the end of await); fut_release frees the heap Future (via the stored Future.alloc) when BOTH are set, so the LAST of {worker, await} reclaims it. await now CONSUMES the future (single-use; documented). Residual for an AWAITED future: 0 (lock: examples/concurrency/1827-...). A NEVER-awaited future (fire-and-forget / race loser) keeps only its Future struct (consumed never set): the structured-concurrency remainder, deferred.
  - Self-reviewed across orderings (await-after/before-complete, cancel-then-await, cancel-while-parked, double-free via await+deinit, race residual, blocking impl, cross-allocator reap): all deterministic, no UAF/double-free. Suite 855/0; byte-identical on aarch64-macOS + aarch64-linux; .ir churn (core.sx + Future/ThunkBox field additions) regenerated, only 1820 stdout changed otherwise.
- Phase 5, CONVERGE: retire the bespoke fiber async API. DONE. Io unification COMPLETE. The bespoke Task layer (Task/TaskState/TaskErr/go/wait/cancel(Task) + Scheduler.task_allocs and its deinit handling, ~130 lines) is removed from sched.sx. There is now ONE async stack: context.io.async/await/cancel/race/sleep over the Io protocol, with the Scheduler as the fiber Io's engine + driver (spawn/yield_now/suspend_self/wake/run/block_on_fd stay as the raw primitives). Migrated the four go/wait users to context.io: 1813 (interleave + cancel), 1817 (m1 end-to-end sum=123), 1819 (double-AWAIT loud-abort via the Future one-awaiter guard), 1820 (deinit: the go/task_allocs tasks dropped; it now exercises timers/io_waiters/kq cleanup, freed=2/live=3). race stays in sched.sx (needs meta.sx). Updated readme.md (the user-facing async section now documents context.io.async/await/race/sleep) and the stale sched.go/sched.Task comments in io.sx. Suite 854/0; no .ir churn (the Task removal touched no snapshotted IR); migrated examples byte-identical on aarch64-macOS + aarch64-linux. PLAN-IO-UNIFY Phases 0–5 all complete: the two parallel async stacks are now one, behind context.io.
- Phase 4: race over Futures via context.io.race. DONE. Re-homed the proven first-wins race from sched.race(*Task) onto *Future handles + the Io protocol; the old Task-based race is REPLACED (ufcs overload-by-receiver is rejected with "duplicate top-level decl", and only 1821 used it).
  - Protocol affordance: added Io.current_park() -> ParkToken (the running fiber as a token, captured WITHOUT parking) so race can register the SAME coordinator across N futures' park slots, then park once via suspend_raw; any completion ready()s it. Scheduler returns {self.current} (bails outside a fiber); CBlockingIo returns {null} (race never parks there: futures are born .ready). The await comment already anticipated this fan-in.
  - race (ufcs (io: Io, futures: $T) -> RaceResult(T), in sched.sx; it needs meta.sx's make_enum/make_variant, and pulling that into the io.sx prelude part-file would cycle): winner scan → register+park → deregister → make_variant the winner → Phase-3 cancel each loser (NO join). RaceResult reused unchanged (*Future(R) projects field 0 value → R).
  - Winner-time return: with true cancellation the parked losers stop at their next suspend (their timers evicted by cancel's wake), so race returns at the winner's virtual time, not the slowest loser's. 1821 re-pointed to context.io.async + context.io.race: winner a=111, losers .canceled, completion log ONLY task 1 @ 10ms, final clock 10ms (was 30 under the old cooperative join). Byte-identical on aarch64-macOS + aarch64-linux. Suite 853/0; .ir churn (current_park vtable method) regenerated, only 1821 stdout changed otherwise.
- Phase 3: TRUE cancellation via suspend_raw -> !. DONE. A cancelled async worker now abandons its body at its next suspend instead of running to completion. Pieces:
  - Cancel-flag back-ref (D4: back-ref pointer, chosen): SpawnOpts.cancel_flag: *void (core.sx) + Fiber.cancel_flag: *void (sched.sx), set from opts.cancel_flag in Scheduler.spawn_raw. async passes xx @f.canceled (the Future.canceled Atomic(bool) erased to *void).
  - Delivery: Scheduler.suspend_raw checks fiber_canceled(self.current) (a *Atomic(bool) load) PRE-park (raise without parking, so no deadlock if cancel landed before the worker ran) and POST-resume (cancel landed while parked), raising error.Canceled (a bare -> !; set inferred). cancel(f) flips the sticky flag, marks .canceled, and ready(.{handle=f.task})s the worker.
  - Worker is failable Closure() -> ($R, !): the async completion closure f.value = worker() catch { … } (the captured-failable-closure-call the Phase-3-prereq fix enabled) marks .canceled/.failed and wakes the awaiter; the worker's post-suspend side effects never run. New failable io.sleep(ms) (arm_timer + try suspend_raw) is the cancellation point.
  - Compiler gap fixed: a -> ! fn whose only error source is try-ing a protocol method (io.suspend_raw) was wrongly flagged "declared ! but never errors". collectErrorSites (error_analysis.zig) now sets a dyn flag for a try of a non-identifier callee (opaque error channel), suppressing the warning.
  - Two UAFs found by adversarial review and FIXED: (1) cancel-before-park orphaned io.sleep's armed timer → suspend_raw's pre-park raise now evicts the current fiber's timer/waiter first. (2) cancel(f) woke a possibly-reaped worker → now only wakes when was_pending (.pending before the store).
  - Migrated 1805/1806/1824 to failable workers. Lock: examples/concurrency/1825-concurrency-fiber-cancel-suspend.sx (seq: 1 -99; post-suspend line never runs). Validated byte-identical on aarch64-macOS host AND aarch64-linux container (1824 + 1825). Suite 853/0. Expected .ir churn (SpawnOpts layout) regenerated; no non-.ir snapshot changed.
- Phase 3 PREREQUISITE: captured-failable-closure call typing. DONE. The async completion closure (b.run = () => { f.value = worker() catch {…} }) captures a failable worker and consumes its error channel; the free-variable capture analysis (collectCaptures in src/ir/lower/closure.zig) did not descend into the error-handling / context / asm / multi-assign nodes, so worker was never captured: inside the lambda it resolved against an empty scope and typed as .unresolved (catch/try then rejected it). Fixed: added try_expr, catch_expr, onfail_stmt, raise_stmt, multi_assign, push_stmt, comptime_expr, insert_expr, spread_expr, asm_expr arms to collectCaptures. Adversarially reviewed (captures resolve, locals correctly excluded, no false-positive captures, 851/0). Lock: example examples/closures/0314-closures-capture-failable-call.sx (catch + try over a captured failable closure; pure language feature, host-only). The push_stmt arm also fixes the previously-noted "free-var analysis doesn't descend into a nested push Context {…}" gap. Phase 3 is now unblocked.
  - Two PRE-EXISTING, orthogonal bugs surfaced during review (neither blocked Phase 3): (1) calling a closure stored in a struct data field typed as unresolved (value → garbage; failable → can't catch). RESOLVED (issues/0201): CallResolver.plan gained a closure/fn-pointer field arm and the lowering closure-field arm now also handles bare .function fields; regression examples/closures/0315-closures-struct-field-call.sx. (2) asm write-through place through a deref (asm { … "+r" -> @(p.*) }) fails LLVM verification; repros with NO closure (independent of capture analysis); possibly an unsupported deref-place form rather than a confirmed bug, not filed.
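
The Phase 4 race sequence listed above (winner scan, register the same coordinator on every future, park once, deregister, cancel the losers without joining) can be sketched as follows. This is an illustrative Python model, not the sched.sx implementation: `Future`, `complete`, `cancel`, and the `woken` list are hypothetical stand-ins, and the result is returned directly instead of being wrapped in a RaceResult variant.

```python
class Future:
    def __init__(self, name):
        self.name = name
        self.state = "pending"   # pending | ready | canceled
        self.value = None
        self.park = None         # token of whoever waits on this future


def complete(f, value, woken):
    if f.state != "pending":
        return                   # only a pending future can complete
    f.value, f.state = value, "ready"
    if f.park is not None:
        woken.append(f.park)     # stands in for ready(park)


def cancel(f):
    if f.state == "pending":
        f.state = "canceled"


def race(futures, coordinator):
    """Generator: yields once if it has to park, then returns the winner."""
    winner = next((f for f in futures if f.state == "ready"), None)
    if winner is None:
        for f in futures:
            f.park = coordinator     # same token in every park slot
        yield                        # park once
        for f in futures:
            f.park = None            # deregister before returning
        winner = next(f for f in futures if f.state == "ready")
    for f in futures:
        if f is not winner:
            cancel(f)                # no join on the losers
    return winner


def demo():
    fa, fb = Future("a"), Future("b")
    woken = []
    r = race([fa, fb], coordinator="root")
    next(r)                          # no winner yet, so race parks
    complete(fb, 111, woken)         # first completion wakes the coordinator
    try:
        next(r)
    except StopIteration as stop:
        winner = stop.value
    return winner.name, winner.value, fa.state, fa.park, woken
```

Deregistering before returning is what keeps a later completion from waking a coordinator that has already moved on, which is the liveness concern the plan raises for fan-in.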
Status (2026-06-27)
- Phase 0: fibers inherit the spawn-time context. DONE (2f2d7f1d). Discovered during Phase 1: a fiber body ran under __sx_default_context (the abi(.c) fib_dispatch dropped the implicit context), so a scheduler installed as context.io was invisible inside a worker. Fixed: Scheduler.spawn snapshots context → Fiber.dctx; fib_dispatch re-pushes it. Behavior-preserving (suite 828/0), no cross-fiber leak (context is parameter-threaded per stack). Lock: example 1822.
- Phase 1: impl Io for Scheduler. DONE (5c30bfe0, hardened da7dd1f1). Six methods over the fiber primitives; spawn_raw bridges the erased (*void)->void worker thunk via an fn-ptr round-trip. Lock: example 1823 (spawn→arm→suspend→ready→resume entirely through context.io, deterministic). Adversarial review fixed: arm_timer/spawn_raw null guards, poll fd-pending abort + deadline_ms doc, stale fib_dispatch comment.
- Resolved design decisions: D1 = direct impl Io for Scheduler (chosen). D2 = now_ms returns the virtual clock_ms (deterministic); a real-clock variant is later. D4 = deferred to Phase 3.
- Phase 2: async/await colorblind over the fiber Io. DONE (967aed67, hardened ada8d162). async heap-allocs a *Future, boxes a completion closure in a monomorphic ThunkBox, and submits via io.spawn_raw (inline under CBlockingIo, a fiber under the scheduler); await parks via suspend_raw until ready. Protocol changed to suspend_raw(park: *ParkToken) (write-back of the awaiter). Workers are nullary (call-site capture). Migrated 1805/1806; adopted push .{ … }. Lock: example 1824 (deferral visible: 1 2 10 20 123). Review fixed: one-awaiter await guard; documented the Future allocator-lifetime contract + that cancel doesn't stop an already-spawned worker (Phase 3).
  - Resolved D2 (ParkToken): suspend_raw(*ParkToken) write-back (chosen over a registry). ready() liveness (CONCERN 6): safe for single async/await (awaiter is suspended, not reaped, when readied); race fan-in must still deregister (Phase 4).
  - Carried to convergence: async should capture the scheduler's long-lived allocator (like sched.go's own_allocator) instead of the call-site context.allocator; needs a protocol affordance; documented as a contract for now.
- Open for later phases:
  - ParkToken↔fiber binding. ready(park) needs park.handle = the awaiter *Fiber. The scheduler knows self.current at suspend; the cleanest is suspend_raw(park: *ParkToken) writing park.handle = self.current before parking (a small protocol change: the materializer installs thunks by name/order, signature-agnostic; verified low-risk). Decide vs a token→fiber registry.
  - ready() liveness (review CONCERN 6). Casting a stale/reaped *Fiber handle and wake-ing it is a latent UAF once real await runs: wake's .suspended value-check on freed bytes is luck, not safety. Phase 2 must guarantee single-ready / deregistration (mirror the bespoke-race deregister).
- Out-of-scope compiler bug found by review (not filed yet): closure free-var analysis does not descend into a nested push Context {…} block inside a closure body; a var used only there reports unresolved. Phase 0 sidesteps it (capture is at the Fiber level, not via closure), so it does NOT block the unification; worth an issues/ entry in a separate session.
Phases (each: implement → lock with an example → zig build test green → both platforms)
- impl Io for Scheduler (the vehicle). Implement the six methods over the fiber primitives. Add a Fiber.canceled/task back-ref so suspend_raw can raise on resume. Keep CBlockingIo intact. Lock: install the fiber Io into context.io, run a root fiber that suspend_raws and is ready()'d; asserts real park/resume through the protocol (not inline). Bridge (the one fiddly bit): async's generic Closure(..$args) -> $R worker → spawn_raw's raw entry/arg. Box the worker thunk on the heap; entry is a C-ABI (env: *void) -> void invoke-thunk (mirrors fib_dispatch), arg is the env.
- async/await over the fiber Io (real interleaving). Under a suspending Io, async calls spawn_raw and returns a PENDING Future($R) (no longer born .ready); the spawned body fills f.value/f.state and ready(f.park)s the awaiter. await(f) checks .ready else suspend_raw(f.park) then returns/raises: the suspending sibling of today's immediate await. CBlockingIo keeps the run-inline path (degenerate, still correct). Lock: two context.io.async tasks interleave under the fiber Io (the io.sx layer, replacing the bespoke sched.go).
- True cancellation via suspend_raw -> !. cancel(f) flips f.canceled AND ready(f.park)s / wakes the worker fiber so its NEXT suspend_raw raises IoErr.Canceled. The worker's suspends (await, a future io.sleep) propagate via try/!; the worker body unwinds, the future ends .canceled, its post-cancel side-effects DON'T run. This is the model-A "true cancellation", now delivered through the protocol, not bespoke. Lock: a cancelled task's work stops at its next suspend (assert via a shared log: the post-suspend line never prints).
- race over Futures: context.io.race((a: fa, b: fb)). Re-home the proven race logic (winner scan, deregister-all-on-wake, structured cancel+join of losers) from sched.race(*Task tuple) onto *Future handles + the Io protocol. The type-level machinery ports UNCHANGED (RaceResult($T), make_variant, the tuple reflection (GAP 1/2, all landed)); only the runtime swaps *Task → *Future and suspend_self → suspend_raw/ready. Cancellation of losers now uses Phase 3 (their next suspend raises), so race returns at WINNER-time, not slowest-loser-time. Lock: re-point 1821 at context.io.race; assert winner value + losers' work stopped (not merely flagged).
- Converge: retire the bespoke fiber async API. Fold sched.go/wait/cancel/race into the io.sx layer; Scheduler stays as the fiber Io's engine + driver. Migrate 1811–1821 to the context.io API. One async stack, all behind the protocol. Update the roadmap/checkpoints.
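
The cancellation phase above hinges on where the cancel flag is checked. A rough Python model follows, assuming the check happens both before parking and after resuming, as the status notes for Phase 3 describe. `Canceled`, `CancelFlag`, `step`, and the generator-based `suspend_raw` are hypothetical stand-ins for the sx constructs.

```python
class Canceled(Exception):
    """Stands in for IoErr.Canceled."""


class CancelFlag:
    """Stands in for the Future's sticky canceled flag."""

    def __init__(self):
        self.is_set = False


def suspend_raw(flag):
    if flag.is_set:        # PRE-park: cancel landed before we parked
        raise Canceled()
    yield                  # park until someone readies this fiber
    if flag.is_set:        # POST-resume: cancel landed while parked
        raise Canceled()


def worker(flag, log):
    log.append("before suspend")
    yield from suspend_raw(flag)
    log.append("after suspend")   # must not run once canceled


def step(fiber):
    """Resume the fiber once and report how that slice ended."""
    try:
        next(fiber)
        return "parked"
    except StopIteration:
        return "finished"
    except Canceled:
        return "canceled"


def demo():
    # Cancel while parked.
    flag, log1 = CancelFlag(), []
    fiber = worker(flag, log1)
    first = step(fiber)
    flag.is_set = True            # cancel(f): flip the flag, then wake
    second = step(fiber)

    # Cancel before the worker ever ran.
    early, log2 = CancelFlag(), []
    early.is_set = True
    third = step(worker(early, log2))
    return first, second, log1, third, log2
```

In both orderings the shared log stops at the pre-suspend line, which is the same assertion the Phase 3 lock example makes.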
Open decisions (need a call before/within the phase noted)
- D1 (Phase 1): impl Io for Scheduler vs a FiberIo wrapper. Direct impl makes context.io BE the scheduler (xx scheduler as the Io value, stateful receiver; mirrors the allocator xx local rule). A wrapper adds a level but decouples the public Io vtable from the scheduler internals. Lean: direct impl (simplest, matches the allocator convention).
- D2 (Phase 1): virtual vs real clock under the fiber Io. Tests need the deterministic virtual clock (clock_ms); a real deployment wants time.mono_ms. Thread it as a Scheduler mode, or two Io impls (FiberIo virtual-clock for tests, real-clock for prod). Lean: a clock: enum { virtual; real } field so one impl serves both; tests pin .virtual.
- D3 (Phase 2): Future(void) (issue 0150 SIGTRAP). A void-result task can't build Future(void) today. Defer (race/async target non-void), or fix the void struct-field path. Lean: defer, gate with a diagnostic.
- D4 (Phase 3): where the cancel flag lives. The Future already has canceled: Atomic(bool); the fiber needs to reach it from suspend_raw. Give Fiber a *Atomic(bool) back-ref to its future's flag (set at spawn_raw), so suspend_raw consults it with no per-suspend lookup. Lean: back-ref pointer.
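
The D2 lean (one implementation with a clock-mode field) could look roughly like this in Python. `Clock`, `ClockSource`, and `advance_to` are hypothetical names for illustration, and `time.monotonic_ns` stands in for whatever monotonic source a real deployment would use.

```python
import time
from enum import Enum


class Clock(Enum):
    VIRTUAL = "virtual"
    REAL = "real"


class ClockSource:
    def __init__(self, mode):
        self.mode = mode
        self.clock_ms = 0          # virtual time, moved only by fired timers

    def advance_to(self, deadline_ms):
        # Called when a timer fires; virtual time never goes backwards.
        self.clock_ms = max(self.clock_ms, deadline_ms)

    def now_ms(self):
        if self.mode is Clock.VIRTUAL:
            return self.clock_ms   # deterministic, suitable for tests
        return time.monotonic_ns() // 1_000_000
```

Tests would construct `ClockSource(Clock.VIRTUAL)`, so timestamps depend only on which timers fired, not on how long the run took.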
Validation (every phase)
- zig build && zig build test green (full corpus).
- New/changed 18xx examples byte-identical on aarch64-macOS host AND aarch64-linux container (deterministic virtual clock).
- Adversarial review of each phase (worker + read-only reviewer), per the session workflow.
What this supersedes
- sched.sx's bespoke go/wait/cancel/race (Phase 5 retires them; the proven logic moves onto the protocol). The just-landed race (commit 9099735e) is the reference logic for Phase 4, not the final home.
- PLAN-RACE.md's "race on sched.Scheduler" framing: this plan moves it onto context.io per the roadmap's §A5 / §4.6 design-of-record.