sx/current/PLAN-IO-UNIFY.md
agra aae7d72a66 refactor: retire bespoke Task async; one stack behind context.io (Phase 5)
Converge the Io unification (PLAN-IO-UNIFY Phase 5). The bespoke fiber-task layer
in sched.sx — Task / TaskState / TaskErr / go / wait / cancel(Task), plus
Scheduler.task_allocs and its deinit bookkeeping (~130 lines) — is removed. There
is now ONE async stack: context.io.async / await / cancel / race / sleep over the
Io protocol, with the Scheduler as the fiber Io's engine + driver (spawn /
yield_now / suspend_self / wake / run / block_on_fd remain as the raw primitives;
race stays in sched.sx because it needs meta.sx's make_enum/make_variant).

Migrated the four go/wait users to context.io:
- 1813 — interleave + cancel (sequence 1 2 3 42 100 -99)
- 1817 — m1 end-to-end (completion in deadline order, sum 123)
- 1819 — double-AWAIT loud-abort via the Future one-awaiter guard
- 1820 — deinit: dropped the go/task_allocs tasks; now exercises timers/io_waiters/
  kq cleanup (freed=2, live=3 = the documented per-spawn closure-env residual)

Updated readme.md (the user-facing async section documents context.io.async /
await / race / sleep) and the stale sched.go/sched.Task comments in io.sx.

Suite 854/0; no .ir churn (Task removal touched no snapshotted IR); migrated
examples byte-identical on aarch64-macOS + aarch64-linux. PLAN-IO-UNIFY Phases 0-5
all complete — the two parallel async stacks are now one, behind context.io.
2026-06-28 10:14:17 +03:00

# PLAN-IO-UNIFY — fold the fiber scheduler behind `context.io`, re-home `race`
## Why
Today there are **two parallel async stacks**:
| stack | behind `context.io`? | real suspension? | cancellation channel |
|---|---|---|---|
| io.sx `async`/`await`/`cancel`/`Future` | yes (`impl Io for CBlockingIo`) | **no** — runs the worker inline to completion | `suspend_raw -> !` / `IoErr.Canceled` (designed, unused) |
| sched.sx `go`/`wait`/`cancel`/`race` (just landed) | **no** | yes (`swap_context` fibers) | none — `suspend_self -> void` |
`context.io` is structurally Zig's `std.Io` (an `Io` protocol carried *implicitly* in `Context` — better
ergonomics than Zig's explicit `io:` param), and the roadmap (§A5, §4.6) already says the fiber
scheduler should be **one of its `Io` vtables** and that `race` is **`context.io.race(..)` over Futures**.
The just-landed `race` on `sched.Scheduler` over `*Task` is the proven LOGIC at the wrong LAYER.
**Goal:** make the fiber `Scheduler` an `impl Io`, lift `async`/`await`/`cancel`/`race` onto the `Io`
protocol so they run colorblind under either impl, and let cancellation fall out of the existing
`suspend_raw -> !` contract (the "true cancellation, model A" the user picked — already the interface's
design). One async stack, behind `context.io`.
## The fiber → `Io` mapping (the crux)
`Io :: protocol { spawn_raw, suspend_raw -> !, ready, poll, now_ms, arm_timer }` (core.sx). Map each onto
the existing fiber primitives in sched.sx (`spawn`/`suspend_self`/`wake`/`sleep`/`block_on_fd`/`run`):
| `Io` method | fiber realization |
|---|---|
| `spawn_raw(entry, arg, opts) -> *void` | `spawn` a fiber whose body invokes `entry(arg)` (raw C-ABI thunk, not a closure — see Bridge below). Returns the `*Fiber` as the opaque handle. |
| `suspend_raw(park) -> !` | `suspend_self()`, then on resume CHECK the current task's cancel flag and `raise IoErr.Canceled` if set. `park.handle` = the `*Fiber` to re-ready. **This is the cancellation delivery point.** |
| `ready(park)` | `wake(park.handle as *Fiber)` (already guarded on `.suspended`). |
| `arm_timer(deadline_ms, park) -> *void` | arm a `Timer{deadline, fiber=park.handle}` (today's `sleep` minus the self-suspend); return the timer handle so a cancel can evict it. |
| `poll(deadline_ms) -> i64` | ONE iteration of the `run` loop: drain ready, then fire the earliest timer / block on fds up to `deadline_ms`. Returns the next pending deadline (or sentinel when idle). |
| `now_ms() -> i64` | the virtual `clock_ms` (deterministic), NOT a wall clock — keeps 1817/1821-style tests reproducible. |
`Scheduler.run()` stays as the explicit DRIVER (the top-level loop that calls `poll` to quiescence),
installed via `push Context { io = xx scheduler } { … s.run(); }` — exactly the existing sched examples,
just with the scheduler now reachable as `context.io`.
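
The mapping above can be modelled outside sx. The following is a minimal Python sketch, not the sched.sx implementation: generators stand in for `swap_context` fibers, a bare `yield` stands in for `suspend_raw`, `poll` takes no `deadline_ms`, and `spawn_raw` takes no `opts`. The class layout, the `worker` function, and the `-1` idle sentinel are assumptions made for illustration.

```python
import heapq
from collections import deque


class Scheduler:
    """Toy fiber engine: a generator plays the role of a fiber."""

    def __init__(self):
        self.clock_ms = 0        # virtual clock, advanced only by poll()
        self.ready_q = deque()   # fibers that can run now
        self.timers = []         # heap of (deadline_ms, seq, fiber)
        self.seq = 0             # tie-breaker so fibers are never compared
        self.current = None      # the fiber being run, if any

    def spawn_raw(self, entry, arg):
        fiber = entry(arg)       # each `yield` in the body is a park point
        self.ready_q.append(fiber)
        return fiber             # opaque handle

    def ready(self, fiber):
        self.ready_q.append(fiber)

    def arm_timer(self, deadline_ms, fiber):
        self.seq += 1
        heapq.heappush(self.timers, (deadline_ms, self.seq, fiber))

    def now_ms(self):
        return self.clock_ms

    def poll(self):
        """One loop iteration: drain ready fibers, then fire the earliest timer."""
        while self.ready_q:
            self.current = self.ready_q.popleft()
            try:
                next(self.current)   # run until the fiber parks
            except StopIteration:
                pass                 # the fiber finished
        self.current = None
        if not self.timers:
            return -1                # sentinel: nothing left to wait for
        deadline, _, fiber = heapq.heappop(self.timers)
        self.clock_ms = max(self.clock_ms, deadline)
        self.ready(fiber)
        return deadline

    def run(self):
        """Explicit driver: poll to quiescence."""
        while self.poll() != -1:
            pass


log = []


def worker(args):
    sched, name, delay_ms = args
    log.append((name, "start", sched.now_ms()))
    sched.arm_timer(sched.now_ms() + delay_ms, sched.current)
    yield                            # park until the timer readies this fiber
    log.append((name, "done", sched.now_ms()))


s = Scheduler()
s.spawn_raw(worker, (s, "a", 20))
s.spawn_raw(worker, (s, "b", 10))
s.run()
```

Because `now_ms` reads the virtual clock, the completion order in `log` depends only on the armed deadlines, which is the property that keeps the 18xx examples reproducible.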
## Status (2026-06-28)
- **Phase 5 — CONVERGE: retire the bespoke fiber async API. DONE. Io unification
COMPLETE.** The bespoke `Task` layer (`Task`/`TaskState`/`TaskErr`/`go`/`wait`/
`cancel(Task)` + `Scheduler.task_allocs` and its deinit handling, ~130 lines)
is removed from sched.sx. There is now ONE async stack: `context.io.async`/
`await`/`cancel`/`race`/`sleep` over the `Io` protocol, with the `Scheduler` as
the fiber Io's engine + driver (`spawn`/`yield_now`/`suspend_self`/`wake`/`run`/
`block_on_fd` stay as the raw primitives). Migrated the four `go`/`wait` users to
`context.io`: 1813 (interleave + cancel), 1817 (m1 end-to-end sum=123), 1819
(double-AWAIT loud-abort via the Future one-awaiter guard), 1820 (deinit — the
`go`/`task_allocs` tasks dropped; it now exercises timers/io_waiters/kq cleanup,
`freed=2`/`live=3`). `race` stays in sched.sx (needs meta.sx). Updated readme.md
(the user-facing async section now documents `context.io.async`/`await`/`race`/
`sleep`) and the stale `sched.go`/`sched.Task` comments in io.sx. Suite 854/0; no
`.ir` churn (the Task removal touched no snapshotted IR); migrated examples
byte-identical on aarch64-macOS + aarch64-linux. **PLAN-IO-UNIFY Phases 0-5 all
complete — the two parallel async stacks are now one, behind `context.io`.**
- **Phase 4 — `race` over Futures via `context.io.race`. DONE.** Re-homed the
proven first-wins race from `sched.race(*Task)` onto `*Future` handles + the
`Io` protocol; the old Task-based `race` is REPLACED (ufcs overload-by-receiver
is rejected — "duplicate top-level decl" — and only 1821 used it).
  - **Protocol affordance:** added `Io.current_park() -> ParkToken` (the running
    fiber as a token, captured WITHOUT parking) so race can register the SAME
    coordinator across N futures' `park` slots, then park once via `suspend_raw`;
    any completion `ready`s it. Scheduler returns `{self.current}` (bails outside
    a fiber); CBlockingIo returns `{null}` (race never parks there — futures born
    `.ready`). The await comment already anticipated this fan-in.
  - **race** (`ufcs (io: Io, futures: $T) -> RaceResult(T)`, in sched.sx — it
    needs meta.sx's `make_enum`/`make_variant`, and pulling that into the io.sx
    prelude part-file would cycle): winner scan → register+park → deregister →
    `make_variant` the winner → Phase-3 `cancel` each loser (NO join). `RaceResult`
    reused unchanged (`*Future(R)` projects field 0 `value` → R).
  - **Winner-time return:** with true cancellation the parked losers stop at their
    next suspend (their timers evicted by cancel's wake), so race returns at the
    winner's virtual time, not the slowest loser's. 1821 re-pointed to
    `context.io.async` + `context.io.race`: `winner a=111`, losers `.canceled`,
    completion log ONLY `task 1 @ 10ms`, final clock `10ms` (was 30 under the old
    cooperative join). Byte-identical on aarch64-macOS + aarch64-linux. Suite
    853/0; `.ir` churn (current_park vtable method) regenerated, only 1821 stdout
    changed otherwise.
- **Phase 3 — TRUE cancellation via `suspend_raw -> !`. DONE.** A cancelled async
worker now abandons its body at its next suspend instead of running to
completion. Pieces:
  - **Cancel-flag back-ref (D4 — back-ref pointer, chosen):** `SpawnOpts.cancel_flag:
    *void` (core.sx) + `Fiber.cancel_flag: *void` (sched.sx), set from
    `opts.cancel_flag` in `Scheduler.spawn_raw`. `async` passes `xx @f.canceled`
    (the `Future.canceled` `Atomic(bool)` erased to `*void`).
  - **Delivery:** `Scheduler.suspend_raw` checks `fiber_canceled(self.current)` (a
    `*Atomic(bool)` load) PRE-park (raise without parking — no deadlock if cancel
    landed before the worker ran) and POST-resume (cancel landed while parked),
    raising `error.Canceled` (a bare `-> !`; set inferred). `cancel(f)` flips the
    sticky flag, marks `.canceled`, and `ready(.{handle=f.task})`s the worker.
  - **Worker is failable** `Closure() -> ($R, !)`: the `async` completion closure
    `f.value = worker() catch { … }` (the captured-failable-closure-call the
    Phase-3-prereq fix enabled) marks `.canceled`/`.failed` and wakes the awaiter;
    the worker's post-suspend side effects never run. New failable `io.sleep(ms)`
    (arm_timer + `try suspend_raw`) is the cancellation point.
  - **Compiler gap fixed:** a `-> !` fn whose only error source is `try`-ing a
    protocol method (`io.suspend_raw`) was wrongly flagged "declared `!` but never
    errors". `collectErrorSites` (error_analysis.zig) now sets a `dyn` flag for a
    `try` of a non-identifier callee (opaque error channel), suppressing the
    warning.
  - **Two UAFs found by adversarial review and FIXED:** (1) cancel-before-park
    orphaned `io.sleep`'s armed timer → `suspend_raw`'s pre-park raise now evicts
    the current fiber's timer/waiter first. (2) `cancel(f)` woke a possibly-reaped
    worker → now only wakes when `was_pending` (`.pending` before the store).
  - Migrated 1805/1806/1824 to failable workers. Lock:
    `examples/concurrency/1825-concurrency-fiber-cancel-suspend.sx` (`seq: 1 -99`
    — post-suspend line never runs). **Validated byte-identical on aarch64-macOS
    host AND aarch64-linux container** (1824 + 1825). Suite 853/0. Expected `.ir`
    churn (SpawnOpts layout) regenerated; no non-`.ir` snapshot changed.
- **Phase 3 PREREQUISITE — captured-failable-closure call typing. DONE.** The
async completion closure (`b.run = () => { f.value = worker() catch {…} }`)
captures a failable `worker` and consumes its error channel; the free-variable
capture analysis (`collectCaptures` in `src/ir/lower/closure.zig`) did not
descend into the error-handling / context / asm / multi-assign nodes, so
`worker` was never captured — inside the lambda it resolved against an empty
scope and typed as `.unresolved` (`catch`/`try` then rejected it). Fixed: added
`try_expr`, `catch_expr`, `onfail_stmt`, `raise_stmt`, `multi_assign`,
`push_stmt`, `comptime_expr`, `insert_expr`, `spread_expr`, `asm_expr` arms to
`collectCaptures`. Adversarially reviewed (captures resolve, locals correctly
excluded, no false-positive captures, 851/0). Lock: example
`examples/closures/0314-closures-capture-failable-call.sx` (catch + try over a
captured failable closure; pure language feature, host-only). The `push_stmt`
arm also fixes the previously-noted "free-var analysis doesn't descend into a
nested `push Context {…}`" gap. **Phase 3 is now unblocked.**
  - Two PRE-EXISTING, orthogonal bugs surfaced during review (neither blocked
    Phase 3): (1) calling a closure stored in a **struct data field** typed as
    `unresolved` (value → garbage; failable → can't `catch`) — **RESOLVED**
    (`issues/0201`): `CallResolver.plan` gained a closure/fn-pointer field arm and
    the lowering closure-field arm now also handles bare `.function` fields;
    regression `examples/closures/0315-closures-struct-field-call.sx`. (2) asm
    write-through place through a deref (`asm { … "+r" -> @(p.*) }`) fails LLVM
    verification — repros with NO closure (independent of capture analysis);
    possibly an unsupported deref-place form rather than a confirmed bug, not
    filed.
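
The Phase 3 delivery rule above (check the cancel flag before parking and again after resume, and wake the worker only when it was still pending) can be modelled in a few lines. This is a hedged Python sketch, not the sx code: a generator stands in for the worker fiber, a one-element list stands in for the `*Atomic(bool)` back-ref, and `cancel` resumes the worker directly instead of going through `ready` and a scheduler. The names `step`, `Canceled`, and the log values are assumptions chosen to echo the shape of the 1825 lock.

```python
class Canceled(Exception):
    """Stands in for the error a cancelled suspend raises."""


class Fiber:
    def __init__(self, body):
        self.cancel_flag = [False]   # stands in for the *Atomic(bool) back-ref
        self.state = "pending"       # pending | ready | canceled
        self.gen = body(self)        # created but not started yet


def suspend_raw(fiber):
    """Park point: check the flag before parking and again after resume."""
    if fiber.cancel_flag[0]:
        raise Canceled()             # pre-park: never park a cancelled fiber
    yield                            # park; the driver resumes the fiber later
    if fiber.cancel_flag[0]:
        raise Canceled()             # post-resume: cancel landed while parked


def step(fiber):
    """Resume the fiber once and record how it ended, if it ended."""
    try:
        next(fiber.gen)
    except StopIteration:
        fiber.state = "ready"
    except Canceled:
        fiber.state = "canceled"


def cancel(fiber):
    was_pending = fiber.state == "pending"
    fiber.cancel_flag[0] = True      # sticky flag
    if was_pending:
        step(fiber)                  # wake only a worker that is still live


log = []


def worker(fiber):
    log.append(1)
    yield from suspend_raw(fiber)
    log.append(2)                    # post-suspend work: skipped after a cancel


f = Fiber(worker)
step(f)                              # runs to the park point
cancel(f)                            # the parked suspend raises on resume
if f.state == "canceled":
    log.append(-99)
```

The `was_pending` guard is what keeps `cancel` from touching a worker that has already finished, which is the second of the two use-after-free cases described above.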
## Status (2026-06-27)
- **Phase 0 — fibers inherit the spawn-time context. DONE** (`2f2d7f1d`). Discovered during Phase 1: a
fiber body ran under `__sx_default_context` (the `abi(.c)` `fib_dispatch` dropped the implicit
context), so a scheduler installed as `context.io` was invisible inside a worker. Fixed:
`Scheduler.spawn` snapshots `context` → `Fiber.dctx`; `fib_dispatch` re-pushes it. Behavior-preserving
(suite 828/0), no cross-fiber leak (context is parameter-threaded per stack). Lock: example 1822.
- **Phase 1 — `impl Io for Scheduler`. DONE** (`5c30bfe0`, hardened `da7dd1f1`). Six methods over the
fiber primitives; `spawn_raw` bridges the erased `(*void)->void` worker thunk via an fn-ptr round-trip.
Lock: example 1823 (spawn→arm→suspend→ready→resume entirely through `context.io`, deterministic).
Adversarial review fixed: `arm_timer`/`spawn_raw` null guards, `poll` fd-pending abort + `deadline_ms`
doc, stale `fib_dispatch` comment.
- **Resolved design decisions:** D1 = direct `impl Io for Scheduler` (chosen). D2 = `now_ms` returns the
virtual `clock_ms` (deterministic) — a real-clock variant is later. D4 = deferred to Phase 3.
- **Phase 2 — `async`/`await` colorblind over the fiber Io. DONE** (`967aed67`, hardened `ada8d162`).
`async` heap-allocs a `*Future`, boxes a completion closure in a monomorphic `ThunkBox`, and submits
via `io.spawn_raw` (inline under `CBlockingIo`, a fiber under the scheduler); `await` parks via
`suspend_raw` until ready. Protocol changed to `suspend_raw(park: *ParkToken)` (write-back of the
awaiter). Workers are nullary (call-site capture). Migrated 1805/1806; adopted `push .{ … }`. Lock:
example 1824 (deferral visible: `1 2 10 20 123`). Review fixed: one-awaiter `await` guard; documented
the Future allocator-lifetime contract + that `cancel` doesn't stop an already-spawned worker (Phase 3).
- **Resolved D2 (ParkToken):** `suspend_raw(*ParkToken)` write-back (chosen over a registry). **ready()
liveness (CONCERN 6):** safe for single async/await (awaiter is suspended, not reaped, when readied);
`race` fan-in must still deregister (Phase 4).
- **Carried to convergence:** `async` should capture the scheduler's long-lived allocator (like
`sched.go`'s `own_allocator`) instead of the call-site `context.allocator` — needs a protocol
affordance; documented as a contract for now.
- **Open for later phases:**
  - **ParkToken↔fiber binding.** `ready(park)` needs `park.handle` = the awaiter `*Fiber`. The scheduler
    knows `self.current` at suspend; the cleanest is `suspend_raw(park: *ParkToken)` writing
    `park.handle = self.current` before parking (a small protocol change: the materializer installs
    thunks by name/order, signature-agnostic — verified low-risk). Decide vs a token→fiber registry.
  - **`ready()` liveness (review CONCERN 6).** Casting a stale/reaped `*Fiber` handle and `wake`-ing it is
    a latent UAF once real `await` runs — `wake`'s `.suspended` value-check on freed bytes is luck, not
    safety. Phase 2 must guarantee single-ready / deregistration (mirror the bespoke-race deregister).
- **Out-of-scope compiler bug found by review (not filed yet):** closure free-var analysis does not
descend into a nested `push Context {…}` block inside a closure body — a var used only there reports
`unresolved`. Phase 0 sidesteps it (capture is at the `Fiber` level, not via closure), so it does NOT
block the unification; worth an `issues/` entry in a separate session.
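
The Phase 0 fix (snapshot the context at spawn, re-push it at dispatch) has a close analogue in Python's `contextvars`, which makes the behaviour easy to demonstrate. The sketch below is an analogy under stated assumptions, not the sx mechanism: `io_slot` stands in for the `io` field of the implicit `Context`, and the tuple returned by `spawn` stands in for a `Fiber` carrying its `dctx`.

```python
import contextvars

# Stand-in for the `io` slot of the implicit Context.
io_slot = contextvars.ContextVar("io", default="default-io")


def spawn(body):
    """Capture the spawner's context alongside the body (the `dctx` snapshot)."""
    return (body, contextvars.copy_context())


def dispatch(fiber):
    """Run the body under its spawn-time snapshot, not the driver's context."""
    body, dctx = fiber
    return dctx.run(body)


def worker():
    return io_slot.get()


def install_and_spawn():
    # Equivalent in spirit to pushing a Context with a scheduler as `io`.
    io_slot.set("fiber-scheduler")
    return spawn(worker)


# The push scope: the set() is confined to this copied context.
fiber = contextvars.copy_context().run(install_and_spawn)

seen_by_driver = io_slot.get()     # the driver still sees the default
seen_by_worker = dispatch(fiber)   # the worker sees the spawn-time value
```

Without the snapshot, `worker` would read whatever context the dispatcher happens to run under, which is the failure mode Phase 0 describes for `fib_dispatch`.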
## Phases (each: implement → lock with an example → `zig build test` green → both platforms)
1. **`impl Io for Scheduler` (the vehicle).** Implement the six methods over the fiber primitives. Add
a `Fiber.canceled`/task back-ref so `suspend_raw` can raise on resume. Keep `CBlockingIo` intact.
Lock: install the fiber Io into `context.io`, run a root fiber that `suspend_raw`s and is `ready()`'d —
asserts real park/resume through the protocol (not inline). **Bridge** (the one fiddly bit): `async`'s
generic `Closure(..$args) -> $R` worker → `spawn_raw`'s raw `entry/arg`. Box the worker thunk on the
heap; `entry` is a C-ABI `(env: *void) -> void` invoke-thunk (mirrors `fib_dispatch`), `arg` is the env.
2. **`async`/`await` over the fiber Io (real interleaving).** Under a suspending Io, `async` calls
`spawn_raw` and returns a PENDING `Future($R)` (no longer born `.ready`); the spawned body fills
`f.value`/`f.state` and `ready(f.park)`s the awaiter. `await(f)` checks `.ready` else `suspend_raw(f.park)`
then returns/raises — the suspending sibling of today's immediate `await`. `CBlockingIo` keeps the
run-inline path (degenerate, still correct). Lock: two `context.io.async` tasks interleave under the
fiber Io (the io.sx layer, replacing the bespoke `sched.go`).
3. **True cancellation via `suspend_raw -> !`.** `cancel(f)` flips `f.canceled` AND `ready(f.park)`s /
wakes the worker fiber so its NEXT `suspend_raw` raises `IoErr.Canceled`. The worker's suspends
(`await`, a future `io.sleep`) propagate via `try`/`!`; the worker body unwinds, the future ends
`.canceled`, its post-cancel side-effects DON'T run. This is the model-A "true cancellation" — now
delivered through the protocol, not bespoke. Lock: a cancelled task's work stops at its next suspend
(assert via a shared log: the post-suspend line never prints).
4. **`race` over Futures — `context.io.race((a: fa, b: fb))`.** Re-home the proven race logic (winner
scan, deregister-all-on-wake, structured cancel+join of losers) from `sched.race(*Task tuple)` onto
`*Future` handles + the `Io` protocol. The type-level machinery ports UNCHANGED — `RaceResult($T)`,
`make_variant`, the tuple reflection (GAP 1/2, all landed) — only the runtime swaps `*Task`→`*Future`
and `suspend_self`→`suspend_raw`/`ready`. Cancellation of losers now uses Phase 3 (their next suspend
raises), so `race` returns at WINNER-time, not slowest-loser-time. Lock: re-point 1821 at
`context.io.race`; assert winner value + losers' work stopped (not merely flagged).
5. **Converge — retire the bespoke fiber async API.** Fold `sched.go`/`wait`/`cancel`/`race` into the
io.sx layer; `Scheduler` stays as the fiber Io's engine + driver. Migrate 1811-1821 to the
`context.io` API. One async stack, all behind the protocol. Update the roadmap/checkpoints.
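
The winner-time property that Phase 4 relies on can be checked with a small model. This is a Python sketch under assumptions, not the `context.io.race` implementation: tasks are generators that sleep on a virtual-clock timer, `race` takes a list rather than a named tuple, and there is no `RaceResult` variant. The loser values 222 and 333 are made up for the example; the 10/20/30 ms delays are chosen to echo the 1821 description.

```python
import heapq


class Canceled(Exception):
    """Stands in for the error a cancelled suspend raises."""


class Task:
    def __init__(self, name, value):
        self.name, self.value = name, value
        self.state = "pending"       # pending | ready | canceled
        self.canceled = False
        self.gen = None


class Sched:
    def __init__(self):
        self.clock_ms = 0            # virtual clock
        self.timers = []             # heap of (deadline_ms, seq, task)
        self.seq = 0
        self.log = []                # completion log

    def spawn(self, name, delay_ms, value):
        task = Task(name, value)
        task.gen = self._sleep_then_finish(task, delay_ms)
        self.resume(task)            # run to the first park point
        return task

    def _sleep_then_finish(self, task, delay_ms):
        self.seq += 1
        heapq.heappush(self.timers, (self.clock_ms + delay_ms, self.seq, task))
        yield                        # park (the sleep)
        if task.canceled:
            raise Canceled()         # cancellation delivered at the suspend
        self.log.append(f"{task.name} @ {self.clock_ms}ms")

    def resume(self, task):
        try:
            next(task.gen)
        except StopIteration:
            task.state = "ready"
        except Canceled:
            task.state = "canceled"

    def cancel(self, task):
        if task.state != "pending":
            return
        task.canceled = True
        # Evict the task's timer so the clock never advances to it.
        self.timers = [t for t in self.timers if t[2] is not task]
        heapq.heapify(self.timers)
        self.resume(task)            # wake it; its suspend raises Canceled

    def race(self, tasks):
        # Assumes every pending task has an armed timer.
        while not any(t.state == "ready" for t in tasks):
            deadline, _, task = heapq.heappop(self.timers)
            self.clock_ms = deadline
            self.resume(task)
        winner = next(t for t in tasks if t.state == "ready")
        for t in tasks:
            if t is not winner:
                self.cancel(t)       # no join: losers stop at their suspend
        return winner


s = Sched()
a = s.spawn("task 1", 10, 111)
b = s.spawn("task 2", 20, 222)
c = s.spawn("task 3", 30, 333)
winner = s.race([a, b, c])
```

Because `cancel` evicts each loser's timer before waking it, the clock stops at the winner's deadline. A cooperative join would instead have to advance to the slowest loser's deadline before returning.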
## Open decisions (need a call before/within the phase noted)
- **D1 (Phase 1) — `impl Io for Scheduler` vs a `FiberIo` wrapper.** Direct impl makes `context.io` BE the
scheduler (`xx scheduler` as the Io value, stateful receiver — mirrors the allocator `xx local` rule).
A wrapper adds a level but decouples the public Io vtable from the scheduler internals. *Lean: direct
impl* (simplest, matches the allocator convention).
- **D2 (Phase 1) — virtual vs real clock under the fiber Io.** Tests need the deterministic virtual clock
(`clock_ms`); a real deployment wants `time.mono_ms`. Thread it as a Scheduler mode, or two Io impls
(`FiberIo` virtual-clock for tests, real-clock for prod). *Lean: a `clock: enum { virtual; real }` field
so one impl serves both; tests pin `.virtual`.*
- **D3 (Phase 2) — `Future(void)` (issue 0150 SIGTRAP).** A `void`-result task can't build `Future(void)`
today. Defer (race/async target non-void), or fix the `void` struct-field path. *Lean: defer, gate with
a diagnostic.*
- **D4 (Phase 3) — where the cancel flag lives.** The `Future` already has `canceled: Atomic(bool)`; the
fiber needs to reach it from `suspend_raw`. Give `Fiber` a `*Atomic(bool)` back-ref to its future's flag
(set at `spawn_raw`), so `suspend_raw` consults it with no per-suspend lookup. *Lean: back-ref pointer.*
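
The D2 lean (one implementation with a clock-mode field) might look roughly like the following. This is a Python sketch of the idea only: `ClockMode`, `Clock`, and `advance_to` are hypothetical names, and `time.monotonic_ns` stands in for the `time.mono_ms` source mentioned above.

```python
import enum
import time


class ClockMode(enum.Enum):
    VIRTUAL = "virtual"
    REAL = "real"


class Clock:
    def __init__(self, mode):
        self.mode = mode
        self.clock_ms = 0            # only meaningful in VIRTUAL mode

    def now_ms(self):
        if self.mode is ClockMode.VIRTUAL:
            return self.clock_ms     # deterministic: moves only via advance_to
        return time.monotonic_ns() // 1_000_000

    def advance_to(self, deadline_ms):
        """Called by the driver when it fires a timer; a no-op on the real clock."""
        if self.mode is ClockMode.VIRTUAL:
            self.clock_ms = max(self.clock_ms, deadline_ms)


virtual = Clock(ClockMode.VIRTUAL)
virtual.advance_to(10)
virtual.advance_to(5)                # never moves backwards
real = Clock(ClockMode.REAL)
```

Tests would pin `ClockMode.VIRTUAL` so that timer order alone decides what `now_ms` returns.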
## Validation (every phase)
- `zig build && zig build test` green (full corpus).
- New/changed `18xx` examples byte-identical on aarch64-macOS host AND aarch64-linux container
(deterministic virtual clock).
- Adversarial review of each phase (worker + read-only reviewer), per the session workflow.
## What this supersedes
- `sched.sx`'s bespoke `go`/`wait`/`cancel`/`race` (Phase 5 retires them; the proven logic moves onto the
protocol). The just-landed `race` (commit `9099735e`) is the reference logic for Phase 4, not the final
home.
- PLAN-RACE.md's "race on `sched.Scheduler`" framing — this plan moves it onto `context.io` per the
roadmap's §A5 / §4.6 design-of-record.