fibers: checkpoint + plan for B1.5a/B1.4a; next is B1.4b (deterministic-sim Io)

agra
2026-06-21 18:44:11 +03:00
parent 8367ad18b1
commit 02ab077bfb
2 changed files with 239 additions and 32 deletions


@@ -4,7 +4,86 @@ Companion to [PLAN-FIBERS.md](PLAN-FIBERS.md). Update after every step (one step
per the cadence rule). New corpus category: `18xx` concurrency.
## Last completed step
**B1.3b-1 — the x86_64 / Win64 `swap_context` sibling — VALIDATED on real hardware.** The
**B1.4a — a truly-SUSPENDING fiber-task async layer (`go`/`wait`/`cancel`) — landed +
adversarially reviewed; cleared two more compiler blockers en route.** `library/modules/std/sched.sx`
now carries `Task($R)` + `Scheduler.go(work) -> *Task($R)` + `wait`/`cancel` (a `ufcs` layer over
the M:1 scheduler). `s.go(work)` runs the nullary thunk `work` as a REAL fiber; `t.wait()` SUSPENDS
the caller until it completes (vs io.sx's blocking `context.io.async`, which runs inline). Locked by
`examples/concurrency/1813-concurrency-fiber-async-suspend.sx`: two tasks interleave (A yields
mid-body so B runs first → `1 2 3`), awaited values `42`/`100`, and a canceled task's `wait` raises
`.Canceled` → `or -99` → `sequence: 1 2 3 42 100 -99`.
- **Design: a NULLARY thunk, not `async(worker, ..args)`.** A comptime variadic pack can't cross a
deferred (fiber) boundary — `..args` captured into a closure re-expands from the spawner's
now-gone locals (issue 0156 Part 2). So `go` takes `work: Closure() -> $R`; the user captures
inputs in the lambda at the call site (the `go func(){…}()` idiom). **Self-contained in sched.sx**
(NOT io.sx): io.sx importing sched.sx duplicates the `_fib_tramp` global asm when a program also
imports sched.sx directly (global asm emits per import-path) — so the Io-protocol
`spawn_raw`/`suspend_raw`/`ready` hooks stay reserved for the future M:N model; M:1 uses
`go`/`wait` directly. Heap `*Task` (must outlive `go`'s frame; leak documented). `TaskErr` is
LOCAL (the `!` failable detection doesn't see through io.sx's `IoErr` re-export alias).
- **Two compiler blockers hit + FIXED (user-authorized in-session):**
- **issue 0156 Part 1** — a single-type generic `$R` (parsed as `comptime_pack_ref`) used as a
type-arg (`Box($R)`, `size_of(Box($R))`) inside a pack-fn body hit a missing arm in
`resolveTypeWithBindings` → `.unresolved` → LLVM panic. Fix: mirror `resolveTypeArg`'s
`comptime_pack_ref` arm (look up `type_bindings`, else a loud diagnostic). Regression
`examples/generics/0216-generics-typearg-in-pack-fn-body.sx`. (Part 2 — deferred `..` spread
crashes — reframed OPEN/non-blocking, `issues/0156`.)
- **issue 0157** — a user generic `ufcs` method whose name collides with a stdlib re-export
(`cancel` on `*Task` vs io.sx's `cancel` on `*Future`) resolved via last-wins `fn_ast_map` with
NO receiver filtering → wrong overload → `$R` unbound → LLVM panic. Fix
(`src/ir/lower/call.zig` `selectUfcsGenericByReceiver`): every generic-ufcs dispatch enumerates
ALL module authors (`module_decls`), keeps receiver-binding ones, picks the most
receiver-SPECIFIC (concrete > bare `$T`), dedups re-exports, and flags a genuine 2-specific tie
as a deterministic "ambiguous — qualify" diagnostic (never a silent order-dependent pick).
Regression `examples/generics/0217-generics-ufcs-method-name-collides-stdlib.sx`.
- **Adversarial review (worker) of the 0157 fix + Task layer.** Caught the determinism CRITICAL
(fixed: always-run selection + specificity + ambiguity), `wait`-outside-a-fiber null-deref (fixed:
loud guard in `suspend_self`/`yield_now`), and cancel-doesn't-skip-work (fixed: worker skips
`work()` if already canceled). Lost-wakeup / cancel-after-complete / reap traced safe. Also
simplified `1812` (`**Fiber` shared handle → a `Sh.parked` field; output identical).
- Suite GREEN 751/0 (749 + 1813 + 0217). Next: **B1.4b** (deterministic-sim `Io`).
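Illustrative sketch of the receiver-driven selection described for issue 0157, in Python rather than the compiler's Zig. It is NOT the `selectUfcsGenericByReceiver` code: `Candidate`, `def_id`, `binds`, and `select_by_receiver` are hypothetical names, receivers are modeled as a bare head string (`"Task"`, `"Future"`, or `"$T"`), and treating any tie at the best specificity level as ambiguous is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    def_id: str    # identity of the underlying definition (re-exports share it)
    module: str    # module that authors or re-exports the method
    name: str      # method name, e.g. "cancel"
    receiver: str  # concrete receiver head ("Task") or "$T" for a bare generic


class AmbiguousCall(Exception):
    """Two distinct, equally specific candidates bind the receiver."""


def binds(c: Candidate, receiver_head: str) -> bool:
    # A bare generic receiver binds anything; a concrete one must match.
    return c.receiver == "$T" or c.receiver == receiver_head


def specificity(c: Candidate) -> int:
    # concrete receiver > bare `$T`
    return 0 if c.receiver == "$T" else 1


def select_by_receiver(candidates, name, receiver_head):
    # Enumerate every author, keep receiver-binding ones, dedup re-exports.
    seen = {}
    for c in candidates:
        if c.name == name and binds(c, receiver_head):
            seen.setdefault(c.def_id, c)
    if not seen:
        return None
    best = max(specificity(c) for c in seen.values())
    top = [c for c in seen.values() if specificity(c) == best]
    if len(top) > 1:
        ids = ", ".join(sorted(c.def_id for c in top))
        raise AmbiguousCall(f"ambiguous '{name}': qualify one of {ids}")
    return top[0]


def demo():
    cands = [
        Candidate("io.cancel", "io", "cancel", "Future"),
        Candidate("io.cancel", "std", "cancel", "Future"),  # re-export
        Candidate("sched.cancel", "sched", "cancel", "Task"),
    ]
    # The result does not depend on declaration order.
    forward = select_by_receiver(cands, "cancel", "Task")
    backward = select_by_receiver(list(reversed(cands)), "cancel", "Task")
    return forward.def_id, backward.def_id
```

Because the pick depends only on which candidates bind and how specific they are, reordering imports cannot change which overload is chosen, which is the determinism property the review asked for.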
### Earlier — B1.5a — the M:1 cooperative fiber scheduler CORE — landed + adversarially reviewed
The hand-bootstrapped ping-pong (1807-1810) is now a reusable scheduler API in pure sx:
`library/modules/std/sched.sx` — a generic `Fiber` (`body: Closure() -> void`) + `Scheduler`
with `init`/`spawn`/`yield_now`/`suspend_self`/`wake`/`run` over the proven `swap_context` on
guarded `mmap` stacks. The ONE generic dispatch (`fib_dispatch`, reached from the `_fib_tramp`
trampoline) runs ANY stored closure body on a fresh stack — replacing the fixed `bl _fib_body`.
Reaping `munmap`s the stack + frees the heap `Fiber` on completion; an intrusive FIFO gives
round-robin order.
- **Foundational design de-risked by probe before building:** a fiber can store + call a
`Closure() -> void` on its fresh stack via the generic dispatch; outputs flow OUT through
pointers captured in the closure (capture-by-value does NOT write back — pushed onto the user).
- **Hit + FIXED a blocker compiler bug — issue 0154** (user-authorized in-session fix). `null` /
`---` assigned to a struct field picked up a leaked enclosing `target_type` (the function's
RETURN type, set for the whole body at decl.zig:2691) and built a WHOLE-STRUCT-typed null →
an oversized `zeroinitializer` store through the field's GEP that overran the field's slot and
clobbered the saved x29/x30, so the fn `ret`'d to 0x0. This was EXACTLY the `Scheduler.init()`
by-value-return shape (`sched_ctx: [13]u64` before `current: *Fiber`). Fix: added
`.null_literal, .undef_literal` to the `needs_target` switch in `lowerAssignment`
(`src/ir/lower/stmt.zig`) so the field's type is used. Repro → regression test
`examples/types/0193-types-sret-array-before-pointer.sx`; `issues/0154-*.md` RESOLVED.
- **Adversarial review (worker): asm/bootstrap/lifetime SOUND** (the headline closure-env-lifetime
fear was disproven — envs are heap-promoted, survive the spawn frame). Found **1 CRITICAL** +
robustness gaps, ALL hardened: (CRITICAL) `wake` re-enqueued an already-queued fiber →
FIFO corruption/segfault → now GUARDED on `.suspended` (spurious/double/stale wake = safe
no-op); orphan-suspend leak/deadlock → `n_suspended` accounting + a loud `run()`-drain
diagnostic+abort; `mmap` `MAP_FAILED` (=-1, not null) / `mprotect` / Fiber-OOM → loud bails
(per §8.1.1 the guard is mandatory); the per-fiber closure-env leak (sx exposes no env-free) →
documented as a KNOWN LIMITATION (bounded by spawn count; invisible under the default GPA).
- **Locked two `18xx` examples** (aarch64-macos `.build`-pinned, ir-only on a mismatch):
`1811-concurrency-fiber-scheduler.sx` (3 fibers round-robin via `yield_now` → ordering contract
`sequence: 0 1 2 0 1 2 0 1 2`, all `.done`) + `1812-concurrency-fiber-suspend-wake.sx` (park via
`suspend_self`, resumed by another fiber's `wake`, + the spurious-wake no-op — the CRITICAL-fix
regression → `log: 10 20 21 11` / `suspended-left: 0`).
- **Filed issue 0155 (NON-blocking, NOT fixed)** — found incidentally in the review: indexing a
scalar pointer (`pc[0]`, `pc: *i64`) panics codegen (`.unresolved` reaching LLVM emission). The
scheduler uses array-field indexing + `.*`, never this, so it's filed for its own session.
- Suite GREEN **748/0** (746 base + 1811 + 1812 + 0193 regression). Next: **B1.4a** (FiberIo —
wire `Io.spawn_raw`/`suspend_raw`/`ready` onto the scheduler so `async`/`await` truly suspend).
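A minimal Python model of the scheduler contract summarized in the B1.5a notes (FIFO round-robin, `wake` guarded on the suspended state, `n_suspended` accounting with a loud drain failure). It is a sketch under assumptions, not a port of `sched.sx`: generators stand in for fibers and `swap_context`, the `YIELD`/`SUSPEND` request tokens and the `State` enum are hypothetical, and there are no stacks to map or reap.

```python
from collections import deque
from enum import Enum


class State(Enum):
    READY = 1
    RUNNING = 2
    SUSPENDED = 3
    DONE = 4


YIELD, SUSPEND = "yield", "suspend"  # hypothetical request tokens


class Fiber:
    def __init__(self, gen):
        self.gen = gen
        self.state = State.READY


class Scheduler:
    def __init__(self):
        self.queue = deque()   # FIFO run queue gives round-robin order
        self.n_suspended = 0   # parked fibers not on the queue

    def spawn(self, body):
        fiber = Fiber(body())
        self.queue.append(fiber)
        return fiber

    def wake(self, fiber):
        # Guard: only a suspended fiber may be re-enqueued, so a
        # spurious, double, or stale wake is a safe no-op.
        if fiber.state is not State.SUSPENDED:
            return False
        fiber.state = State.READY
        self.n_suspended -= 1
        self.queue.append(fiber)
        return True

    def run(self):
        while self.queue:
            fiber = self.queue.popleft()
            fiber.state = State.RUNNING
            try:
                request = next(fiber.gen)
            except StopIteration:
                fiber.state = State.DONE  # nothing to free in this model
                continue
            if request == SUSPEND:
                fiber.state = State.SUSPENDED
                self.n_suspended += 1
            else:
                fiber.state = State.READY
                self.queue.append(fiber)
        if self.n_suspended:
            # Drained with fibers still parked: fail loudly, never hang.
            raise RuntimeError(f"deadlock: {self.n_suspended} suspended")


def demo_round_robin():
    log = []
    sched = Scheduler()

    def make_body(i):
        def body():
            for _ in range(3):
                log.append(i)
                yield YIELD
        return body

    for i in range(3):
        sched.spawn(make_body(i))
    sched.run()
    return log  # -> [0, 1, 2, 0, 1, 2, 0, 1, 2]
```

Without the state check in `wake`, a fiber that is already queued would be appended a second time; in the model that only causes a double run, whereas the notes report FIFO corruption in the intrusive-queue implementation.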
### Earlier — B1.3b-1 — the x86_64 / Win64 `swap_context` sibling — VALIDATED on real hardware
The
context switch is now proven on a SECOND architecture + ABI. A Win64 `swap_context` saves the
COMPLETE Win64 callee-saved set — 8 GP (rbx, rbp, rdi, rsi, r12-r15) + rsp **and xmm6-xmm15**
(10 XMM, 128-bit via `movups` — Win64 has callee-saved XMM, unlike SysV/aarch64) — plus a Win64
@@ -178,7 +257,39 @@ body); closed + locked. The review's `.naked`-lambda CRITICAL was a false positi
(unparseable — `isLambda` breaks on the `abi` keyword).
## Current state
**B1.2 COMPLETE.** The full async surface (Io capability on Context + `async`/`await`/`cancel` +
**B1.4a COMPLETE — truly-suspending fiber-task async exists.** `library/modules/std/sched.sx` carries
the M:1 scheduler core (B1.5a) PLUS the async-task layer: `Task($R)` + `Scheduler.go(work) ->
*Task($R)` + `wait`/`cancel`. `s.go(work)` spawns a nullary thunk as a fiber; `t.wait()` suspends
the caller until it completes. Locked by `1813` (`sequence: 1 2 3 42 100 -99` — real interleave +
awaited values + cancel). Two compiler blockers fixed en route (0156 Part 1 — `$R` type-arg in a
pack-fn; 0157 — UFCS generic name collision), both regression-tested (`0216`, `0217`). Adversarially
reviewed; determinism + non-fiber-wait + cancel-skip-work all hardened. The io.sx blocking
`context.io.async` (1805/1806) is untouched and coexists. Suite GREEN 751/0.
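A rough Python model of the `go`/`wait`/`cancel` behavior described above, offered as a sketch and not as the `sched.sx` implementation. Assumptions: generators stand in for fibers, `wait` is written as a generator used with `yield from`, a plain function is accepted as a non-yielding thunk, and `Canceled` is a Python exception where the source raises `.Canceled`. The demo is built to reproduce the ordering contract quoted for `1813`.

```python
import inspect
from collections import deque


class Canceled(Exception):
    pass


class Task:
    def __init__(self, work):
        self.work = work       # nullary thunk; inputs are captured by the caller
        self.done = False
        self.canceled = False
        self.value = None
        self.waiters = []      # fibers parked in wait()

    def cancel(self):
        if not self.done:      # cancel after completion is a no-op
            self.canceled = True

    def wait(self):
        if not self.done:
            yield self         # park the caller until the task completes
        if self.canceled:
            raise Canceled()
        return self.value


class Scheduler:
    def __init__(self):
        self.ready = deque()

    def go(self, work):
        task = Task(work)
        self.ready.append(self._worker(task))
        return task

    def _worker(self, task):
        if not task.canceled:              # skip work if already canceled
            result = task.work()
            if inspect.isgenerator(result):
                result = yield from result  # body may yield cooperatively
            task.value = result
        task.done = True
        self.ready.extend(task.waiters)     # wake everyone parked on the task
        task.waiters.clear()

    def run(self):
        while self.ready:
            fiber = self.ready.popleft()
            try:
                request = next(fiber)
            except StopIteration:
                continue
            if isinstance(request, Task) and not request.done:
                request.waiters.append(fiber)  # suspended, off the queue
            else:
                self.ready.append(fiber)       # plain yield: back of the queue


def demo():
    log = []
    s = Scheduler()

    def work_a():
        log.append(1)
        yield              # A yields mid-body, so B runs before A finishes
        log.append(3)
        return 42

    def work_b():
        log.append(2)
        return 100

    def main():
        a, b, c = s.go(work_a), s.go(work_b), s.go(lambda: 7)
        c.cancel()
        log.append((yield from a.wait()))
        log.append((yield from b.wait()))
        try:
            log.append((yield from c.wait()))
        except Canceled:
            log.append(-99)  # plays the role of the `or -99` fallback

    s.go(main)
    s.run()
    return log  # -> [1, 2, 3, 42, 100, -99]
```

While `main` is parked on `a` it sits in `a.waiters` and not on the ready queue, which is the difference from a blocking `async` that runs the worker inline.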
The remaining B1.4 work: **B1.4b** the deterministic-sim `Io` (virtual clock + timer min-heap,
calibrated against blocking — the KEYSTONE test harness), **B1.4c** the event-loop `Io`
(kqueue/epoll). Then **B1.5** end-to-end M:1 validation under the deterministic `Io`. NOTE: the
suspending async lives as `sched.go`/`wait` (M:1, receiver-driven), NOT routed through the erased
`context.io` (which would force sched.sx into every std consumer + duplicate the `_fib_tramp` global
asm); the `Io` protocol's `spawn_raw`/`suspend_raw`/`ready` remain reserved for the M:N evolution.
### Earlier — B1.5a COMPLETE — the M:1 scheduler CORE exists
`library/modules/std/sched.sx` drives N fibers
(generic `Closure() -> void` bodies) cooperatively over the proven `swap_context`, on guarded
`mmap` stacks: `spawn` / `yield_now` (round-robin) / `suspend_self` + `wake` (off-queue park/resume)
/ `run` (drives to drain, reaps on `.done`). Adversarially reviewed + hardened (wake guarded, loud
mmap/mprotect/OOM/deadlock bails, env-leak documented). Locked by `1811` (round-robin ordering
contract) + `1812` (suspend/wake park-resume + spurious-wake guard). Suite GREEN **748/0**.
The remaining B1.4 work wires this scheduler under the `Io` capability: **B1.4a (FiberIo)** makes
`context.io` route `spawn_raw`/`suspend_raw`/`ready` onto the `Scheduler` so `async`/`await` truly
SUSPEND (today's `CBlockingIo` runs the worker to completion inline); **B1.4b** the deterministic-sim
`Io` (virtual clock + timer queue, calibrated against blocking — the KEYSTONE test harness);
**B1.4c** the event-loop `Io` (kqueue/epoll). Then **B1.5** is the end-to-end M:1 validation under
the deterministic `Io`.
### Earlier — B1.2 COMPLETE
The full async surface (Io capability on Context + `async`/`await`/`cancel` +
blocking `CBlockingIo`) works end-to-end. Master GREEN (732/0), installed `sx` clean. All four
B1.2 surface bugs resolved or deferred:
- **0151 fixed** (`362674f`): generic `$T` through generic-struct / pointer / UFCS-pack params.
@@ -252,13 +363,24 @@ fibers/Io/scheduler code yet. Grounded floor facts:
boundary; a sharper sx diagnostic for it is a candidate polish, not a blocker.
## Next step
**→ B1.4 — `Io` impls / the scheduler.** The switch substrate is proven on TWO arch/ABI pairs
(aarch64 native + x86_64/Win64 on the VM), with the §10.7 stress gate, guarded mmap stacks, and
adversarial review. That's enough to build the scheduler on. B1.4 builds the deterministic-sim
`Io` (calibrated against blocking `Io` before trusting it — §8.1.3), then **B1.5** (M:1 scheduler)
replaces the hand-bootstrapped ping-pong with real `spawn`/`yield`/`resume` over the switch. The
§10.7 gate (1808) + guarded-stack path (1809) + the Win64 sibling (1810) must keep passing as the
switch is wrapped into the scheduler.
**→ B1.4b: the deterministic-sim `Io` (the KEYSTONE test harness).** B1.4a (suspending fiber-task
async, `sched.go`/`wait`) is done. Now build a deterministic `Io` impl: a virtual clock (`now_ms`
returns simulated time), a timer min-heap (`arm_timer` schedules a wake at a sim deadline), and
`poll` advances the clock to the next due timer and wakes its parked fiber. Drive it over the M:1
scheduler so a program using sim-time sleeps/timeouts runs fully deterministically. **Calibrate it
against blocking `Io`** (§8.1.3): the same program under blocking vs deterministic `Io` must produce
the same observable result before the deterministic one is trusted to gate async tests. Lock with an
`18xx` example asserting a program-emitted ORDERING contract (sim-time scheduling), aarch64-pinned
(`.build {"target":"macos"}`). This harness gates B1.5 + Stream B2.
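One possible shape for the virtual clock plus timer min-heap, sketched in Python as a thinking aid and not as the planned sx design. `now_ms`, `arm_timer`, and `poll` are the names from the plan; everything else (`SimIo`, `spawn`, `run`, fibers as generators that yield a sleep duration in ms, the `seq` tie-breaker) is an assumption of the sketch.

```python
import heapq
from collections import deque


class SimIo:
    def __init__(self):
        self.now = 0          # virtual clock, in simulated ms
        self.timers = []      # min-heap of (deadline_ms, seq, fiber)
        self.seq = 0          # tie-breaker: equal deadlines fire in arm order
        self.ready = deque()  # runnable fibers

    def now_ms(self):
        return self.now

    def arm_timer(self, delay_ms, fiber):
        # Schedule a wake for `fiber` at a simulated deadline.
        heapq.heappush(self.timers, (self.now + delay_ms, self.seq, fiber))
        self.seq += 1

    def poll(self):
        # Jump the clock to the next due timer and wake every fiber due
        # at that instant. No wall-clock time is involved.
        if not self.timers:
            return False
        self.now = max(self.now, self.timers[0][0])
        while self.timers and self.timers[0][0] <= self.now:
            _, _, fiber = heapq.heappop(self.timers)
            self.ready.append(fiber)
        return True

    def spawn(self, body):
        self.ready.append(body())

    def run(self):
        # Run ready fibers to their next sleep; when none are ready,
        # advance simulated time. Stops when nothing is ready or armed.
        while self.ready or self.poll():
            while self.ready:
                fiber = self.ready.popleft()
                try:
                    delay_ms = next(fiber)  # the fiber asks to sleep
                except StopIteration:
                    continue
                self.arm_timer(delay_ms, fiber)


def demo():
    io = SimIo()
    log = []

    def sleeper(name, delay_ms):
        def body():
            yield delay_ms
            log.append((name, io.now_ms()))
        return body

    io.spawn(sleeper("slow", 30))
    io.spawn(sleeper("fast", 10))
    io.spawn(sleeper("mid", 20))
    io.run()
    return log  # -> [("fast", 10), ("mid", 20), ("slow", 30)]
```

The wake order depends only on deadlines and arm order, so the run is repeatable; the calibration step would run the same program with real sleeps under the blocking `Io` and compare the observable ordering.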
Then: **B1.4c** event-loop `Io` (kqueue mac / epoll linux — real fd readiness), **B1.5** end-to-end
M:1 validation under the deterministic `Io`. The §10.7 gate (1808) + guarded-stack (1809) + Win64
(1810) + scheduler (1811/1812) + async (1813) must keep passing throughout.
Open design question for B1.4b/c: a deterministic/event-loop `Io` needs a current-`Scheduler`
handle to park/wake. `sched.go`/`wait` thread it via the `Task`; an `Io` impl that wants the same
will likely need an ambient current-scheduler accessor in sched.sx (deferred from B1.4a — the
`Task`-threaded form sufficed). Decide when wiring `arm_timer` → a parked fiber.
**Side thread (optional, low priority): the SysV/Linux x86_64 sibling.** A THIRD switch variant
for `x86_64-linux`: SysV callee-saved = rbx, rbp, r12-r15 + rsp (6 GP + sp; **no** callee-saved
@@ -275,6 +397,37 @@ incomplete); a dedicated effort; lambda workers are the idiom meanwhile.
`call.zig:1229`, io last). Io protocol + materializers + push-inherit are LANDED + reviewed.
## Known issues / capability gaps
- **issue 0157 (OPEN, BLOCKING B1.4a)** — a user-defined generic ufcs method whose NAME collides
with a stdlib re-export (`cancel`, re-exported by `std.sx` from `io.sx` as `ufcs (f: *Future($R))`),
called via UFCS on a different generic struct (`*Task($R)`), leaves `$R` unresolved → `.unresolved`
reaches LLVM emission → panic (`src/backend/llvm/types.zig:196`). Renaming → works; the non-UFCS
call form already diagnoses `cannot infer generic type parameter 'R'`, so the UFCS path skips that
diagnostic. Surfaced by `cancel :: ufcs (t: *Task($R))` in `std/sched.sx`. Minimal repro (no
fibers/closures): `issues/0157-ufcs-generic-method-name-collides-stdlib-unresolved.{md,sx}`.
- **✅ issue 0154 — FIXED** (`null`/`---` to a struct field over-stored a whole-struct null when
the function's return type leaked as `target_type`, corrupting the frame → `ret` to 0x0;
surfaced building `Scheduler.init()`'s by-value return). Fix: `.null_literal`/`.undef_literal`
added to `needs_target` in `lowerAssignment` (`src/ir/lower/stmt.zig`). Regression:
`examples/types/0193`.
- **issue 0155 (OPEN, NON-blocking)** — indexing a scalar pointer (`pc[0]`, `pc: *i64`) panics
codegen (`.unresolved` reaching LLVM emission, `src/backend/llvm/types.zig:196`). Found in the
B1.5a review; the scheduler doesn't use it (array-field index + `.*` only). Filed for its own
session: `issues/0155-scalar-pointer-index-llvm-panic.{md,sx}`.
- **✅ issue 0157 — FIXED** (B1.4a) — a user generic `ufcs` method whose name collides with a
stdlib re-export resolved via last-wins `fn_ast_map` with no receiver filtering → wrong overload →
`$R` unbound → LLVM panic. Fix: `selectUfcsGenericByReceiver` (`src/ir/lower/call.zig`) — most
receiver-specific binding author across ALL module authors, deterministic, ambiguity-diagnosing.
Regression: `examples/generics/0217`.
- **✅ issue 0156 Part 1 — FIXED** (B1.4a) — single-type generic `$R` as a type-arg in a pack-fn
body (`Box($R)`/`size_of(Box($R))`) → `.unresolved` → panic. Fix: `comptime_pack_ref` arm in
`resolveTypeWithBindings`. Regression: `examples/generics/0216`.
- **Part 2 (OPEN, NON-blocking)** — a deferred `..` spread (a comptime pack captured into a
closure, or a tuple `..t` spread) crashes instead of working/diagnosing. The fiber async layer
avoids it by design (nullary thunks), so it's filed for its own session: `issues/0156`.
- **Heap leaks in the fiber runtime (documented limitations, NOT bugs):** `spawn`'s closure env +
`go`'s heap `Task` are never freed (sx exposes no closure-env free; Task ownership is deferred).
Bounded by spawn/go count, invisible under the default GPA. Revisit for a long-running
arena-backed scheduler.
- **✅ issue 0153 — FIXED** (re-exported generic value-failable `($R, !E)` kept its `!` channel:
`inferGenericReturnType` now pins return-type resolution to the fn's defining module).
Regression: `examples/1058`. Was the LAST B1.2 surface blocker.
@@ -476,3 +629,44 @@ incomplete); a dedicated effort; lambda workers are the idiom meanwhile.
The B1.3 context switch is now proven on TWO arch/ABI pairs. Next: **B1.4** (Io impls / M:1
scheduler) on the proven substrate. (Side thread: the SysV/Linux x86_64 sibling, when a Linux
x86_64 host is available.)
- **B1.5a — M:1 scheduler CORE + a fixed blocker bug.** Built `library/modules/std/sched.sx`: a
generic `Fiber`/`Scheduler` over `swap_context` on guarded `mmap` stacks. `spawn` heap-allocs a
fiber, bootstraps its ctx, enqueues it; the ONE generic dispatch (`fib_dispatch` via `_fib_tramp`)
runs ANY stored `Closure() -> void` on a fresh stack (replacing the fixed `bl _fib_body`);
`yield_now` round-robins, `suspend_self`/`wake` park/resume off-queue, `run` drives to drain +
reaps `.done` fibers (`munmap` + free). **De-risked first by probe** (closure-on-fiber + output
via captured pointer). **Hit blocker bug 0154** (user-authorized fix): `null`/`---` to a struct
field over-stored a whole-struct null when the fn return type leaked as `target_type`, corrupting
the frame (`ret` 0x0) — exactly the `Scheduler.init()` by-value-return shape. Fixed in `stmt.zig`
(`needs_target` += `null`/`undef` literals); regression `examples/types/0193`; `0154` RESOLVED.
**Adversarial review:** asm/bootstrap/lifetime sound (env-lifetime fear disproven — heap-promoted);
1 CRITICAL (`wake` re-enqueue → FIFO segfault) + robustness gaps ALL hardened (wake guarded on
`.suspended`, `n_suspended` deadlock diagnostic+abort, loud mmap/mprotect/OOM bails, env-leak
documented). Locked `1811` (round-robin `0 1 2 ×3`) + `1812` (suspend/wake + spurious-wake guard,
`log: 10 20 21 11`). Filed NON-blocking `0155` (scalar-pointer index panics codegen — review
incidental, unused by sched). Suite GREEN **748/0**. Next: **B1.4a** (FiberIo).
- **B1.4a (truly-suspending fiber-task async, nullary-thunk design) — BLOCKED on issue 0157.**
Implemented the async layer SELF-CONTAINED in `library/modules/std/sched.sx` (kept its lone
`#import "modules/std.sx"` to avoid the duplicate-`_fib_tramp` trap): `TaskState`, a LOCAL
`TaskErr :: error { Canceled }` (the re-exported `IoErr` alias is NOT seen through by the
`raise`/failable-type check — verified), `Task($R)`, and `go`/`wait`/`cancel` ufcs. Design is
the validated nullary-thunk (`.sx-tmp/pnullary.sx` → `log: 1 2 3 42 100`): `work` is a
`Closure() -> $R`, user captures inputs at the call site, NO `..args` crosses the fiber boundary
(deliberately sidesteps 0156). `go`+`wait` run correctly; both wake-orderings traced. Wrote the
example `examples/concurrency/1813-concurrency-fiber-async-suspend.sx` (+ `{ "target": "macos" }`
`.build`) but its `cancel` ufcs surfaced a NEW compiler bug — issue **0157**: a user generic
ufcs whose name collides with a stdlib re-export (`cancel` from io.sx) is mis-resolved on UFCS
call over a different generic struct, leaving `$R` unresolved → LLVM panic. Bisected to a minimal
no-fiber repro (name is the sole trigger; non-UFCS form diagnoses correctly). Example NOT seeded
into the corpus (no `.exit` marker) — do NOT regen its goldens until 0157 lands. Per the STOP
rule: filed `issues/0157-*.{md,sx}`, marked state BLOCKED, paused.
- **B1.4a COMPLETE (this session) — suspending fiber-task async + two compiler fixes.** Built the
`Task($R)` + `go`/`wait`/`cancel` layer in `sched.sx` (nullary-thunk design; self-contained to
avoid the `_fib_tramp` duplicate-symbol trap). Locked `1813` (`sequence: 1 2 3 42 100 -99`).
FIXED the two blockers the worker had filed: **0156 Part 1** (`comptime_pack_ref` arm in
`resolveTypeWithBindings`; regression `0216`) and **0157** (receiver-driven UFCS overload
selection `selectUfcsGenericByReceiver`; regression `0217`). Adversarial review of the 0157 fix +
Task layer found a determinism CRITICAL (always-run selection + specificity + ambiguity
diagnostic), a `wait`-outside-fiber null-deref (loud guard), and cancel-not-skipping-work (skip
if pre-canceled) — all fixed. Simplified `1812` (`**Fiber` → `Sh.parked`). 0156 Part 2 reframed
OPEN/non-blocking. Suite GREEN **751/0**. Next: B1.4b (deterministic-sim `Io`, the KEYSTONE).