sx/current/PLAN-FIBERS.md
agra 1b0d640f73 fibers: event-loop Io — real fd readiness via kqueue (B1.4c)
A fiber can block on a file descriptor and the run loop blocks on
kevent until the kernel reports it ready. Reuses the existing
std/net/kqueue.sx bindings. Scheduler gains a lazy kq fd + an
io_waiters list; block_on_fd arms a one-shot EVFILT_READ registration,
records an IoWaiter, and suspends. Run-loop Mode 2: when the ready
queue drains and no timer is pending, block on kq_wait(-1), match each
fired ident to its waiter, evict it, wake the fiber. wake evicts a
pending fd-waiter (cancel_io_waiter_for) so no stale IoWaiter outlives
a reaped fiber.

Adversarial review found two CRITICALs: (1) two fibers on the same fd
share one kqueue registration (macOS EV_ADD replaces), so one is lost
and the loop hangs -- fixed by enforcing one-waiter-per-fd with a loud
abort; (2) an fd-waiter on a never-ready fd 'hangs' -- reclassified as
correct event-loop semantics (a server idling on a socket), with the
misleading orphan-check comment corrected. UAF parity, ident width,
EINTR handling, timer/io precedence all probed safe.

Example: 1816 (pipe roundtrip -- reader blocks, writer writes, reader
wakes via kqueue). macOS only; linux epoll twin deferred. Suite green 754/0.
2026-06-21 19:39:16 +03:00

# PLAN-FIBERS — Stream B1 (fibers + Io + M:1 scheduler)
> **STATUS: 🚧 in progress.** B1.0 (`abi(.naked)`) ✅ · B1.1 (per-fiber `context`) ✅ · B1.2
> (`Io` interface + `async`/`await`/`cancel` over blocking `CBlockingIo`) ✅ · B1.3 (fiber
> runtime: naked `swap_context` + §10.7 stress gate + guarded `mmap` stacks, proven on aarch64
> AND x86_64/Win64) ✅ · **B1.5a (M:1 scheduler CORE — `std/sched.sx`: `spawn`/`yield_now`/
> `suspend_self`/`wake`/`run`) ✅** (fixed blocker 0154) · **B1.4a (suspending fiber-task async —
> `sched.go`/`wait`/`cancel` over `Task($R)`, nullary-thunk) ✅** (adversarially reviewed; fixed
> blockers 0156-Part1 + 0157 en route; locked `1813`).
> **B1.4b (deterministic virtual-time timers — sched.sleep/now_ms/timer-run) ✅** (reviewed; fixed a CRITICAL timer-vs-early-wake UAF; locked 1814/1815).
> **B1.4c (event-loop — real fd readiness via kqueue: `block_on_fd` + run-loop Mode 2) ✅** (reviewed; fixed a CRITICAL same-fd lost-wakeup hang; locked 1816). macOS only — linux epoll twin deferred.
> **→ NOW: B1.5** — end-to-end M:1 validation under the deterministic timers / fd readiness. Detailed progress in [CHECKPOINT-FIBERS.md](CHECKPOINT-FIBERS.md). NOTE: suspending async +
> deterministic timers live as `sched.*` methods (M:1), NOT routed through the erased `context.io` (avoids forcing sched.sx into every std consumer + the `_fib_tramp` dup-symbol
> trap); the `Io` protocol's `spawn_raw`/`suspend_raw`/`ready` stay reserved for M:N. Deferred:
> issue 0150 (`Future(void)`/`timeout`); 0156-Part2 (deferred `..` spread); the `::` callable-param
> feature.
Carved from [PLAN-POST-METATYPE.md](PLAN-POST-METATYPE.md) Stream B (§B1) + the
design-of-record [../design/execution-evolution-roadmap.md](../design/execution-evolution-roadmap.md)
§4 (async), §7 steps 4–9, §8.1 (risks), §10 (testing). Progress in
[CHECKPOINT-FIBERS.md](CHECKPOINT-FIBERS.md). Stream B2 (channels/cancel/stdlib) is a
separate carve ([PLAN-CHANNELS.md], when reached) and depends on this + atomics (✅).
**Goal:** the colorblind, stackful, **pure-sx** async runtime — fibers behind an `Io`
interface, an M:1 scheduler, blocking + deterministic-sim + event-loop `Io` impls. The
**compiler floor is small and net-new**: make `abi(.naked)` actually emit an LLVM `naked`
function (B1.0), and confirm/close the per-fiber `context` root (B1.1). **Everything
else — the context-switch asm, fiber bootstrap, `mmap` stacks, the scheduler, futures,
the `Io` vtables — is ordinary sx library code** (design §4, §4.4). The irreducible FFI
floor: the per-arch asm context-switch (in `.sx`), syscall `extern`s, and `mmap`.
**Cadence (IMPASSIBLE):** no commit both adds a test AND makes it pass (lock-to-bail, then
flip to green); `zig build && zig build test` green after every step; never regen snapshots
while red; scope regens with `-Dname=examples/NNNN-…sx -Dupdate-goldens` + review the diff.
New corpus category: `18xx` concurrency. On an **unrelated** compiler bug → file
`issues/NNNN`, mark this checkpoint BLOCKED, STOP (CLAUDE.md). The in-session
worker-fix override (delegate a blocker to a worker) applies only with explicit user
authorization.
---
## Design (grounded against the tree)
### B1.0 — `abi(.naked)` codegen (the one genuinely net-new compiler piece in B1)
The design doc spells this `callconv(.naked)`; the **real sx surface is `abi(.naked)`**
written in the postfix slot, `name :: (sig) -> Ret abi(.naked) { asm { … }; }` (cf.
`build_options :: () -> BuildOptions abi(.compiler);` in [build.sx:28](../library/modules/build.sx#L28)).
The sx-facing name is **`naked`** throughout (keyword, field `is_naked`, diagnostics) —
matching LLVM's `naked` attribute (the lowering mechanism) and the industry term
(Zig/Rust/GCC/Clang). The ABI variant was renamed `.pure → .naked`: "pure" universally
means *side-effect-free*, the opposite of a register-clobbering context switch.
**Grounding (verified — do not re-derive):**
- The `ABI` enum **already carries `.naked`**: `ABI = enum { default, c, compiler, naked }`
([ast.zig:142](../src/ast.zig#L142)), documented "naked function (inline asm
body), no calling-convention prologue/epilogue." So B1.0 is **NOT** "extend the enum."
- `.naked` is **inert today**: [type_resolver.zig:237](../src/ir/type_resolver.zig#L237)
maps `.compiler, .naked → .default` CC, and `emit_llvm` emits **no LLVM `naked`
attribute**. So the net-new work is exactly: **carry `abi == .naked` into the IR
`Function`, emit LLVM's `naked` attr, and skip the implicit-`Context` / prologue
lowering** so the body is just the asm block + its own `ret`.
- The IR `Function` struct ([inst.zig:605](../src/ir/inst.zig#L605)) carries `call_conv`
(default/c) + `is_compiler_domain`, but **no naked flag** — add one (`is_naked: bool`).
- Attribute API is in-tree: `nounwind` is set at
[emit_llvm.zig:1339](../src/ir/emit_llvm.zig#L1339) via
`LLVMGetEnumAttributeKindForName("nounwind", 8)` → `LLVMCreateEnumAttribute(ctx, id, 0)` →
`LLVMAddAttributeAtIndex(func, func_idx_attr /* -1 */, attr)`. The LLVM `naked` attr
is the same shape: `LLVMGetEnumAttributeKindForName("naked", 5)`.
- The `.c` ABI **already skips the implicit ctx** at lowering — `lam.abi == .c` /
`fd.abi == .c` gates (closure.zig:171, [decl.zig:515](../src/ir/lower/decl.zig#L515)).
`.naked` must skip it **too** (a `.naked` fn gets no synthetic `__sx_ctx`, no stack frame,
no prologue — args arrive in ABI registers and are read directly from asm). The
implicit-return machinery (`lowerValueBody`) must also be bypassed: a `.naked` body has no
sx return (the asm rets itself), so lower its statements and cap the block with
`unreachable`.
- **Inline asm already works end-to-end** (lower→emit→JIT): aarch64
([examples/1645](../examples/1645-platform-asm-aarch64-add.sx)), x86_64
([examples/1651](../examples/1651-platform-asm-x86-syscall-write.sx)), global asm, JIT
([1653](../examples/1653-platform-asm-global-jit.sx)). `emitInlineAsm` /
`LLVMGetInlineAsm` at [ops.zig:915](../src/backend/llvm/ops.zig#L915). The `.naked` body
is a single asm block reusing this path.
**`.naked` ≠ `.c` (design §4.6 context-switch note):** a `.c` epilogue restores SP from the
frame; a context switch deliberately makes SP-in ≠ SP-out, so the `.c` epilogue would
restore from the *wrong* stack. `.naked` = no prologue/epilogue/frame — the asm emits its
own `ret`. This is *why* the switch must be `.naked`, not `.c`.
**Snapshot story (per the atomics precedent):** a `.naked` fn's *body is raw per-arch asm*
(it can't be portable — that's the point), while LLVM's `naked` attribute text is
arch-invariant. **B1.0a** (lock) needs only **one host example** locked to the emit bail —
the bail fires at the function level *before* any asm/instruction selection, so it is
host-independent (no `.build` target pin). **B1.0b** (green) adds emission, pins that
example aarch64 (`.build {"target": "aarch64-macos"}`, end-to-end on a matching host,
ir-only on a mismatch), and adds an x86_64 cross sibling — mirroring the existing asm
corpus split (1645 aarch64 / 1651 x86). The ir-only `.ir` (only producible once emission
lands in B1.0b) asserts the `naked` attribute + the asm body. State loudly: **the `.ir`
proves the `naked` keyword + asm emitted, NOT that any hand-written register save/restore
is correct** — that is the B1.3 switch-stress harness's job, never the corpus's.
### B1.1 — per-fiber `context` root (grounding says this is SMALL, likely library-only)
**Grounding (verified — closes the design doc's open sizing question):**
- `context` is an **implicit `*Context` parameter** (`__sx_ctx`, slot 0), threaded through
every default-conv sx call ([lower.zig:259](../src/ir/lower.zig#L259)) — **not raw TLS**.
Inside a function `current_ctx_ref = Ref.fromIndex(0)` (the param) → it **rides the fiber
stack frame for free**.
- `push Context.{…}` allocates the new `Context` with a **stack `alloca`** and rebinds
`current_ctx_ref` to that slot ([stmt.zig:1263](../src/ir/lower/stmt.zig#L1263)) — "No
global, no walk." So **push frames are fiber-local for free**.
- The **only shared root** is the `__sx_default_context` **global**, bound at
entry-points / `abi(.c)` fns *before any user code runs*
([decl.zig:2667](../src/ir/lower/decl.zig#L2667), :2815).
⇒ The design doc's "lower as swappable indirection, never raw TLS" guards a **non-problem**
(confirmed). The **real, now-sized** B1.1 work is purely a **library convention**: a
freshly-`spawn`ed fiber must take its root `Context` from the **spawner's snapshot** (passed
as the fiber-entry fn's `__sx_ctx` slot-0 arg by the spawn trampoline), **not** the
`__sx_default_context` global. That is sx-side (the trampoline already controls slot 0) —
**expected to be ZERO compiler change.** B1.1's first action is a probe confirming this; if
a fiber genuinely re-reads the global root mid-stack (it should not — entry binds once),
*then* and only then is there a compiler obligation. **Ground the probe before sizing any
compiler work.** Prerequisite of B1.3 (a fiber needs a valid root before it switches).
### B1.2–B1.5 — pure sx over the primitives (design §4)
- **B1.2 (A1):** `Io` interface + `context.io` + `Future` + `cancel()` — a protocol/vtable
threaded exactly like `Allocator` (which already lives at `Context` field 0; see
`allocViaContext` [call.zig:1214](../src/ir/lower/call.zig#L1214)). `Io` becomes another
`Context` field. No compiler change — protocols + context already carry it.
- **B1.3 (A2):** the fiber runtime — naked context-switch asm (per-arch), bootstrap, `mmap`
stacks **with mandatory guard pages**. All sx. **Highest corruption risk in the stream**
(§8.1.1) and **untestable by the deterministic `Io`** (which tests *scheduling*, not the
*switch*). Its **first deliverable, before the scheduler AND the deterministic `Io`**: a
standalone **2-fiber ping-pong switch-stress harness** (§10.7) — scribble every
callee-saved register + a stack canary before each suspend, deep/recursive chains, verify
all survive post-resume. This harness — not B1.4 — is A2's correctness gate.
- **B1.4 (A3):** `Io` impls in order **blocking → deterministic-sim (KEYSTONE) → event-loop**
(kqueue/epoll/io_uring). Build the deterministic `Io` right after blocking; **calibrate it
against blocking `Io`** before trusting it to gate everything async (§8.1.3, §10.7) — a
deterministic-but-wrong scheduler snapshots garbage. (Open, deferred: the event loop does
**not** yet cooperate with a platform UI run loop — CFRunLoop/ALooper; that's a §6
app-target gap, out of B1.)
- **B1.5 (A5·M:1):** the single-thread scheduler — validates the whole colorblind stack
end-to-end. `18xx` corpus runs under the deterministic `Io`, asserting a **program-emitted
ordering contract** (sequence markers), not raw interleaving, so scheduler-policy tweaks
don't churn every snapshot.
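
The event-loop `Io` from B1.4 (one-shot read registration, an `io_waiters` table, and a run loop that blocks in the kernel once the ready queue drains) can be modeled in a few lines. This is a minimal sketch, not the `std/sched.sx` implementation: Python generators stand in for stackful fibers, the standard `selectors` module stands in for the kqueue bindings, a `RuntimeError` stands in for the loud abort, and every name below (`Scheduler`, `_park_on_fd`, `pipe_roundtrip`) is hypothetical.

```python
import os
import selectors
from collections import deque


class Scheduler:
    """Toy M:1 scheduler; generators stand in for stackful fibers."""

    def __init__(self):
        self.ready = deque()
        self.selector = None   # created lazily, like the lazy kq fd
        self.io_waiters = {}   # fd -> parked fiber (at most one per fd)

    def spawn(self, fiber):
        self.ready.append(fiber)

    def _park_on_fd(self, fd, fiber):
        if fd in self.io_waiters:
            # A second registration on the same fd would replace the first
            # and lose a wakeup, so refuse loudly instead of hanging later.
            raise RuntimeError(f"fd {fd} already has a waiter")
        if self.selector is None:
            self.selector = selectors.DefaultSelector()
        self.selector.register(fd, selectors.EVENT_READ)
        self.io_waiters[fd] = fiber

    def run(self):
        while self.ready or self.io_waiters:
            if not self.ready:
                # Nothing runnable: block until the kernel reports readiness.
                for key, _events in self.selector.select(timeout=None):
                    self.selector.unregister(key.fd)  # emulate one-shot
                    self.ready.append(self.io_waiters.pop(key.fd))
                continue
            fiber = self.ready.popleft()
            try:
                request = next(fiber)
            except StopIteration:
                continue                      # fiber finished
            if request is None:
                self.ready.append(fiber)      # plain yield: stay runnable
            else:
                self._park_on_fd(request[1], fiber)
        if self.selector is not None:
            self.selector.close()
            self.selector = None


def pipe_roundtrip():
    """Reader blocks on a pipe, writer writes, reader wakes via the selector."""
    log = []
    r, w = os.pipe()
    sched = Scheduler()

    def reader():
        log.append("reader: blocking")
        yield ("block_on_fd", r)
        log.append("reader: got " + os.read(r, 16).decode())

    def writer():
        log.append("writer: writing")
        os.write(w, b"ping")
        yield

    sched.spawn(reader())
    sched.spawn(writer())
    sched.run()
    os.close(r)
    os.close(w)
    return log
```

Calling `pipe_roundtrip()` in this model walks the same shape as the pipe example: the reader parks, the writer runs, and the reader is woken only after the selector reports the read end ready. Note that a waiter on a never-ready fd keeps `run` blocked, which matches the idle-server semantics rather than indicating a bug.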
### Files the compiler floor touches (B1.0 only; B1.1–B1.5 are library + tests)
B1.0 (`.naked`) forces these plumbing sites:
- [ast.zig:142](../src/ast.zig#L142) — `ABI.naked` (exists; reference only).
- [inst.zig:605](../src/ir/inst.zig#L605) — add `is_naked: bool = false` to `Function`.
- [decl.zig](../src/ir/lower/decl.zig) — set `is_naked` from `fd.abi == .naked`; gate the
implicit-ctx off for `.naked` in `funcWantsImplicitCtx` (mirror the `.c` skip at
decl.zig:515) and bypass `lowerValueBody` for `.naked` bodies (lower statements + cap with
`unreachable`, in both body-lowering paths) — a `.naked` fn binds no ctx and has no sx
return.
- [type_resolver.zig:237](../src/ir/type_resolver.zig#L237) — leave CC `.default` (a `.naked`
fn-pointer type has no CC of its own; nakedness is a decl-level emit attribute).
- [emit_llvm.zig:402](../src/ir/emit_llvm.zig#L402) Pass 2 — **B1.0a:** bail loudly when
`func.is_naked` (build-gating). **B1.0b:** instead emit LLVM's `naked` attr (shape per
`nounwind` at emit_llvm.zig:1339) + the asm-only body (no prologue).
- Any `switch` over `.op` or `Function` fields that the Zig build flags as non-exhaustive: let the build tell you.
---
## Phases (xfail→green steps)
### B1.0 — `abi(.naked)` codegen — ✅ COMPLETE
- **B1.0a (lock) — ✅ DONE.** Carried `abi == .naked` into IR `Function.is_naked`; threaded
through `decl.zig` (`funcWantsImplicitCtx` skips `.naked` like `.c`; all body-lowering paths
bypass `lowerValueBody` for `.naked`, lowering the asm body + capping with `unreachable`) +
generic.zig + pack.zig; `emit_llvm` Pass 2 bailed loudly on `func.is_naked`. Locked by
`examples/1800-concurrency-naked-asm.sx` + the generic regression (review-found gap).
- **B1.0b (green) — ✅ DONE.** `emit_llvm` declaration pass adds LLVM `naked` + `noinline` +
`nounwind` for `func.is_naked` and skips `frame-pointer=all` (incompatible with a frameless
function); Pass 2 emits the body normally (`naked` ⇒ verbatim asm + own `ret`, no
prologue). `1800` pinned aarch64 → exit 42 + `.ir`; `1801-concurrency-naked-generic.sx`
(renamed from `-bail`) proves the generic path emits a naked body (exit 42);
`1802-concurrency-naked-asm-x86.sx` x86_64 cross sibling (ir-only here, `.ir` locks `naked`
+ `movl $42, %eax`). Unit test `emit: abi(.naked) function gets the naked attribute` asserts
`naked` present + `frame-pointer` absent. Suite green (724/0).
- **B1.0c (review-hardening) — ✅ DONE.** A param-bearing `.naked` fn emitted invalid LLVM
(loud verifier error). Gated the param-alloca loop on `fd.abi != .naked` (decl.zig both
paths + generic.zig) so a naked fn's args stay in registers (read by the asm body) — this
*enables* B1.3's `swap_context(from, to)`. Locked by `1803-concurrency-naked-asm-param.sx`.
Pack `.naked` (variadic + naked, nonsensical) left unsupported → loud verifier error.
### B1.1 — per-fiber `context` root — ✅ COMPLETE (zero compiler change)
Probe confirmed the spawn convention works with ordinary language features: snapshot
`context` (`snap := context`), store it in a struct, and `push f.root { entry(args) }` from a
trampoline running under a different ambient context — the body reads the snapshot (via the
implicit slot-0 `*Context` param), not the ambient ctx, and `push` restores ambient on exit.
No path re-reads `__sx_default_context` mid-stack ⇒ **no compiler obligation**; this is a pure
library convention. Locked by `examples/1804-concurrency-context-snapshot.sx` (`fiber root:
42` / `ambient after: 99`). The design doc's "never raw TLS" guarded a non-problem.
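
The spawn convention can be illustrated with an explicit-context model. This is a hedged sketch in Python rather than sx: the context is passed as an ordinary first argument (mirroring the implicit slot-0 parameter), and `Context`, `Fiber`, `spawn`, `trampoline`, and `entry` are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    value: int


class Fiber:
    """Holds the spawner's context snapshot as the fiber's root."""

    def __init__(self, root, entry):
        self.root = root
        self.entry = entry


def spawn(ctx, entry):
    # Snapshot the spawner's context at spawn time (cf. `snap := context`).
    return Fiber(root=ctx, entry=entry)


def trampoline(ambient, fiber, log):
    # The trampoline runs under a different ambient context, but hands the
    # snapshot to the entry function as its first (slot-0) argument.
    fiber.entry(fiber.root, log)
    # The ambient context is untouched once the entry returns.
    log.append(f"ambient after: {ambient.value}")


def entry(ctx, log):
    # The body only ever sees the context it was handed, never a global.
    log.append(f"fiber root: {ctx.value}")


def demo():
    log = []
    fiber = spawn(Context(42), entry)
    trampoline(Context(99), fiber, log)
    return log
```

In this model `demo()` records the fiber reading the spawner's snapshot and the trampoline's ambient value surviving afterwards, which is the behavior the locked example asserts.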
### B1.2 — A1: `Io` interface + `context.io` + `Future` + `cancel()` API
Library-only. `Io` as a protocol added to `Context` (mirror `Allocator`). `Future`/`cancel`
API surface. xfail→green via an `18xx` example exercising the blocking `Io` default (real
suspend lands in B1.3). No compiler change expected; if a protocol-in-context gap appears,
file it.
### B1.3 — A2: fiber runtime (naked switch + bootstrap + guarded `mmap` stacks) — ✅ COMPLETE
- **B1.3a (switch-stress harness FIRST) — ✅** the §10.7 register/canary-survival gate (1807/1808),
validity proven by negative controls, adversarially reviewed.
- **B1.3b — ✅** fiber bootstrap + guarded `mmap` stacks (1809); the x86_64 sibling landed as Win64
on a real VM (1810, `0 0 P`). Switch proven on TWO arch/ABI pairs.
### B1.5a — M:1 scheduler CORE (`std/sched.sx`) — ✅ COMPLETE
The reusable scheduler wrapping `swap_context`: generic `Fiber`/`Scheduler`,
`spawn`/`yield_now`/`suspend_self`/`wake`/`run` over guarded `mmap` stacks, one generic
`fib_dispatch` running any stored closure body. Adversarially reviewed + hardened; fixed blocker
bug 0154 (struct-field `null`/`---` over-store) en route. Locked by `1811` (round-robin) + `1812`
(suspend/wake). Built BEFORE the deterministic `Io` because FiberIo (B1.4a) needs it as substrate.
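
The core operations can be modeled as follows. This is a minimal sketch of the semantics only, assuming Python generators in place of `swap_context` and real stacks; the operation names mirror `spawn`/`yield_now`/`suspend_self`/`wake`/`run`, but the string-valued `YIELD`/`SUSPEND` protocol and the returned orphan count are assumptions of the sketch, not the `std/sched.sx` API.

```python
from collections import deque

YIELD, SUSPEND = "yield", "suspend"


class Scheduler:
    """Toy M:1 scheduler: one ready queue, one set of parked fibers."""

    def __init__(self):
        self.ready = deque()
        self.parked = set()

    def spawn(self, body):
        fiber = body(self)          # generator object stands in for a fiber
        self.ready.append(fiber)
        return fiber

    def wake(self, fiber):
        # Waking a fiber that is not parked is a no-op in this sketch.
        if fiber in self.parked:
            self.parked.discard(fiber)
            self.ready.append(fiber)

    def run(self):
        while self.ready:
            fiber = self.ready.popleft()
            try:
                op = next(fiber)
            except StopIteration:
                continue                    # fiber ran to completion
            if op == YIELD:
                self.ready.append(fiber)    # yield_now: back of the queue
            elif op == SUSPEND:
                self.parked.add(fiber)      # suspend_self: wait for wake
        return len(self.parked)             # fibers nobody ever woke


def demo():
    log = []
    sched = Scheduler()
    handles = {}

    def sleeper(s):
        log.append("sleeper: parking")
        yield SUSPEND
        log.append("sleeper: woken")

    def waker(s):
        log.append("waker: step 1")
        yield YIELD
        s.wake(handles["sleeper"])
        log.append("waker: woke sleeper")

    handles["sleeper"] = sched.spawn(sleeper)
    sched.spawn(waker)
    orphans = sched.run()
    return log, orphans
```

The design choice worth noting is that `wake` only moves a fiber to the ready queue; it never runs it inline, so the waker finishes its own step before the woken fiber resumes.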
### B1.4a — suspending fiber-task async (`sched.go`/`wait`/`cancel`) — ✅ COMPLETE
`Task($R)` + `Scheduler.go(work) -> *Task($R)` + `wait`/`cancel` in `sched.sx` (nullary-thunk;
self-contained). `go` spawns `work` as a fiber, `wait` parks the caller until it completes. Locked
by `1813`. Two compiler blockers fixed (0156-Part1, 0157) + adversarially reviewed/hardened.
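
The `go`/`wait` relationship can be sketched in the same generator model. This is an illustration under stated assumptions, not the `sched.sx` code: `Task` here is an untyped stand-in for `Task($R)`, a task has at most one waiter, `cancel` is not modeled, and the `wait` helper is a hypothetical generator that callers drive with `yield from`.

```python
from collections import deque


class Task:
    def __init__(self):
        self.done = False
        self.result = None
        self.waiter = None      # at most one parked fiber in this sketch


def wait(task):
    """Park the calling fiber until the task completes, then return its result."""
    if not task.done:
        yield task              # the scheduler records us as the waiter
    return task.result


class Scheduler:
    def __init__(self):
        self.ready = deque()

    def spawn(self, fiber):
        self.ready.append(fiber)

    def go(self, work):
        task = Task()

        def runner():
            task.result = work()                 # `work` is a nullary thunk
            task.done = True
            if task.waiter is not None:
                self.ready.append(task.waiter)   # wake the parked waiter
                task.waiter = None
            yield   # final cooperative point; also makes this a generator

        self.spawn(runner())
        return task

    def run(self):
        while self.ready:
            fiber = self.ready.popleft()
            try:
                request = next(fiber)
            except StopIteration:
                continue
            if isinstance(request, Task) and not request.done:
                request.waiter = fiber           # park until completion
            else:
                self.ready.append(fiber)


def demo():
    log = []
    sched = Scheduler()

    def main():
        task = sched.go(lambda: 6 * 7)
        log.append("main: waiting")
        value = yield from wait(task)
        log.append(f"main: got {value}")

    sched.spawn(main())
    sched.run()
    return log
```

Here `main` parks on the task before the worker has run, and is re-queued by the worker on completion, so the result is read only after `done` is set.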
### B1.4b/c — A3: `Io` impls (deterministic-sim KEYSTONE → event-loop)
Blocking exists (io.sx `CBlockingIo`). Next the deterministic-sim `Io`, **calibrated against
blocking** before any `18xx` test trusts it; then the event loop. The deterministic `Io` is the
test harness for *all* of B1.5 + Stream B2.
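
Deterministic virtual time can be sketched as a timer heap plus a clock that only advances when nothing is runnable. This is a hedged model rather than the `sched.sleep`/`now_ms` implementation: generators stand in for fibers, a fiber sleeps by yielding a millisecond count, and the FIFO tie-break on equal deadlines is an assumption of the sketch.

```python
import heapq
from collections import deque


class Scheduler:
    def __init__(self):
        self.ready = deque()
        self.timers = []    # min-heap of (deadline_ms, seq, fiber)
        self.now_ms = 0     # virtual clock; never reads the wall clock
        self._seq = 0       # tie-breaker: equal deadlines fire in FIFO order

    def spawn(self, fiber):
        self.ready.append(fiber)

    def run(self):
        while self.ready or self.timers:
            if not self.ready:
                # Nothing runnable: jump the clock to the earliest deadline.
                deadline, _, fiber = heapq.heappop(self.timers)
                self.now_ms = max(self.now_ms, deadline)
                self.ready.append(fiber)
                continue
            fiber = self.ready.popleft()
            try:
                sleep_ms = next(fiber)
            except StopIteration:
                continue
            if sleep_ms is None:
                self.ready.append(fiber)         # plain yield
            else:
                self._seq += 1
                deadline = self.now_ms + sleep_ms
                heapq.heappush(self.timers, (deadline, self._seq, fiber))


def demo():
    log = []
    sched = Scheduler()

    def sleeper(name, ms):
        yield ms                                 # sleep(ms) in virtual time
        log.append((sched.now_ms, name))

    sched.spawn(sleeper("slow", 50))
    sched.spawn(sleeper("fast", 10))
    sched.run()
    return log
```

Because the clock jumps rather than waits, the run completes immediately and produces the same log on every machine, which is what makes such a scheduler usable as a snapshot harness once it has been calibrated against the blocking `Io`.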
### B1.5 — A5: M:1 scheduler
End-to-end validation of the colorblind stack. `18xx` corpus under the deterministic `Io`,
asserting program-emitted ordering contracts.
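
One way to picture an ordering contract is as a set of happens-before pairs checked against the markers a program emits, so that two scheduler policies producing different interleavings both pass as long as the pairs hold. The sketch below is an assumption about how such a check could look, not the corpus harness: the marker strings, `CONTRACT`, and `satisfies_contract` are all hypothetical, and markers are assumed unique within a run.

```python
def satisfies_contract(markers, happens_before):
    """True if every (earlier, later) pair appears in that order in markers."""
    position = {marker: index for index, marker in enumerate(markers)}
    return all(
        a in position and b in position and position[a] < position[b]
        for a, b in happens_before
    )


# The contract names only the orderings the program promises.
CONTRACT = [
    ("producer:send", "consumer:recv"),
    ("consumer:recv", "consumer:done"),
]

# Two interleavings that differ only in where an unrelated fiber runs.
round_robin = ["producer:send", "logger:tick", "consumer:recv", "consumer:done"]
other_policy = ["logger:tick", "producer:send", "consumer:recv", "consumer:done"]

# A genuine violation: the consumer observed data before it was sent.
broken = ["consumer:recv", "producer:send", "consumer:done"]
```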
---
## Gates
- **B1.0:** unit `emit_llvm.test.zig` (the `naked` attr present on a `.naked` fn); two
arch-gated examples (aarch64 + x86_64) run end-to-end on a matching host, ir-only on a
mismatch (assert `naked` + asm in `.ir`). **OUT of corpus scope, stated loudly:** the
*correctness* of any hand-written register save/restore — that's the B1.3 stress harness.
- **B1.1:** an `18xx` example locking context-carried-by-slot-0 behavior + a checkpoint note
on the spawn-trampoline convention.
- **B1.3:** the **switch-stress harness is A2's gate** (register/canary survival — §10.7),
NOT a run/snapshot test; plus arch-gated run tests.
- **B1.4:** deterministic `Io` **calibrated** against blocking `Io` (§8.1.3) before trusting
it; `18xx` under the deterministic `Io`.
- **B1.5:** `18xx` ordering-contract snapshots under the deterministic `Io`.
## Kickoff prompt (B1.0b — paste into a fresh session)
> Implement Stream B1 step **B1.0b** (`abi(.naked)` real emission) per
> `current/PLAN-FIBERS.md`. Verify `zig build && zig build test` is green first (B1.0a is
> already landed: `Function.is_naked` plumbed, `decl.zig` skips ctx + bypasses implicit-return
> for `.naked`, `emit_llvm` Pass 2 bails loudly, `examples/1800-concurrency-naked-asm.sx`
> locked to the bail). Then: (1) in `src/ir/emit_llvm.zig` Pass 2 (~line 402), REPLACE the
> `func.is_naked` bail with real emission — set LLVM's `naked` attribute on the function
> (`LLVMGetEnumAttributeKindForName("naked", 5)` → `LLVMCreateEnumAttribute(ctx, id, 0)` →
> `LLVMAddAttributeAtIndex(llvm_func, -1, attr)`; shape per the `nounwind` set at
> emit_llvm.zig:1339) and emit the `.naked` body as its asm block only, no prologue/epilogue
> (the body already lowers to the inline-asm op + an `unreachable` terminator). (2) Pin
> `examples/1800-concurrency-naked-asm.sx` aarch64 with a `.build` sidecar
> `{"target":"aarch64-macos"}`; on this aarch64 host it runs end-to-end (exit 42), capture
> `.ir` + regen (`-Dname=examples/1800-concurrency-naked-asm.sx -Dupdate-goldens`), review the
> diff (assert the `.ir` shows the `naked` attr + `mov x0, #42` / `ret`, NO stray error
> text). (3) Add `examples/1802-concurrency-naked-asm-x86.sx` (x86_64 body, `.build
> {"target":"x86_64-linux"}`, ir-only on this host — requires its `.ir`, now producible).
> (4) Add a unit test in `src/ir/emit_llvm.test.zig` asserting the `naked` attribute is
> present on an `abi(.naked)` function. Confirm `zig build test` green, commit. NOTE: the
> `.ir` proves the keyword + asm emitted, NOT register-save correctness (that's the B1.3
> switch-stress harness). If you hit an UNRELATED compiler bug, file `issues/NNNN`, mark
> `CHECKPOINT-FIBERS.md` BLOCKED, and STOP.