Port library/modules/std/sched.sx to run on aarch64-linux alongside aarch64-macOS, validated byte-identical on both via Apple `container`.

Per-OS bits are comptime-branched:

- MAP_AP (mmap MAP_ANON flag): linux 0x22 / macOS 0x1002.
- fd-readiness backend: epoll on linux, kqueue on darwin (epoll import scoped to the linux branch). block_on_fd, the run-loop Mode-2 drain, and cancel_io_waiter_for each branch; the epoll paths EPOLL_CTL_DEL on fire and on early-wake (EPOLLONESHOT only disables a registration; kqueue EV_ONESHOT auto-removes it).
- first-entry trampoline: a per-OS hand-written global-asm symbol becomes a naked sx fn fib_tramp (mov x0,x19; br x20) + register-indirect dispatch (spawn presets regs[1] == x20 == &fib_dispatch), dropping the per-OS .global symbol entirely.

Fixes issue 0193 Bug A: the trampoline redesign bus-errored on the go/wait/sleep capstone (1817) because fib_dispatch was not pinned to the C ABI. Without an explicit ABI, fib_dispatch uses sx's internal calling convention (x0 = implicit context, first arg self shifted to x1) while the trampoline hands self over in x0 (C-ABI); on first entry the body runs (x1 happens to alias self) but the closure then loads regs[1] == &fib_dispatch as its first capture and re-invokes fib_dispatch forever -> stack overflow -> bus error. Annotating fib_dispatch `abi(.c)` pins it to the C-ABI (self in x0), matching the trampoline. `abi(.c)` rather than `export` because the fn is reached only by address through the trampoline, never by an external name -- so it needs the convention, not a public symbol (it stays a local symbol). Root cause found via lldb on an AOT build; confirmed against the compiler source.

Bug B (a top-level asm block wrapped in inline-if is dropped during the comptime-conditional flatten) is carved out to issue 0194 (OPEN) -- no live trigger remains, since the naked-fn trampoline sidesteps it.

1811/1814/1816/1817 run byte-identical on the aarch64-macOS host and in an aarch64-linux container; full suite green (817/0). Documents the fiber runtime in readme.md.
Issue 0193 — linux fiber-runtime port (sched.sx) + a wrapped top-level asm drop
RESOLVED — port landed on aarch64-linux.
Bug A (register-indirect trampoline bus-errors on 1817): FIXED. Root cause found via lldb on an AOT macOS build (the bug reproduced on macOS too, so no container needed): the WIP port had left `fib_dispatch` with no explicit ABI annotation (the original pinned the C-ABI via `export "fib_dispatch"`, which the redesign dropped). Without a C-ABI pin the fn uses sx's INTERNAL calling convention, which reserves x0 for the implicit `context` pointer and shifts the first real arg `self` to x1, but the trampoline (`mov x0, x19; br x20`) hands the fiber over in x0, C-ABI style. On first entry x1 coincidentally aliases `&fiber.ctx == self` (left there by the scheduler's prior `swap_context(from, to)`, x1 = `to`), so the body runs once; but inside it the closure loads `[Fiber+8] == ctx.regs[1] == &fib_dispatch` as its "first capture" and re-invokes `fib_dispatch` forever → stack overflow → bus error.

Fix: annotate `fib_dispatch` `abi(.c)` so it keeps the C-ABI (`self` in x0), matching what the trampoline supplies. This is a one-line library change, with no compiler change. `abi(.c)` is used rather than `export "fib_dispatch"` because the fn is reached only by address through the trampoline (`xx fib_dispatch`), never by an external name, so it needs the convention, not a public symbol (it stays a local symbol). The register-indirect naked-fn trampoline design is kept (it sidesteps Bug B's hand-written per-OS global-asm symbol). Adversarially reviewed against the compiler source (`src/ir/lower/decl.zig` `funcWantsImplicitCtx` / `wants_ctx` / `CallingConvention.c`); root cause + fix confirmed CORRECT.

Validation: 1811 / 1814 / 1816 / 1817 (the go/wait/sleep capstone) all run byte-identical on the aarch64-macOS host AND in an aarch64-linux Apple `container` (`sum: 123`, completion order `2@10 3@20 1@30`, etc.). Full `zig build test` macOS suite GREEN (817/0).

Bug B (wrapped top-level `asm` dropped): carved out to `issues/0194-wrapped-toplevel-asm-dropped.md` as an OPEN compiler bug. It is no longer triggered anywhere in the tree (the port no longer uses a wrapped global-asm block), so it does not block anything, but it is a real defect and stays filed.

Original writeup below for history.
Status: (historical — see RESOLVED banner above). Two intertwined items uncovered while porting
library/modules/std/sched.sx (the M:1 fiber runtime) to aarch64-linux.
The epoll bindings + std.event.Loop epoll backend are already committed (cc137002) and
runtime-validated on real Linux via Apple container (see the event.sx VALIDATION note / the
apple-container-linux-testing memory). This issue is only about the fiber scheduler port.
What WORKS (validated on aarch64-linux in an Apple container)
With the stashed sched.sx port, built --target aarch64-linux --self-contained and run in an
alpine container:
- 1811 (scheduler round-robin via `yield_now`): `sequence: 0 1 2 0 1 2 0 1 2`, all done. ✓
- 1816 (`block_on_fd` over a pipe, i.e. the epoll fd path): `log: wrote read 3 [97 98 99]`, `n_suspended: 0`, identical to macOS kqueue. ✓
- macOS (kqueue) stays green for both.
The port (all in sched.sx) is: `MAP_AP` 0x1002 (macOS) → 0x22 (linux); an `inline if OS == .linux { ep :: #import "modules/std/net/epoll.sx" }`; and `inline if OS == { case .linux: <epoll> case .macos: <kqueue> }`
branches in `block_on_fd` (open + `EPOLLIN|EPOLLONESHOT` register), the run-loop Mode-2
(`epoll_wait` + `EPOLL_CTL_DEL`-on-fire for one-shot parity), and `cancel_io_waiter_for`
(`EPOLL_CTL_DEL`-on-early-wake). Those epoll branches are correct (1816 proves it).
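The one-shot parity point can be illustrated outside sx. The following is a minimal Python sketch, not the sched.sx code: `OneShotWaiter` and `wait_readable` are hypothetical names, and the standard `select` module stands in for the sx epoll/kqueue bindings. It shows why the epoll branch needs an explicit delete (Python's `unregister`, i.e. `EPOLL_CTL_DEL`) where the kqueue branch does not.

```python
import os
import select


class OneShotWaiter:
    """Waits for one readiness event per call on a long-lived poller."""

    def __init__(self):
        if hasattr(select, "epoll"):        # Linux
            self.backend = "epoll"
            self.poller = select.epoll()
        elif hasattr(select, "kqueue"):     # macOS / BSD
            self.backend = "kqueue"
            self.poller = select.kqueue()
        else:
            raise NotImplementedError("needs epoll or kqueue")

    def wait_readable(self, fd, timeout=1.0):
        """Return True if fd became readable within `timeout` seconds."""
        if self.backend == "epoll":
            self.poller.register(fd, select.EPOLLIN | select.EPOLLONESHOT)
            try:
                events = self.poller.poll(timeout)
            finally:
                # EPOLLONESHOT only disarms the entry after it fires; the
                # registration stays. Delete it (EPOLL_CTL_DEL) on fire and
                # on timeout so the next register() does not hit EEXIST.
                self.poller.unregister(fd)
            return len(events) > 0
        change = select.kevent(
            fd,
            filter=select.KQ_FILTER_READ,
            flags=select.KQ_EV_ADD | select.KQ_EV_ONESHOT,
        )
        # EV_ONESHOT removes the filter once the event has been delivered.
        events = self.poller.control([change], 1, timeout)
        return len(events) > 0


if __name__ == "__main__":
    r, w = os.pipe()
    waiter = OneShotWaiter()
    os.write(w, b"abc")
    print(waiter.wait_readable(r), os.read(r, 3))
```

Waiting on the same fd a second time with the same poller works on both backends only because the epoll path removed its registration after the first wait.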
Bug A — register-indirect trampoline bus-errors on the go/wait/sleep capstone (1817)
To get the fiber trampoline onto linux without a per-OS hand-written global-asm symbol
(`_fib_tramp` vs `fib_tramp`), the stash replaces the global asm trampoline with a **naked sx fn
+ register-indirect branch**:
`spawn` presets `regs[1]` (x20) = `xx fib_dispatch`, and `fib_tramp :: () abi(.naked) { asm { mov x0, x19 ; br x20 } }` tail-branches to dispatch. Its own symbol is auto-emitted per-OS, so no `.global` / `bl <name>` literal is needed.
This works for 1811 + 1816 (both run on linux AND macOS) but bus-errors immediately on 1817
(go/wait/sleep) on BOTH macOS and linux — Bus error, no output, a short recursive-looking
stack trace. HEAD's 1817 (committed global-asm trampoline) works (sum: 123), so the redesign is
the regression. Root cause not yet found: 1811/1816 use the same spawn/tramp path; the only thing
1817 adds is timer sleep + Task go/wait (suspend/resume). Suspect something about the
naked-fn tramp or x20 liveness specific to the Task-closure / resume path — needs a debugger on the
container build.
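With the root cause from the RESOLVED banner in hand, the first-entry mismatch can be modeled with a toy register file. This is a hedged Python sketch, not sx or the real runtime: `trampoline`, `first_arg`, and `enter_fiber` are hypothetical names, registers are dictionary entries, and the model covers only which register the callee reads `self` from (it does not model the closure-capture recursion that follows).

```python
def trampoline(regs):
    """Model of `mov x0, x19 ; br x20`: put self in x0, branch to x20."""
    regs["x0"] = regs["x19"]
    return regs["x20"]  # branch target


def first_arg(regs, abi):
    """Register a callee reads its first declared argument from."""
    # C ABI: first argument in x0.
    # sx-internal convention (as described in this issue): x0 holds the
    # implicit context, so the first declared argument shifts to x1.
    return regs["x0"] if abi == "c" else regs["x1"]


def enter_fiber(self_ptr, stale_x1, abi):
    """First entry: spawn preset x19/x20; x1 holds whatever was left there."""
    regs = {"x19": self_ptr, "x20": "fib_dispatch", "x0": None, "x1": stale_x1}
    target = trampoline(regs)
    return target, first_arg(regs, abi)


if __name__ == "__main__":
    # C ABI: self is always what the trampoline passed.
    print(enter_fiber(0x1000, 0xDEAD, "c"))
    # Internal convention: correct only when x1 happens to alias self.
    print(enter_fiber(0x1000, 0x1000, "internal"))
    print(enter_fiber(0x1000, 0xDEAD, "internal"))
```

The second call is the case that let 1811 and 1816 pass: the callee reads the wrong register, but that register happens to hold the right value.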
Bug B — a top-level asm block wrapped in an inline if is DROPPED (in sched.sx's context)
The redesign in Bug A was forced by this: wrapping the original global asm trampoline in
inline if OS == { case .linux: asm{…fib_tramp…} case .macos: asm{…_fib_tramp…} } (or the plain
inline if OS == .linux { asm } else { asm } form) makes the asm not emit at all — nm shows
fib_tramp as U (undefined), both platforms fail to link. A PLAIN unwrapped asm{} emits fine.
NOT reproducible in isolation: minimal/medium repros (top-level asm in a case; two case blocks;
case asm in an imported module; naked fn + case asm with bl to an exported fn; a one-sided
inline if .linux { #import } before it) ALL emit + link correctly. Only sched.sx (the full module)
drops it. So there's a real flatten/lowering interaction in src/imports.zig
flattenComptimeConditionals / appendBranchDecls (the comptime-conditional pre-pass that surfaces
top-level decls from a taken if_expr/match_expr arm) with a top-level asm node, triggered by
something else in sched.sx — not yet isolated.
Two paths to resolve (either suffices)
- Path A (compiler): fix Bug B, i.e. make a top-level `asm` block survive `inline if` / `case` flattening in all module contexts. Then the original global-asm trampoline can be OS-branched with the `case` form directly (no tramp redesign), sidestepping Bug A entirely. This is what the user asked for ("case form to emit top-level asm block"). Start: instrument `flattenComptimeConditionals` to dump the surfaced top-level decls for sched.sx and see where the `asm` node is lost.
- Path B (library): fix Bug A, i.e. debug the register-indirect tramp's 1817 bus error (gdb/lldb in the container on the aarch64-linux build, or a reduced go/wait/sleep repro). No compiler change.
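The instrumentation suggested in Path A amounts to an invariant check: every `asm` node in a taken arm should still be present after flattening. The sketch below is a toy Python model under stated assumptions, not the compiler's Zig code: the dictionary-based decl shape and the names `flatten`, `expected_asm`, and `dump_and_check` are all hypothetical, and the real pre-pass may be structured quite differently.

```python
def flatten(decls, os_name):
    """Surface the decls of the taken arm of each comptime conditional."""
    out = []
    for d in decls:
        if d["kind"] == "inline_if":
            out.extend(flatten(d["arms"].get(os_name, []), os_name))
        else:
            out.append(d)
    return out


def expected_asm(decls, os_name):
    """Count asm nodes that should survive: top-level plus taken arms."""
    total = 0
    for d in decls:
        if d["kind"] == "asm":
            total += 1
        elif d["kind"] == "inline_if":
            total += expected_asm(d["arms"].get(os_name, []), os_name)
    return total


def dump_and_check(decls, os_name):
    """Return the surfaced decl kinds and whether no asm node was lost."""
    kinds = [d["kind"] for d in flatten(decls, os_name)]
    return kinds, kinds.count("asm") == expected_asm(decls, os_name)


if __name__ == "__main__":
    module = [
        {"kind": "import"},
        {"kind": "inline_if",
         "arms": {"linux": [{"kind": "asm"}], "macos": [{"kind": "asm"}]}},
        {"kind": "fn"},
    ]
    print(dump_and_check(module, "linux"))
```

A check of this shape run over sched.sx would show whether the `asm` node disappears inside the flatten itself or in a later stage.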
Verification
`git stash pop`; then per-example: `sx build --target aarch64-linux --self-contained -o /tmp/x examples/concurrency/<ex>.sx` and `container run --rm -v "$PWD/.sx-tmp:/work" alpine /work/x`
(see apple-container-linux-testing). Target: 1811/1814/1816/1817 all green on linux AND macOS,
plus the full `zig build test` macOS suite.