feat: linux epoll backend for std.event.Loop (the kqueue twin)

Add library/modules/std/net/epoll.sx — raw epoll bindings, the linux twin of
std/net/kqueue.sx — and branch std.event.Loop on `inline if OS` so the
OS-neutral readiness Loop runs on linux (epoll) as well as darwin (kqueue);
callers never see the backend.

sx has no packed-struct primitive, so epoll_event is modelled as an
arch-branched struct of u32 fields: { events, data_lo, data_hi } → 12 bytes on
x86_64 (matching __attribute__((packed))), and { events, pad, data_lo, data_hi }
→ 16 bytes on aarch64. Every field is 4-aligned, so the layout is byte-exact for
the kernel ABI with no packed attribute and no unaligned access. The fd is
stashed in data_lo (epoll echoes one data word, not the fd separately).
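
The layout claim above can be sanity-checked without a linux box. The following is an illustrative sketch only, not part of this commit: it uses Python's `struct` module to spell out the two kernel layouts explicitly (u32 + packed u64 for x86_64, u32 + 4 pad bytes + u64 for aarch64) and compares them with the all-u32 models. The names `KERNEL_*`, `MODEL_*` and `fd_via_model` are hypothetical, chosen for this example.

```python
import struct

EPOLLIN = 0x001

# Kernel-side layouts, written out explicitly ("<" = little-endian, no implicit padding).
KERNEL_X86_64 = struct.Struct("<IQ")     # events@0, data@4  (packed)
KERNEL_AARCH64 = struct.Struct("<I4xQ")  # events@0, 4 pad bytes, data@8

# The u32-only models described in the commit message.
MODEL_X86_64 = struct.Struct("<III")     # events, data_lo, data_hi
MODEL_AARCH64 = struct.Struct("<IIII")   # events, pad, data_lo, data_hi


def fd_via_model(raw: bytes) -> int:
    """Read the fd back through the u32 model that matches the buffer size."""
    if len(raw) == MODEL_X86_64.size:
        _events, data_lo, _data_hi = MODEL_X86_64.unpack(raw)
    else:
        _events, _pad, data_lo, _data_hi = MODEL_AARCH64.unpack(raw)
    return data_lo


# Build one event per layout the way the kernel would, with fd 7 in the data word.
raw_x86 = KERNEL_X86_64.pack(EPOLLIN, 7)
raw_arm = KERNEL_AARCH64.pack(EPOLLIN, 7)

sizes_match = (
    MODEL_X86_64.size == KERNEL_X86_64.size
    and MODEL_AARCH64.size == KERNEL_AARCH64.size
)
```

Because both targets are little-endian, the low half of the 64-bit data word is the first u32 after any padding, which is why `data_lo` recovers the fd in both layouts.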

epoll.sx is self-contained (libc only, no build.sx): the `inline if ARCH`
selecting the struct is resolved by the compiler's flatten pre-pass, so the
module's IR stays small. The epoll backend is imported INSIDE event.sx's
`inline if OS == .linux` branch (not top level): event.sx rides the std.sx
barrel, so a top-level import would register epoll's types into every std
program's type table on darwin and drift every .ir snapshot.

The epoll Loop keeps a small per-fd registration table (combined EPOLLIN/OUT
mask via EPOLL_CTL_ADD/MOD/DEL), maps the fd back to the caller's udata, arms
EPOLLRDHUP so a peer half-close surfaces as Event.eof (matching kqueue EV_EOF),
and uses an eventfd as the cross-thread wake channel (kqueue's EVFILT_USER).
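
As a rough model of the registration table described above, the sketch below records which epoll_ctl operation each interest change would produce. It is an assumption-laden illustration, not the sx implementation: `RegTable` is a hypothetical name, it uses a dict where the Loop uses a linear-scan list, and it appends to `ops` in place of real `epoll_ctl` calls.

```python
EPOLLIN, EPOLLOUT, EPOLLRDHUP = 0x001, 0x004, 0x2000
EPOLL_CTL_ADD, EPOLL_CTL_DEL, EPOLL_CTL_MOD = 1, 2, 3


class RegTable:
    """Per-fd combined interest mask plus the caller's udata (hypothetical model)."""

    def __init__(self):
        self.regs = {}  # fd -> (mask, udata)
        self.ops = []   # recorded (op, fd, mask) tuples instead of real syscalls

    def apply_mask(self, fd, mask, udata):
        if mask == 0:
            # No interest left: forget the fd and issue a DEL (if it was known).
            if fd in self.regs:
                del self.regs[fd]
                self.ops.append((EPOLL_CTL_DEL, fd, 0))
            return
        # First registration is an ADD; any later change is a MOD with the full mask.
        op = EPOLL_CTL_MOD if fd in self.regs else EPOLL_CTL_ADD
        self.regs[fd] = (mask, udata)
        self.ops.append((op, fd, mask))

    def _mask(self, fd):
        return self.regs.get(fd, (0, 0))[0]

    def add_read(self, fd, udata):
        # Read interest also arms EPOLLRDHUP so a peer half-close is reported.
        self.apply_mask(fd, self._mask(fd) | EPOLLIN | EPOLLRDHUP, udata)

    def add_write(self, fd, udata):
        self.apply_mask(fd, self._mask(fd) | EPOLLOUT, udata)

    def del_read(self, fd):
        if fd in self.regs:
            mask, udata = self.regs[fd]
            self.apply_mask(fd, mask & ~(EPOLLIN | EPOLLRDHUP), udata)

    def del_write(self, fd):
        if fd in self.regs:
            mask, udata = self.regs[fd]
            self.apply_mask(fd, mask & ~EPOLLOUT, udata)


# Typical connection lifetime: read for the whole life, write only while pending.
table = RegTable()
table.add_read(5, 0xA)
table.add_write(5, 0xA)
table.del_write(5)
table.del_read(5)
```

The recorded sequence for this lifetime is ADD, MOD, MOD, DEL: one kernel registration per fd whose mask is rewritten as directions come and go, in contrast to kqueue's independent per-direction filters.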

Validation: the kqueue path runs end-to-end on the macOS host (examples/event/1632, unchanged);
the epoll bindings + ABI layout are corpus-locked ir-only by
examples/event/1633 (x86_64-linux, both arches probe-verified). The epoll Loop
is verified to lower clean for both linux arches and self-reviewed, but is not
corpus-snapshotted (a Loop example drags the std barrel → ~18k-line brittle IR);
runtime behavior validates on a linux runner.
agra
2026-06-26 08:37:12 +03:00
parent 501399b1a9
commit cc13700237
8 changed files with 647 additions and 8 deletions


@@ -6,24 +6,51 @@
// registrations cost nothing — the substrate an httpz-shaped server
// worker stands on.
//
-// Backend: kqueue (std/net/kqueue) on darwin. The epoll twin
-// (std/net/epoll, PLAN-HTTPZ S4) slots in behind this same surface
-// when the linux target lands; callers never see the backend.
+// Backend: kqueue (std/net/kqueue) on darwin, epoll (std/net/epoll) on
+// linux. The whole `Loop` struct is selected per-OS by `inline if OS`
+// (the compiler's flatten pre-pass picks the matching top-level decl) —
+// callers never see the backend. The two backends differ enough in state
+// that they are separate structs rather than one struct with conditional
+// fields (sx has no conditional struct fields): kqueue carries only its
+// queue fd, while epoll keeps a small per-fd registration table (it has
+// ONE registration per fd with a combined interest mask, and its event
+// echoes back only a single `data` word — we stash the fd there and the
+// table maps fd → the caller's udata).
//
// Interest is per direction: read and write are registered and removed
-// independently (mirroring kqueue filters; the epoll backend will
-// compose its event mask internally). The typical server pattern:
-// read interest for a connection's whole life, write interest only
-// while a partial response is pending.
+// independently. On kqueue these are independent EVFILT_* filters; on
+// epoll the Loop composes the combined EPOLLIN/EPOLLOUT mask internally
+// and issues EPOLL_CTL_ADD/MOD/DEL. The typical server pattern: read
+// interest for a connection's whole life, write interest only while a
+// partial response is pending.
//
// Deadlines: the loop deliberately has no timer registrations —
// httpz-style timeout bookkeeping (request/keepalive eviction) is
// deadline math the caller does with `deadline_in`/`expired` between
// waits, passing the nearest deadline as `wait`'s timeout.
//
// VALIDATION: the kqueue path runs end-to-end on the macOS dev host
// (examples/event/1632 — which exercises the full facade surface:
// add_read/write, add_wake/wake, wait, del_*, EOF). The epoll path has no
// linux box here, so it is verified to LOWER clean for x86_64-linux and
// aarch64-linux (the whole module + every epoll syscall emits) and is
// self-reviewed; it is NOT corpus-snapshotted (a Loop example pulls in the
// std barrel → an ~18k-line IR dump that would churn on any unrelated std
// change — worse than the gap). The epoll ABI itself (the layout-sensitive
// part) IS corpus-locked, by examples/event/1633 over the raw bindings.
// Runtime behavior validates on a linux runner.
#import "modules/std.sx";
kqb :: #import "modules/std/net/kqueue.sx";
timp :: #import "modules/std/time.sx";
// NOTE: the epoll backend is imported INSIDE the `inline if OS == .linux`
// branch below, never at top level. event.sx rides the std.sx barrel, so a
// top-level `#import "epoll.sx"` would register epoll's types into EVERY std
// program's type table on darwin too — drifting every `.ir` snapshot. Scoping
// the import to the linux branch keeps darwin's type graph unchanged. (kqb
// stays top-level: it was already there before the epoll split, so darwin's
// table — and the snapshots — match; on linux its kqueue externs are unused
// declares.)
EventErr :: error {
Init, // the kernel queue could not be created
@@ -36,7 +63,8 @@ EventErr :: error {
// eof — the peer finished writing (drain pending bytes, then close);
// err — the registration itself failed asynchronously;
// user — a cross-thread wake() (see add_wake), no fd attached;
-// nbytes — bytes readable / writable-buffer space (backend estimate);
+// nbytes — bytes readable / writable-buffer space (backend estimate;
+// kqueue reports it, epoll does not → 0 on linux);
// udata — the word given at registration, verbatim.
Event :: struct {
fd: i32 = -1;
@@ -49,6 +77,175 @@ Event :: struct {
nbytes: i64 = 0;
}
inline if OS == .linux {
ep :: #import "modules/std/net/epoll.sx";
// ── epoll backend (linux) ──────────────────────────────────────────────
// epoll reports a single 64-bit `data` per event and carries ONE
// registration per fd, so the Loop keeps a tiny table: each `Reg` records
// the fd's current combined interest mask and the caller's udata. The fd
// itself is stashed in epoll's `data` (so `epoll_wait` reports which fd
// fired); the table recovers the udata and lets add/del compose the mask
// into an EPOLL_CTL_ADD / MOD / DEL.
//
// One semantic difference from the kqueue backend: epoll has a SINGLE
// udata per fd (not per direction), so registering read and write on the
// same fd with different udata words keeps the most recent — a readable
// and a writable event on that fd then report the same udata. Callers key
// udata on the fd/connection (the universal pattern), so this is
// invisible in practice; pass the same udata for both directions of a fd.
Reg :: struct {
fd: i32 = -1;
mask: u32 = 0;
udata: usize = 0;
}
Loop :: struct {
epfd: i32 = -1;
wake_fd: i32 = -1; // eventfd, lazily created by add_wake
wake_udata: usize = 0;
regs: List(Reg);
// The Loop outlives the caller's current `context.allocator` scope, so
// capture the owning allocator at init and grow `regs` through it (the
// long-lived-container rule).
own: Allocator;
init :: () -> Loop !EventErr {
e := ep.ep_create();
if e < 0 { raise error.Init; }
return Loop.{ epfd = e, regs = .{}, own = context.allocator };
}
close :: (self: *Loop) {
if self.epfd >= 0 { socket.close(self.epfd); }
if self.wake_fd >= 0 { socket.close(self.wake_fd); }
self.regs.deinit(self.own);
self.epfd = -1;
self.wake_fd = -1;
}
// Index of the registration for `fd`, or -1. Linear scan — fd counts in
// the M:1 / per-worker model are small (mirrors the scheduler's waiter
// lists).
reg_index :: (self: *Loop, fd: i32) -> i64 {
i := 0;
while i < self.regs.len {
if self.regs.items[i].fd == fd { return i; }
i += 1;
}
return -1;
}
// Drive `fd`'s registration to interest `mask`: ADD a new fd, MOD an
// existing one, or DEL (and forget) when the mask drops to zero. The
// table is kept in lockstep with the kernel. True on success.
apply_mask :: (self: *Loop, fd: i32, mask: u32, udata: usize) -> bool {
idx := self.reg_index(fd);
if mask == 0 {
if idx < 0 { return true; }
ok := ep.ep_ctl(self.epfd, ep.EPOLL_CTL_DEL, fd, 0);
// swap-remove the forgotten reg (order is irrelevant).
self.regs.items[idx] = self.regs.items[self.regs.len - 1];
self.regs.len = self.regs.len - 1;
return ok;
}
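if idx >= 0 {
// Update the table only after the kernel accepts the change, so a
// failed MOD leaves the recorded mask matching the kernel's.
if !ep.ep_ctl(self.epfd, ep.EPOLL_CTL_MOD, fd, mask) { return false; }
self.regs.items[idx].mask = mask;
self.regs.items[idx].udata = udata;
return true;
}
// Likewise, record a new fd only once the ADD has succeeded.
if !ep.ep_ctl(self.epfd, ep.EPOLL_CTL_ADD, fd, mask) { return false; }
self.regs.append(Reg.{ fd = fd, mask = mask, udata = udata }, self.own);
return true;
}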
// Read interest also arms EPOLLRDHUP so a peer half-close surfaces as
// `Event.eof` — matching kqueue's EV_EOF, which comes for free.
add_read :: (self: *Loop, fd: i32, udata: usize) -> !EventErr {
idx := self.reg_index(fd);
mask := ep.EPOLLIN | ep.EPOLLRDHUP;
if idx >= 0 { mask = self.regs.items[idx].mask | ep.EPOLLIN | ep.EPOLLRDHUP; }
if !self.apply_mask(fd, mask, udata) { raise error.Register; }
return;
}
del_read :: (self: *Loop, fd: i32) {
idx := self.reg_index(fd);
if idx < 0 { return; }
mask := self.regs.items[idx].mask & ~(ep.EPOLLIN | ep.EPOLLRDHUP);
self.apply_mask(fd, mask, self.regs.items[idx].udata);
}
add_write :: (self: *Loop, fd: i32, udata: usize) -> !EventErr {
idx := self.reg_index(fd);
mask := ep.EPOLLOUT;
if idx >= 0 { mask = self.regs.items[idx].mask | ep.EPOLLOUT; }
if !self.apply_mask(fd, mask, udata) { raise error.Register; }
return;
}
del_write :: (self: *Loop, fd: i32) {
idx := self.reg_index(fd);
if idx < 0 { return; }
mask := self.regs.items[idx].mask & ~ep.EPOLLOUT;
self.apply_mask(fd, mask, self.regs.items[idx].udata);
}
// The loop's wake channel: an eventfd registered for EPOLLIN. wake()
// from any thread writes the 8-byte counter, making wait() return an
// Event carrying `udata` with `.user` set. (kqueue uses EVFILT_USER;
// epoll's idiom is eventfd.) One registration serves the Loop's life.
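add_wake :: (self: *Loop, udata: usize) -> !EventErr {
if self.wake_fd < 0 {
wfd := ep.eventfd(0, ep.EFD_CLOEXEC | ep.EFD_NONBLOCK);
if wfd < 0 { raise error.Register; }
// Register once, when the eventfd is created: a second EPOLL_CTL_ADD on
// an already-registered fd fails with EEXIST, so a repeat add_wake only
// updates the udata.
if !ep.ep_ctl(self.epfd, ep.EPOLL_CTL_ADD, wfd, ep.EPOLLIN) {
socket.close(wfd);
raise error.Register;
}
self.wake_fd = wfd;
}
self.wake_udata = udata;
return;
}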
// Thread-safe: writing the eventfd counter is atomic.
wake :: (self: *Loop) {
if self.wake_fd < 0 { return; }
one : u64 = 1;
socket.write(self.wake_fd, xx @one, 8);
}
// Fill `out` with ready events, waiting at most `timeout_ms`
// (negative = forever). Returns the count; 0 is a timeout.
wait :: (self: *Loop, out: []Event, timeout_ms: i64) -> i64 !EventErr {
raw : [64]ep.EpollEvent = ---;
cap : i64 = 64;
if xx out.len < cap { cap = xx out.len; }
n := ep.ep_wait(self.epfd, .{ ptr = @raw[0], len = cap }, xx cap, xx timeout_ms);
if n < 0 { raise error.Wait; }
i := 0;
while i < n {
evr := raw[i];
fd := ep.ev_fd(evr);
e : Event = .{ fd = fd };
if self.wake_fd >= 0 and fd == self.wake_fd {
// Drain the eventfd counter so it doesn't re-fire immediately.
drain : u64 = 0;
socket.read(self.wake_fd, xx @drain, 8);
e.user = true;
e.udata = self.wake_udata;
} else {
idx := self.reg_index(fd);
if idx >= 0 { e.udata = self.regs.items[idx].udata; }
if ep.ev_readable(evr) { e.readable = true; }
if ep.ev_writable(evr) { e.writable = true; }
if ep.ev_eof(evr) { e.eof = true; }
if ep.ev_err(evr) { e.err = true; }
}
out[i] = e;
i += 1;
}
return xx n;
}
}
} else {
// ── kqueue backend (darwin) ────────────────────────────────────────────
Loop :: struct {
kq: i32 = -1;
@@ -118,7 +315,10 @@ Loop :: struct {
}
}
}
// ── deadline helpers (monotonic, std.time) ───────────────────────────
// Backend-independent — shared by both Loop variants.
// The absolute monotonic instant `ms` from now.
deadline_in :: (ms: i64) -> i64 {


@@ -0,0 +1,140 @@
// std/net/epoll — raw epoll bindings: the linux twin of std/net/kqueue.
// linux-only by definition; the OS-neutral Loop facade over both backends is
// std.event. Import this module explicitly — like its kqueue sibling it
// deliberately does not ride the std.sx barrel.
//
// One epoll instance multiplexes readiness for any number of fds: a registered
// fd reports through `epoll_wait` when its interest mask (EPOLLIN / EPOLLOUT)
// fires, and an idle registration costs nothing — the head-of-line-free
// substrate the event Loop and an httpz-shaped server worker stand on.
//
// ── How this differs from kqueue (and why the surface is shaped this way) ──
// - ONE registration per fd carries a combined events MASK; changing the mask
// is EPOLL_CTL_MOD, not a second EVFILT_* add. The Loop (std.event) tracks
// the per-fd mask and feeds the full mask on each change.
// - `epoll_event` echoes back a single 64-bit `data` word, NOT the fd in a
// separate field the way kqueue's `ident` is the fd. We stash the fd in the
// low 32 bits of `data` (`data_lo`) so `epoll_wait` reports which fd fired;
// a caller wanting a wider udata keeps its own fd→udata map.
// - EOF is EPOLLHUP / EPOLLRDHUP flags on a readable event, not kqueue's
// EV_EOF; an async registration error is EPOLLERR.
//
// ── struct epoll_event layout (the one real ABI landmine) ──────────────────
// struct epoll_event { uint32_t events; epoll_data_t data; }; // data is a
// union { void* ptr; int fd; uint32_t u32; uint64_t u64; } (8 bytes).
// On x86_64 the struct is __attribute__((packed)) → 12 bytes, `data` at
// offset 4. On every other arch (aarch64) it is naturally aligned → 16 bytes,
// `data` at offset 8. sx has no packed-struct primitive, so we model the
// 8-byte `data` union as two u32 halves and let the field layout fall out per
// arch:
// x86_64 : { events@0, data_lo@4, data_hi@8 } → 12 bytes
// aarch64: { events@0, pad@4, data_lo@8, data_hi@12 } → 16 bytes
// Every field is a u32 at a 4-aligned offset, so no packed attribute and no
// unaligned 8-byte access is ever needed — yet `size_of(EpollEvent)` and the
// `[N]EpollEvent` stride come out byte-exact for the kernel ABI on both
// arches, and `epoll_wait` can fill a plain `[]EpollEvent` directly. (Both
// arches are little-endian, so the fd — an `int` in the union — is the low
// word, `data_lo`.) This struct-per-arch shape was chosen over raw byte-offset
// poking deliberately: idiomatic field reads, no scalar-pointer indexing
// (issue 0155), no unaligned u64.
//
// VALIDATION NOTE: the dev host is aarch64-macOS — there is no linux box to run
// this against, so this module is currently IR-only verified: the arch-correct
// layout (12-byte / 16-byte stride, fd offset) surfaces as the struct shape in
// `sx ir --target *-linux`, and the whole module lowers clean. Runtime
// correctness (syscall behavior, the kernel-filled event array, EPOLLRDHUP
// semantics) validates end-to-end only on a linux runner — mirror of how the
// Win64 switch was IR-only until a Windows VM appeared (CHECKPOINT-FIBERS
// B1.3b-1).
//
// No `#import "modules/build.sx"` despite the `inline if ARCH` below: a
// top-level `inline if OS/ARCH/POINTER_SIZE` conditional is resolved by the
// compiler's flatten pre-pass (imports.zig — name-matched against the target),
// NOT by reading build.sx's `ARCH` global as a value. Skipping the import keeps
// this module's IR self-contained (libc only) — no std/compiler/bundle baggage.
libc :: #library "c";
// struct epoll_event, arch-exact (see the header). Both variants expose the
// same three load-bearing fields — `events`, `data_lo` (the fd), `data_hi` — so
// consumer code is arch-agnostic; the aarch64 `pad` is never touched.
inline if ARCH == .x86_64 {
EpollEvent :: struct {
events: u32 = 0;
data_lo: u32 = 0; // the fd (union's low 32 bits)
data_hi: u32 = 0;
}
} else {
EpollEvent :: struct {
events: u32 = 0;
pad: u32 = 0; // alignment pad before the 8-aligned data union
data_lo: u32 = 0; // the fd (union's low 32 bits)
data_hi: u32 = 0;
}
}
// ── interest mask (events) ─────────────────────────────────────────────────
EPOLLIN :u32: 0x001;
EPOLLPRI :u32: 0x002;
EPOLLOUT :u32: 0x004;
EPOLLERR :u32: 0x008;
EPOLLHUP :u32: 0x010;
EPOLLRDHUP :u32: 0x2000; // peer half-closed (drain, then close)
EPOLLET :u32: 0x80000000; // edge-triggered
EPOLLONESHOT:u32: 0x40000000; // disarm after one delivery
// ── epoll_ctl ops ──────────────────────────────────────────────────────────
EPOLL_CTL_ADD :i32: 1;
EPOLL_CTL_DEL :i32: 2;
EPOLL_CTL_MOD :i32: 3;
// epoll_create1 / eventfd flags (== O_CLOEXEC).
EPOLL_CLOEXEC :i32: 0x80000;
EFD_CLOEXEC :i32: 0x80000;
EFD_NONBLOCK :i32: 0x800;
epoll_create1 :: (flags: i32) -> i32 extern libc;
epoll_ctl :: (epfd: i32, op: i32, fd: i32, event: *EpollEvent) -> i32 extern libc;
epoll_wait :: (epfd: i32, events: *EpollEvent, maxevents: i32, timeout: i32) -> i32 extern libc;
// eventfd: the cross-thread wake channel (epoll's answer to EVFILT_USER).
eventfd :: (initval: u32, flags: i32) -> i32 extern libc;
// errno, bound locally on linux (`__errno_location`; darwin's is `__error`,
// but this module only ever lowers under a linux target).
errno_slot_ep :: () -> *i32 extern libc "__errno_location";
EINTR_EP :: 4;
// ── readiness-flag helpers over one event ──────────────────────────────────
ev_readable :: (e: EpollEvent) -> bool { return (e.events & EPOLLIN) != 0; }
ev_writable :: (e: EpollEvent) -> bool { return (e.events & EPOLLOUT) != 0; }
// EPOLLHUP (full close) or EPOLLRDHUP (peer half-closed) — drain then close.
ev_eof :: (e: EpollEvent) -> bool { return (e.events & (EPOLLHUP | EPOLLRDHUP)) != 0; }
ev_err :: (e: EpollEvent) -> bool { return (e.events & EPOLLERR) != 0; }
// The fd stashed in `data` at registration.
ev_fd :: (e: EpollEvent) -> i32 { return xx e.data_lo; }
// ── thin wrappers ──────────────────────────────────────────────────────────
// Create an epoll instance (close-on-exec). <0 on failure.
ep_create :: () -> i32 {
return epoll_create1(EPOLL_CLOEXEC);
}
// Apply one registration change: add / modify / delete `fd`'s interest
// `events` on `epfd`, stashing `fd` in `data` so `epoll_wait` reports it. True
// on success. For EPOLL_CTL_DEL the kernel ignores the event payload.
ep_ctl :: (epfd: i32, op: i32, fd: i32, events: u32) -> bool {
ev : EpollEvent = .{ events = events, data_lo = xx fd };
return epoll_ctl(epfd, op, fd, @ev) == 0;
}
// Drain ready events into `events` (room for `maxev` entries), waiting at most
// `timeout_ms` (negative = forever). Returns the event count (0 = timeout); -1
// only on a real failure — EINTR is retried (mirror of kqueue's kq_wait).
ep_wait :: (epfd: i32, events: []EpollEvent, maxev: i32, timeout_ms: i32) -> i32 {
while true {
n := epoll_wait(epfd, @events[0], maxev, timeout_ms);
if n >= 0 { return n; }
if errno_slot_ep().* != EINTR_EP { return -1; } // EINTR: reissue
}
return -1;
}