# PLAN-HTTPZ — Stream HTTPZ (production HTTP-server readiness)

> **STATUS: 🟡 PLANNED — not started.** This stream is being (re)established as a
> *tracked* stream. The HTTP/socket/thread work to date shipped ad-hoc under phase
> tags in source comments (`S2` socket nonblocking, `S6` pthreads, `S7a` http server,
> `S7b` thread-pool handlers, `C3` per-OS selection) but **never had a PLAN/CHECKPOINT
> file** — the comments in [socket.sx:3](../library/modules/std/socket.sx#L3) /
> [thread.sx:23](../library/modules/std/thread.sx#L23) reference a "PLAN-HTTPZ" that did
> not exist until now. Progress tracked in [CHECKPOINT-HTTPZ.md](CHECKPOINT-HTTPZ.md).

**Goal:** the low-level guarantees needed to run a long-lived HTTP service on Linux in
production — *not* a web framework. Survive malformed clients, slow clients, overload,
restarts, memory pressure, and a normal Linux deployment without every app author
rediscovering the same failure modes. Driven by the user's production-readiness checklist
(P0 blockers, P1 hardening, P2 ergonomics), mapped below to concrete sx work.

**Cadence (IMPASSIBLE):** no commit both adds a test AND makes it pass (lock-to-bail, then
flip to green); `zig build && zig build test` green after every step; never regen snapshots
while red; scope regens with `-Dname=examples/<cat>/<file>.sx -Dupdate-goldens` + review the
diff. HTTP corpus lives in `examples/http/` (`16xx`/`http` category) + `examples/event/`.
Stress/fuzz/load harnesses live OUTSIDE the corpus (the corpus runner has a 10s/example
timeout and no network sandbox — see [corpus_run.test.zig](../src/corpus_run.test.zig)).

---

## Audit of record (grounded against the tree, 2026-06-26)

What already exists, so the next session does not redo discovery. **Two layers of P0 #1
are already done to a high standard; one piece is a literal blocker.**

### ✅ Solid / Linux-validated — do NOT rebuild
- **[event.sx](../library/modules/std/event.sx)** — `Loop` fully branches `OS == .linux`
  (epoll, lines ~85–251) vs kqueue (~251–323). Ran **6/6 green on real aarch64 Linux** in
  an Apple `container` VM (kernel 6.18); ABI corpus-locked by `examples/event/1633`.
- **[net/epoll.sx](../library/modules/std/net/epoll.sx)** — arch-aware `EpollEvent` layout
  (12B packed x86_64 / 16B aligned aarch64 via the u32-split trick), correct flags, EINTR
  retry, `__errno_location`.
- **[net/kqueue.sx](../library/modules/std/net/kqueue.sx)** — macOS-only, correct.
- **[sched.sx](../library/modules/std/sched.sx)** — M:1 fiber runtime; epoll/kqueue
  fd-readiness fully branched incl. `EPOLL_CTL_DEL` after `EPOLLONESHOT`. Linux-tested.
- **[json.sx](../library/modules/std/json.sx)** — streaming writer, zero-copy views,
  explicit allocators, stable key order. (Integers only — no floats.)

### ❌ BROKEN on Linux — the keystone blocker (Phase C3)
- **[socket.sx](../library/modules/std/socket.sx)** — Darwin-only, no `OS` branching:
  - `SockAddr` ([:32](../library/modules/std/socket.sx#L32)) carries Darwin's `sin_len:u8`
    at offset 0; Linux `sockaddr_in` has no such field → family/port written to wrong
    offsets, addresses corrupted.
  - `O_NONBLOCK = 4` — Linux is `2048`; `set_nonblocking` sets the wrong bit.
  - errno constants are macOS values (`EAGAIN=35`→Linux 11, `EINPROGRESS=36`→115,
    `ECONNRESET=54`→104, …) → WouldBlock/reset detection silently breaks.
  - `errno_slot` binds `__error` ([:52](../library/modules/std/socket.sx#L52)) — Linux is
    `__errno_location`.
- **[thread.sx](../library/modules/std/thread.sx)** — Darwin pthread struct sizes:
  - `MutexBuf = 64B` ([:44](../library/modules/std/thread.sx#L44)) is Darwin's
    `pthread_mutex_t`; glibc is **40B** on x86_64 (48B on aarch64) → `pthread_mutex_init`
    overflows the buffer by 24B. **Heap corruption on first mutex init under the thread
    pool.** (`CondBuf = 48B` happens to match glibc — fragile coincidence.)
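
The per-OS differences listed above can be sketched as a lookup table plus a layout packer. This is an illustrative Python model, not the sx implementation: `PLATFORM`, `platform_consts`, and `pack_sockaddr_in` are hypothetical names, and the constant values are the ones quoted in the audit. An unknown target raises instead of falling back to Darwin values.

```python
import struct

AF_INET = 2  # same value on Darwin and Linux

# Constant values as quoted in the audit above, keyed by target OS.
PLATFORM = {
    "darwin": {"O_NONBLOCK": 4, "EAGAIN": 35, "EINPROGRESS": 36,
               "ECONNRESET": 54, "errno_symbol": "__error"},
    "linux": {"O_NONBLOCK": 2048, "EAGAIN": 11, "EINPROGRESS": 115,
              "ECONNRESET": 104, "errno_symbol": "__errno_location"},
}


def platform_consts(os_name):
    """Return the constant table for os_name; unknown targets fail loudly."""
    try:
        return PLATFORM[os_name]
    except KeyError:
        raise ValueError(f"unsupported OS: {os_name!r}") from None


def pack_sockaddr_in(os_name, port, addr):
    """Pack a 16-byte IPv4 sockaddr_in using the layout of the given OS."""
    platform_consts(os_name)  # bail on unknown targets before packing
    octets = [int(part) for part in addr.split(".")]
    if len(octets) != 4:
        raise ValueError(f"not a dotted-quad IPv4 address: {addr!r}")
    ip = bytes(octets)  # raises ValueError if any octet is outside 0..255
    if os_name == "darwin":
        # sin_len:u8, sin_family:u8, sin_port (network order), sin_addr, 8 pad bytes
        return struct.pack("!BBH4s8x", 16, AF_INET, port, ip)
    # Linux: sin_family:u16 in host byte order, no sin_len field
    return struct.pack("=H", AF_INET) + struct.pack("!H4s8x", port, ip)
```

Both layouts are 16 bytes and place the port at offset 2, so the mismatch is confined to the first two bytes: writing Darwin's `sin_len`/`sin_family` pair into a Linux `sockaddr_in` produces a wrong 16-bit family value.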

### ⚠️ Works, unhardened — [http.sx](../library/modules/std/http.sx)
Single-worker event loop; inline (`thread_pool_count = 0`) + pooled handlers; keep-alive +
pipelining; delivery timeouts (`timeout_request_ms`/`timeout_keepalive_ms`); conn cap
(`max_conn`) + per-conn request cap (`request_count`); emits 400/413/431/503. Connection
state machine: `CONN_FREE/READING/WRITING/KEEPALIVE/HANDLING` with `gen` counter.

**Gaps (this stream's HTTP work):**
- **Parser:** `Content-Length` only (no `Transfer-Encoding`); no per-header-line size
  limit; no header-count limit; no request-line/version syntax validation; no duplicate
  `Content-Length` rejection; no `Content-Length` overflow guard.
- **Memory:** `Server.close()` ([:317](../library/modules/std/http.sx#L317)) frees none of
  the `conns` array, the `PoolState` struct, or `ps.done` → shutdown leaks.
- **Shutdown:** `run()` is an infinite loop; no `Server.stop()`; `close()` is abrupt (no
  drain).
- **Timeouts:** delivery-only. **No handler-execution timeout** — a hung handler blocks
  the loop (inline) or pins a pool worker forever; no cancellation.
- **Observability:** **none.** All accept/read/write/loop faults close the connection
  silently — no log hook, no counters/metrics.
- **Response:** whole response built in one allocation; no streaming, no body-size
  backpressure; alloc-failure path unhandled.
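
To make the parser gaps concrete, here is a minimal Python sketch of the missing header checks. It is not the http.sx parser: `check_headers` and the three limits are hypothetical, the line-size check approximates a header line as `name: value`, and the status mapping (431 for header limits, 400 for malformed or repeated `Content-Length`, 413 for an oversized one, 501 for `Transfer-Encoding`) is one reasonable reading of the outcomes named in this plan.

```python
MAX_HEADER_COUNT = 100          # hypothetical limit
MAX_HEADER_LINE = 8192          # hypothetical limit, bytes per "name: value" line
MAX_CONTENT_LENGTH = 1 << 20    # hypothetical body cap (1 MiB)
MAX_LENGTH_DIGITS = 19          # refuse values that could overflow a u64 parse


def check_headers(headers):
    """Validate (name, value) pairs.

    Returns (status, content_length): status is None when the headers are
    acceptable, otherwise the HTTP status to reject with.
    """
    if len(headers) > MAX_HEADER_COUNT:
        return 431, None
    lengths = []
    for name, value in headers:
        if len(name) + 2 + len(value) > MAX_HEADER_LINE:
            return 431, None
        lowered = name.lower()
        if lowered == "transfer-encoding":
            return 501, None  # explicit rejection until chunked is implemented
        if lowered == "content-length":
            lengths.append(value.strip())
    if not lengths:
        return None, 0
    if len(lengths) > 1:
        return 400, None  # duplicate or conflicting Content-Length
    digits = lengths[0]
    # ASCII digits only: no sign, no exponent, no whitespace inside the value.
    if not (digits.isascii() and digits.isdigit()):
        return 400, None
    # Check the digit count before converting, as a fixed-width parser must.
    if len(digits) > MAX_LENGTH_DIGITS or int(digits) > MAX_CONTENT_LENGTH:
        return 413, None
    return None, int(digits)
```

This sketch rejects any repeated `Content-Length`, including identical duplicates, which is the stricter of the two behaviors a server may choose.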

### ❌ Absent entirely
- **CI:** no `.github/workflows`, no Linux CI. Local `zig build test` on macOS only.
- **Fuzz / sanitizers / leak-check:** none. [tests/stress-http.sh](../tests/stress-http.sh)
  is broken (references deleted `examples/32-http-server.sx`).
- **Releases:** no git tags, no CHANGELOG, no stability tiers.
- **Security:** no SECURITY.md, no disclosure process, no posture statement.
- **Deploy docs:** cross-compile/static-link documented in [readme.md](../readme.md); no
  systemd/Docker/reverse-proxy/health-check/graceful-shutdown examples.
- **TLS:** none yet — to be added natively via an mbedTLS FFI binding (Phase T). Proxy
  deployment stays documented as an option (D1).
- **Routing / query / form helpers:** manual `if req.path == …` dispatch only.

---

## Phases (dependency-ordered; checklist item in parens)

C3 is the keystone — until socket.sx + thread.sx are correct on Linux, **nothing in P0 is
honestly testable on Linux**, so Phase C precedes all else regardless of how the rest is
sliced.

### Phase C — Linux foundation (P0 #1) — unblocks everything
- **C3a — `socket.sx` per-OS.** Branch `SockAddr` (drop `sin_len` on Linux), `O_NONBLOCK`,
  the errno constants, and `errno_slot` (`__error` vs `__errno_location`) on `OS`/`ARCH`,
  mirroring the `inline if OS ==` pattern already proven in `event.sx`/`sched.sx`. No
  silent fallback defaults (CLAUDE.md rule). Lock a Linux-vs-Darwin layout/const test red,
  then green.
- **C3b — `thread.sx` per-OS.** Correct `MutexBuf`/`CondBuf` sizes per glibc (40/48 on
  x86_64; 48/48 on aarch64) vs Darwin (64/48), branched on `OS`/`ARCH`. Memory-safety fix,
  not cosmetics. Validate under the Apple `container` Linux VM that the pool no longer
  corrupts the heap.
- **C4 — Linux CI.** A workflow building + running `zig build test` (incl. the HTTP corpus)
  on Linux. The Apple-`container` path is proven for local validation; CI needs a real
  Linux runner (GH Actions `ubuntu` and/or self-hosted aarch64). First CI of any kind for
  the repo.
- **C5 — Linux socket I/O corpus.** Examples covering accept/read/write/close/error on
  Linux (today only the macOS-friendly `1633` covers the happy path). Threaded-handler
  example included.
- *Acceptance:* basic server compiles + runs on Linux; HTTP suite passes on Linux; accept/
  read/write/close/error paths covered; threaded mode correct.

### Phase H — HTTP hardening (P0 #2–6)
- **H1 — Parser hardening (#2).** Max header-line size, max header count, strict
  request-line + version validation, CRLF strictness, `Content-Length` overflow guard +
  duplicate/conflicting rejection, **`Transfer-Encoding: chunked` → 501** (full impl in
  S1), slowloris coverage (delivery-timeout already mitigates). Outcomes: 400 / 413 / 431 /
  501 / safe-close. Unit + fuzz-seed corpus.
- **H2 — Memory lifecycle (#3).** Fix `Server.close()` to free `conns`, `PoolState`, and
  `ps.done`. Document allocator ownership (long-lived containers must capture their owner —
  CLAUDE.md rule; the read buffers are intentionally per-conn). Leak gate: start/stop loop
  with GPA counters asserting zero + repaired stress script for RSS-over-churn.
- **H3 — Graceful shutdown (#4).** `Server.stop()` — stop accepting, drain in-flight within
  a timeout, close idle keep-alives, return from `run()` cleanly. Tests: start/stop/restart
  in one process; no FD leak; no mem leak.
- **H4 — Explicit errors + observability hooks (#5, #9).** Route accept/read/write/loop
  faults through a pluggable log/error hook instead of silent close; add counters (active /
  accepted / closed conns, requests served, parser errors, timeouts, rejected, 4xx/5xx,
  pool queue depth, optional request duration). Hook-based — no forced logging format.
  (#5 and #9 interlock; land together.)
- **H5 — Handler timeout + cancellation (#6).** Per-request deadline enforced in BOTH
  inline and pool modes; bound time in `CONN_HANDLING`; timed-out → 504 or safe close. A
  never-returning handler must not permanently consume capacity.
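
The hook-plus-counters shape described in H4 can be modelled in a few lines. The following Python sketch is only an illustration of the idea: `Counters`, `Observer`, and the `fault(stage, detail, counter)` signature are hypothetical and cover a subset of the counters listed above. The point is that every fault path bumps a counter and calls a user-supplied hook, and the server never chooses a log format itself.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Counters:
    accepted: int = 0
    closed: int = 0
    requests_served: int = 0
    parser_errors: int = 0
    timeouts: int = 0
    rejected: int = 0

    @property
    def active(self) -> int:
        """Connections accepted and not yet closed."""
        return self.accepted - self.closed


class Observer:
    """Routes faults to an optional hook and keeps plain integer counters."""

    def __init__(self, hook: Optional[Callable[[str, str], None]] = None):
        self.counters = Counters()
        self.hook = hook

    def bump(self, counter: str) -> None:
        # getattr raises AttributeError on an unknown counter name (no silent default).
        setattr(self.counters, counter, getattr(self.counters, counter) + 1)

    def fault(self, stage: str, detail: str, counter: Optional[str] = None) -> None:
        """Record a fault from `stage` (accept/read/write/loop) and notify the hook."""
        if counter is not None:
            self.bump(counter)
        if self.hook is not None:
            self.hook(stage, detail)
```

A caller that wants logging passes its own hook, for example `Observer(hook=lambda stage, detail: print(stage, detail))`; a caller that wants only metrics passes nothing and reads `observer.counters`.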

### Phase T — Native TLS via mbedTLS (#15) — revises the proxy-only posture
Native in-process HTTPS by binding a vetted C library (mbedTLS) over FFI — **not** a
pure-sx TLS stack (out of scope: security-critical, multi-year). Slotted after H because
TLS folds into the same connection state machine + `read_more`/`write_more` paths, which
must be stable first. Backend: **mbedTLS** (small pure-C, clean `WANT_READ`/`WANT_WRITE`
non-blocking API, static-links cleanly into `--self-contained` musl ELF; Apache-2.0).
- **T1 — mbedTLS FFI binding.** New `library/modules/ffi/mbedtls.sx` (or `std/tls.sx`):
  `extern "c"` decls for `mbedtls_ssl_{init,setup,handshake,read,write,close_notify}`,
  `mbedtls_ssl_config`, `mbedtls_x509_crt`, `mbedtls_pk_context`, `mbedtls_ctr_drbg` +
  `mbedtls_entropy`, `mbedtls_ssl_set_bio`, and the `WANT_READ`/`WANT_WRITE` error
  constants. Loud failure on any setup error (no silent default — CLAUDE.md rule).
- **T2 — Transport abstraction in `http.sx`.** Introduce a transport seam so `read_more`/
  `write_more` go through plaintext (today's `socket.*_nb`) OR TLS, instead of calling the
  socket directly. mbedTLS BIO callbacks bridge to the non-blocking fd: map socket
  `WouldBlock` → `MBEDTLS_ERR_SSL_WANT_READ/WANT_WRITE`.
- **T3 — Handshake state + event-loop integration.** New `CONN_TLS_HANDSHAKE` state before
  `CONN_READING`; drive `mbedtls_ssl_handshake` incrementally, mapping `WANT_READ` →
  `loop.add_read`, `WANT_WRITE` → `loop.add_write`; handshake deadline (reuse
  `timeout_request_ms`); graceful `close_notify` on shutdown (ties into H3).
- **T4 — TLS config surface.** `Config` gains `tls_enabled`, cert/key/chain paths, min
  version (default TLS 1.2+, prefer 1.3), optional ALPN, SNI (single default cert first;
  multi-cert later). Cert/key load failure is a loud `HttpErr`, never a silent fallthrough.
- **T5 — Tests + static-link + Linux validation.** TLS corpus example: in-process mbedTLS
  *client* handshakes against the server over loopback with a self-signed cert fixture
  (under `examples/http/16xx-…/`); cover bad-cert, handshake-failure, and mid-handshake
  client-abort paths. Verify a `--self-contained` static build links mbedTLS; run on macOS +
  aarch64 Linux (Apple `container`). Document the per-target mbedTLS static-archive
  requirement for self-contained builds (vendor vs system).
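
The T3 mapping from handshake results to event-loop interest is a small state transition, sketched here in Python. This is a model under assumptions, not the binding: `drive_handshake` is a hypothetical helper, `step` stands in for one call to `mbedtls_ssl_handshake`, and the `loop.add_read(fd)` / `loop.add_write(fd)` signatures are assumed from the names used above. The two constants are the values mbedTLS defines in `ssl.h`.

```python
MBEDTLS_ERR_SSL_WANT_READ = -0x6900
MBEDTLS_ERR_SSL_WANT_WRITE = -0x6880


def drive_handshake(step, loop, fd):
    """Run one handshake step and return the connection's next state.

    step: callable returning 0 on completion or a negative mbedTLS error code.
    loop: object exposing add_read(fd) and add_write(fd) (assumed interface).
    """
    rc = step()
    if rc == 0:
        return "CONN_READING"  # handshake complete, start reading the request
    if rc == MBEDTLS_ERR_SSL_WANT_READ:
        loop.add_read(fd)      # resume when the peer sends more handshake data
        return "CONN_TLS_HANDSHAKE"
    if rc == MBEDTLS_ERR_SSL_WANT_WRITE:
        loop.add_write(fd)     # resume when the socket can take more output
        return "CONN_TLS_HANDSHAKE"
    # Any other code is a hard failure: surface it, never fall through.
    raise RuntimeError(f"TLS handshake failed: -0x{-rc:04X}")
```

The handshake deadline and `close_notify` handling are left out of the sketch; they would wrap this call rather than change the mapping.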

### Phase S — Streaming, stress, stability (P1)
- **S1 — Streaming responses + chunked out (#10).** Explicit `Content-Length`; stream large
  bodies without buffering the whole response; write-backpressure-aware send; header-set /
  status / content-type helpers. (Builds on H1's chunked scaffolding.)
- **S2 — Request-body streaming (#11).** Incremental body reader, configurable max, early
  reject, mid-body-disconnect handling, backpressure-aware reads. Enables real inbound
  chunked bodies.
- **S3 — Fuzz harness (#7).** libFuzzer/AFL targets: request-line, header, `Content-Length`,
  keep-alive + pipeline state machine, partial reads, malformed bodies, random close
  timing. Runs manually + in CI. Crash/panic/hang = bug.
- **S4 — Load/stress suite (#8).** Repair + expand the stress scripts: many short-lived,
  many keep-alive, slow clients, large bodies at the limit, pool saturation, FD exhaustion
  → 503/backpressure (not crash), RSS-over-time. Document expected overload behavior.
- **S5 — Concurrency model docs (#12).** Write up the allocator/thread-safety rules already
  asserted in [thread.sx:11](../library/modules/std/thread.sx#L11) + the http.sx header:
  handler execution model, per-request lifetime, what may be retained after a handler
  returns, misuse cases.
- **S6 — API stability + security posture (#13, #14).** Tag a milestone; define the stable
  std subset (`http`/`socket`/`event`/`thread`/`mem`); SECURITY.md + disclosure process +
  the "reverse-proxy-only, not for direct internet exposure" posture statement + known
  limitations.
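
For the "chunked out" half of S1, the wire framing is simple enough to show directly. The Python sketch below uses a hypothetical `encode_chunked` generator and illustrates only the HTTP/1.1 chunked transfer coding (hex size, CRLF, data, CRLF, then a zero-size last chunk); it does not model write backpressure or trailers.

```python
def encode_chunked(chunks):
    """Yield the wire bytes for each body chunk, then the terminating chunk."""
    for chunk in chunks:
        if not chunk:
            # A zero-length chunk would be read as end-of-body, so skip it.
            continue
        yield b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    # Last chunk: size 0, no trailer fields, final CRLF.
    yield b"0\r\n\r\n"
```

Because it is a generator, a caller can send each frame as it is produced instead of building the whole response in one allocation, which is the gap named in the audit.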

### Phase D — Deploy & ergonomics (P2)
- **D1 — Reverse-proxy + deployment docs (#15, #20).** With native TLS shipping in Phase T,
  proxy deployment is now *an option, not the only option* — document both. Cover proxy TLS
  termination, forwarded headers, client-IP, size limits, timeouts, keep-alive, recommended
  proxy settings; AND native-TLS direct-exposure guidance (cert rotation, cipher/version
  policy). Plus systemd unit, Docker, health-check endpoint, graceful-shutdown, logging
  examples; release-binary build + static/dynamic linking notes (incl. the mbedTLS
  static-archive note from T5; cross-compile already in readme.md).
- **D2 — Routing + query/form helpers (#16, #17).** Thin layer over manual dispatch: method +
  path routing, path params, query parsing, 404/405, per-route limits/timeouts; form
  (urlencoded/multipart) + JSON request/response helpers over the existing json.sx.
- **D3 — Honest benchmarks (#18).** Revive [bench/run.sh](../bench/run.sh): plain-text,
  JSON, keep-alive, concurrency, pool, slow-client vs a baseline server; record hardware/
  OS/flags/command; measure latency, throughput, memory, error rate.
- **D4 — Compiler-confidence framing (#19).** Largely already true (corpus + `issues/`
  regressions, subprocess-isolated runner). Add the "supported vs experimental" labelling
  for language + std features; ensure production-critical features have corpus coverage.

---

## Decisions Log (HTTPZ specifics)
- **Native TLS via an mbedTLS FFI binding (Phase T)** — supersedes the original
  reverse-proxy-only posture (2026-06-26). The server gains in-process HTTPS; reverse-proxy
  deployment stays supported and documented (D1) as an option. **No pure-sx TLS stack** —
  TLS is security-critical and is delegated to the vetted C library. mbedTLS chosen over
  OpenSSL/LibreSSL for its small pure-C footprint, clean non-blocking `WANT_READ`/
  `WANT_WRITE` API, and clean static-linking into `--self-contained` musl builds
  (Apache-2.0).
- **`Transfer-Encoding: chunked`: reject (501) in H1, implement in S1/S2.** Pragmatic P0
  minimum is explicit rejection; full chunked support is gated on the streaming work.
- **Stress/fuzz/load live outside the corpus.** The corpus runner has a 10s/example
  timeout and no network sandbox; long-running adversarial harnesses are separate scripts
  wired into CI, not `examples/`.
- **No silent fallback defaults in any C3 branching** (CLAUDE.md REJECTED PATTERNS): a
  failed/unhandled OS or arch arm bails loudly, never picks a "reasonable-looking" Darwin
  default.