sx/design/execution-evolution-roadmap.md

# Execution-Model Evolution — Roadmap (comptime JIT · async · concurrency · hot-reload)

> Status: **exploratory design-of-record.** Captures the forward plan for sx's
> execution model across five interlocking threads. Not yet an active
> `PLAN-*`/`CHECKPOINT-*` stream — this is the shared design the streams would be
> carved from. Cross-platform shipping (the bundled-zig backend + the sx bundler)
> is **already landed**; see [bundled-zig-link-backend-design.md](bundled-zig-link-backend-design.md)
> and [../current/PLAN-DIST.md](../current/PLAN-DIST.md).

---

## 0. The thesis

sx's compiler stays small by pushing capability into **library sx + three general
primitives** (`inline asm`, `extern`/`export`, `atomics`) rather than baking
features into codegen. Concretely:

- **Async is a library, not a language feature** — colorblind, stackful fibers
  behind an `Io` interface (Zig-inspired). No function coloring, no
  async→state-machine transform. The implementation is pure sx down to a per-arch
  inline-asm context switch.
- **Comptime gains a JIT escape hatch** — the interpreter stays the default
  (debuggable, portable), but drops to a host-JIT for what it can't walk: inline
  asm (C3) and non-trivial FFI calls (C1). JIT-ing whole fragments such as the
  bundler (C4) stays possible on the same spine but is not planned (Decision 6).
- **One shared substrate** — a persistent ORC LLJIT + host-target emitter — serves
  the comptime FFI thunks, comptime-asm, and JIT-resident hot-reload.

The honest trade is **small *surface*, but each primitive is *deep*** — not "small
compiler." The net-new **compiler** obligations this plan adds (all verified absent
today): **atomics lowering** (N1), **`type_info` + `reify` + `field_type`** (comptime
type construction), **`callconv(.naked)`**, **repointable-`context` codegen** (+
per-fiber stack-limit), the **S1 persistent JIT spine**, **C1 thunk synthesis**,
**comptime-asm lifting** (C3), and (later) the **S2 ORC C++ shim**. **Generic enums**
`enum($T)` were on this list but are dropped (§7 step 2): `RecvResult`/`TryResult`
become type-fns over `reify`. Async itself is genuinely a library; the *enabling
primitives* are a major codegen/runtime investment. Already landed: `extern`/`export`,
the `!`/`try`/`catch`/`onfail`/`raise` ERR stream, value-level reflection, the
`sx run` ORC LLJIT, and the host-FFI trampolines. `inline asm` is still in flight.

---

## 1. The spine (shared substrate)

| ID | Piece | What | Size |
|----|-------|------|------|
| **S1** | Persistent JIT executor | A long-lived ORC LLJIT + a host-triple `LLVMEmitter` + a compiled-fragment cache, plumbed into the interpreter. Today the LLJIT exists only for `sx run`'s `main` ([target.zig:319](../src/target.zig#L319)); the emitter carries one target machine ([emit_llvm.zig:274](../src/ir/emit_llvm.zig#L274)). | L |
| **S2** | ORC C++ shim | `MachOPlatform::Create` + redirectable/lazy-reexport symbols. The bare `LLVMOrcCreateLLJIT` can't do thread-locals, C constructors, or symbol redefinition — the wall the C-with-sx JIT spike hit (`_Thread_local` SIGABRT; `errors-*` examples crashed). Required by any non-trivial JIT or symbol repoint. | M |

S1/S2 are the spine: built once, consumed by **C1** (the FFI thunks — the main
near-term consumer), **C3**, and (later) **R2**. S1 alone suffices for C1/C3 (bare
calling/asm thunks — no TLS/ctors); S2 is only needed for R2 and JIT-ing C-with-sx.

---

## 2. Comptime / build layer

| ID | Piece | Unblocks | Depends | Size |
|----|-------|----------|---------|------|
| **C1** | **Real comptime FFI — JIT calling-thunks (LLVM = single ABI authority).** Trivial calls (scalar/ptr/string args, single-reg return) keep the existing `host_ffi.zig` trampoline fast-path; everything else (floats, structs-by-value, aggregate returns, >8 args, varargs) synthesizes a per-signature thunk, JIT-compiles it via **S1**, and calls it with an args buffer the interpreter fills by known layout (`type_info`). **LLVM emits the ABI-correct call — the same lowering as runtime codegen — so comptime and runtime FFI share ONE ABI implementation.** Rejected: libffi (foreign 2nd ABI impl), hand-rolled sx+asm (3rd impl + drift risk + needs C3 to run its own asm leaf anyway). | struct/string/slice/float signatures at comptime; full C interop in `#run`; lifts the bundler's API straightjacket; unifies comptime+runtime FFI | S1 (fast-path: none) | L |
| **C2** | **`#compiler` → `extern` collapse** — BuildOptions hooks become real exported C symbols resolved through C1; `*BuildConfig` threaded via global/handle; delete `.compiler_expr`/`compiler_call`/Registry. | one FFI mechanism, not two | C1 (`extern`/`export` already shipped) | M |
| **C3** | **Comptime asm via host-JIT** — stop bailing on `inline_asm` ([interp.zig:1019](../src/ir/interp.zig#L1019)); lift the block (operand model at [inst.zig:354](../src/ir/inst.zig#L354): inputs/`out_value`/`out_place`/`out_ty`/clobbers) to a host-arch thunk via `LLVMGetInlineAsm`, JIT, call through C1, cache by template+sig. | running asm-containing code at comptime | S1, C1 (+S2 non-trivial) | M |
| **C4** *(DROPPED)* | **JIT-the-bundler** — **not built** (Decision 6). Interp+C1 is the shipping bundler (I/O-bound, so native speed is moot; C1 closes the only capability gap). Remains an always-available S1 optimization if profiling ever shows the bundler's *own logic* is a hotspot. | — | — | — |

**Residue:** cross-arch comptime asm (C3) can't run on the host — narrows the bail
to the cross-compile case; needs a sharp diagnostic ("asm targets `<arch>`, host
is `<host>`").
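
The gate behind that diagnostic can be sketched in a few lines. This is a Python model, not sx or compiler code: `comptime_asm_gate`, `normalize_arch`, and the alias table are hypothetical names, and the only real API used is `platform.machine()`.

```python
import platform
from typing import Optional

# Hypothetical alias table: the same architecture is reported under different
# names depending on the OS (macOS says "arm64", Linux says "aarch64").
_ALIASES = {"arm64": "aarch64", "amd64": "x86_64", "x64": "x86_64"}


def normalize_arch(name: str) -> str:
    key = name.strip().lower()
    return _ALIASES.get(key, key)


def comptime_asm_gate(asm_arch: str, host_arch: Optional[str] = None) -> Optional[str]:
    """Return None if the asm block can run on the host JIT, else a diagnostic."""
    host = normalize_arch(host_arch or platform.machine())
    target = normalize_arch(asm_arch)
    if target == host:
        return None
    return f"asm targets `{target}`, host is `{host}`"
```

Native builds always pass the gate; only the cross-compile case produces the message, which is the narrowed bail the residue describes.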

---

## 3. Concurrency primitives (atomics + threads)

> **Why this is its own section:** we are doing **multiple OS threads**, so the
> async runtime and any lock-free structure need real atomics. OS threads already
> exist; atomics do not.

| ID | Piece | State | Size |
|----|-------|-------|------|
| **N1** | **Atomics — NET-NEW compiler feature.** Atomic load/store/RMW (`add/sub/and/or/xor/swap` + `fetch_min`/`fetch_max`; no `nand`), `compare_exchange`/`_weak` (→ `?T`, **null = success**), and fences, with orderings (relaxed/acquire/release/acq_rel/seq_cst). LLVM provides all — an **emit** feature, not a runtime library. **Surface LOCKED = `Atomic($T)` wrapper + `Ordering` enum** (not `@atomic_*` — `@` is address-of in sx). | **lowering absent** — zero LLVM `atomicrmw`/`cmpxchg`/`fence` emission today; some IR/inference scaffolding exists | M |
| **N2** | **OS threads + pthread Mutex/Cond + worker Pool** | **landed** — [std/thread.sx](../library/modules/std/thread.sx) (`pthread_create`/`join`/`detach`, in-place `Mutex`/`Cond`, bounded `Pool`). NOTE: pthread mutex **blocks the OS thread** — it is *not* fiber-aware (it would park every fiber on that thread); fiber-aware sync is N3, built on N1. | — |
| **N3** | **Fiber-aware sync** — mutex / channel / waitgroup that **suspend the fiber**, not the OS thread. Hybrid: atomic fast-path (N1) + fiber-suspend slow-path (A2/A5). Distinct from the pthread primitives in N2. | new library | M |

**Compiler obligation for N1:** the emit must map sx orderings to LLVM's and **not
reorder across atomics/fences**. Comptime is single-threaded, so the interpreter
can treat atomic ops as ordinary ops (seq_cst is trivially satisfied with one
thread) — no interp atomics machinery needed.
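
A rough picture of the ordering map, written as a Python model that prints LLVM IR text rather than as the real emitter. The helper names are hypothetical. The LLVM side is standard: `relaxed` is spelled `monotonic`, loads reject `release`/`acq_rel`, stores reject `acquire`/`acq_rel`, and a `fence` needs at least `acquire` or `release`. Mapping `fetch_min`/`fetch_max` to the signed `min`/`max` operations is an assumption; unsigned types would need `umin`/`umax`.

```python
# sx ordering name -> LLVM IR keyword. Only "relaxed" is renamed.
LLVM_ORDERING = {
    "relaxed": "monotonic",
    "acquire": "acquire",
    "release": "release",
    "acq_rel": "acq_rel",
    "seq_cst": "seq_cst",
}

# sx RMW name -> LLVM atomicrmw operation (signed min/max is an assumption).
LLVM_RMW = {
    "add": "add", "sub": "sub", "and": "and", "or": "or", "xor": "xor",
    "swap": "xchg", "fetch_min": "min", "fetch_max": "max",
}

# Orderings LLVM rejects for a given kind of operation.
_FORBIDDEN = {
    "load": {"release", "acq_rel"},
    "store": {"acquire", "acq_rel"},
    "fence": {"relaxed"},
}


def _ordering(kind, ordering):
    if ordering not in LLVM_ORDERING:
        raise ValueError(f"unknown ordering: {ordering}")
    if ordering in _FORBIDDEN.get(kind, ()):
        raise ValueError(f"{kind} cannot use ordering {ordering}")
    return LLVM_ORDERING[ordering]


def emit_load(ty, ptr, ordering, align):
    return f"load atomic {ty}, ptr {ptr} {_ordering('load', ordering)}, align {align}"


def emit_store(ty, value, ptr, ordering, align):
    return f"store atomic {ty} {value}, ptr {ptr} {_ordering('store', ordering)}, align {align}"


def emit_rmw(op, ty, ptr, value, ordering):
    return f"atomicrmw {LLVM_RMW[op]} ptr {ptr}, {ty} {value} {_ordering('rmw', ordering)}"


def emit_fence(ordering):
    return f"fence {_ordering('fence', ordering)}"
```

Rejecting an invalid ordering at this point keeps the error in sx terms instead of surfacing later as an LLVM verifier failure.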

**N1 is a prerequisite for M:N scheduling (A5) and N3, and is broadly useful**
(lock-free queues, refcounts, the allocator). It is the load-bearing new primitive
this revision adds.

---

## 4. Async — colorblind, stackful, pure-sx

**Commitment:** no function coloring, no async→state-machine transform. Async is a
capability carried in `context` (like `context.allocator`), not a property of a
function's signature. A function does I/O through `context.io`; whether the call
suspends is decided by the `Io` *implementation*, transparently.

| ID | Piece | Notes | Size |
|----|-------|-------|------|
| **A1** | **`Io` interface + `context.io`** — a protocol/vtable threaded like `Allocator`. `io.async(fn,args) → Future`, `future.await`, cancellation. | leverages protocols + context | M |
| **A2** | **Stackful coroutine runtime — in sx lib, NOT a compiler builtin.** The context-switch is a `callconv(.naked)` sx fn with an inline-asm body (save callee-saved + SP/LR into `*from`, load from `*to`, `ret`); fiber bootstrap + stack alloc (`mmap`+guard via `extern`) also sx. The **compiler's** job is only (a) the general primitives — inline asm, `callconv(.naked)`, atomics — and (b) **fiber-safe codegen**: `context` lowered as a *repointable indirection* (never raw TLS) so the switch can repoint it, and stack-limit guards (if emitted) read from a swappable per-fiber location. Most arch-delicate sx in the tree (must match the platform callee-saved set + the compiler ABI), but it's inspectable sx, not a black box. | per-arch, arch-gated; co-validate vs codegen | M |
| **A3** | **Event-loop `Io` impls** — kqueue / epoll / io_uring drive readiness, then the (now-ready) syscall via C1. Plus a trivial **blocking `Io`**. | pure sx around syscall `extern`s | L |
| **A4** | **Stdlib I/O rework** — fs/socket/process take/use `context.io` instead of raw blocking syscalls, so existing calls participate in async. | mirrors the allocator-threading rule | M |
| **A5** | **Schedulers — M:1 → N×(M:1) → M:N, all sx std-lib `Io` vtables (committed; M:N last, not deferred).** M:1 first (minimal vehicle to validate the colorblind stack; covers I/O-bound). N×(M:1) = first parallel step (per-thread M:1 loops + `std/thread.sx` spawn; shared state uses N1 atomics — expected under parallelism, not a wart). M:N work-stealing last (most machinery: thread-safe steal queues + migration + errno/TLS discipline). All over N1 atomics + the A2 asm context-switch + `extern` syscalls. **pinning** API for thread-affine work (UI main thread, GL context). | see §4.3 | M (M:1) / M (N×M:1) / L (M:N) |
| **A6** | **Cancellation.** A `.canceled` variant in the existing `!` error channel (model (a)); per-fiber atomic flag (N1); every `io.*` is a cancellation point; structured cancel-and-join; masked during cleanup. | see §7 step 11 and §9 | — |

### 4.1 How control enters sx (the colorblind model)

- **sx→sx is ordinary.** The whole call chain lives on the fiber stack; a suspend
  at a leaf `io.*` freezes the native stack verbatim. No frame knows it suspended.
  **Zero special handling at call boundaries** — that's the point.
- **Three inbound boundaries** where the runtime enters sx:
  1. **Task entry** (`io.async(fn)`) — a trampoline starts `fn` on a fresh fiber
     stack via the normal calling convention.
  2. **Resumption** — a context-switch (asm), *not* a call; sx continues mid-stack.
  3. **C callback → sx** — must be `export`/`callconv(.c)`; runs on the event-loop
     stack (not a fiber) so it **cannot itself suspend** — it may resume/enqueue a
     fiber or run a non-suspending sx fn to completion (leaf-only).

### 4.2 `context` is fiber-local (the key obligation)

`context.io`/`context.allocator`/the `push Context` stack are dynamically scoped.
Fibers time-share OS threads (and **migrate** under M:N), so `context` must travel
**with the fiber** — saved/restored on every context-switch — **never a raw TLS
read.** A spawned task snapshots the spawner's context, then evolves its own
`push Context` stack. This is the CLAUDE.md "capture your owning allocator" rule one
level up: ambient state that outlives a suspension point must be carried by the
fiber.
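
The save/restore rule can be modelled in Python with generators standing in for fibers. This is only a sketch: generators are stackless, so they imitate the scheduling but not the native stack switch, and `Scheduler`, `Fiber`, and `Context` here are hypothetical model names rather than sx types.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Context:
    allocator: str
    io: str


class Fiber:
    def __init__(self, body, context):
        self.context = context      # carried by the fiber, not by the OS thread
        self.steps = body()         # generator standing in for a suspended stack


class Scheduler:
    def __init__(self):
        self.current = None         # the one repointable slot `context` reads through
        self.ready = []

    def context(self):
        return self.current.context

    def spawn(self, body, parent=None, **overrides):
        base = parent.context if parent else Context("root_alloc", "blocking_io")
        fiber = Fiber(body, replace(base, **overrides))  # snapshot, then evolve
        self.ready.append(fiber)
        return fiber

    def run(self):
        while self.ready:
            fiber = self.ready.pop(0)
            self.current = fiber    # repoint on every switch-in
            try:
                next(fiber.steps)
                self.ready.append(fiber)
            except StopIteration:
                pass


sched = Scheduler()
log = []


def worker(name):
    def body():
        for step in range(2):
            log.append((name, step, sched.context().allocator))
            yield                   # suspend point
    return body


a = sched.spawn(worker("a"), allocator="arena_a")
b = sched.spawn(worker("b"), parent=a, allocator="arena_b")
sched.run()
```

The two fibers interleave on one thread, yet each read of `sched.context()` returns the running fiber's own allocator, and `b` still inherits `io` from its spawner. A raw thread-local would instead return whatever the last writer left behind.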

### 4.3 Threads & the two hazard classes (why atomics)

| Model | Parallelism | Migration | Hazards |
|-------|-------------|-----------|---------|
| **M:1** (1 OS thread) | none | none | cooperative, race-free — simplest |
| **N×(M:1)** (per-thread schedulers, no migration) | yes | none | **data races** on shared state → atomics/locks |
| **M:N** (work-stealing) | yes | yes | data races **+** TLS-migration hazards |

- **Parallelism hazard** (any N>1): shared mutable state races → needs **N1
  atomics** + N3 fiber-aware sync. The M:1 "no locks" simplicity is gone.
- **Migration hazard** (M:N only): a fiber that moves threads across a suspend
  reads the *wrong* thread's TLS. **`errno` must be captured immediately** after
  each syscall; **`context` must be fiber-local** (§4.2) — non-negotiable under M:N.
- **Pinning** (`io.pinToThread()`): some work must stay put — the **UI main
  thread** (UIKit/macOS/Android — directly the app targets in §6), OpenGL
  current-context, TLS-using FFI. M:N needs a "don't migrate / main-thread-only"
  fiber attribute (Go's `LockOSThread`).

### 4.4 Pure-sx boundary

Everything is sx except the irreducible FFI floor: the **asm context-switch**
(per-arch, in `.sx`), **syscall `extern`s** (kernel-implemented, like any libc
binding), and **raw stack memory** (`mmap`). The schedulers, event loops, futures,
cancellation, and sync primitives are ordinary sx. Payoff: **swappable `Io`
vtables** — blocking, io_uring, kqueue, a **mock `Io`** for tests, a
**deterministic-simulation `Io`** (fake clock, scripted readiness) for reproducible
concurrency tests — all libraries.

### 4.5 Comptime async = blocking `Io`

At comptime install the **blocking `Io`**: `io.*` just blocks; no fibers, no
scheduler, no suspend. Same source, different vtable. The interpreter never needs
suspend/resume, and the FFI (C1) needs no async awareness. This is *why* the
colorblind model resolves comptime async for free.
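
"Same source, different vtable" can be illustrated with a small Python model. `BlockingIo`, `SimIo`, and `read_all` are hypothetical names, and the `io` parameter stands in for `context.io`. The blocking variant calls the OS directly. The simulated variant replays a script and advances a fake clock, in the spirit of the deterministic-simulation `Io` from §4.4.

```python
import os


class BlockingIo:
    """Calls straight into the OS; the caller simply blocks."""

    def read(self, fd, n):
        return os.read(fd, n)


class SimIo:
    """Deterministic stand-in: scripted chunks per fd and a fake tick clock."""

    def __init__(self, script):
        self._script = {fd: list(chunks) for fd, chunks in script.items()}
        self.clock = 0

    def read(self, fd, n):
        self.clock += 1                      # one tick per call, no real time
        pending = self._script[fd]
        if not pending:
            return b""                       # scripted end-of-file
        chunk = pending.pop(0)
        if len(chunk) > n:
            pending.insert(0, chunk[n:])     # keep the unread remainder
        return chunk[:n]


def read_all(io, fd):
    """Application code: identical no matter which Io is installed."""
    out = b""
    while True:
        chunk = io.read(fd, 4096)
        if not chunk:
            return out
        out += chunk


# Blocking run against a real pipe.
r, w = os.pipe()
os.write(w, b"hello world")
os.close(w)
blocking_result = read_all(BlockingIo(), r)
os.close(r)

# Simulated run against a script; fd 7 is an arbitrary label.
sim = SimIo({7: [b"hello ", b"world"]})
sim_result = read_all(sim, 7)
```

Comparing the two results is also the simplest form of calibrating a simulated `Io` against the blocking one.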

### 4.6 Syntax surface (grounded against the grammar)

All of the concurrency/atomics surface lands on **existing** sx grammar — `enum`
tagged unions + `if x == { case … }` match ([specs.md:364,408](../specs.md#L408)),
first-class **tuples** with named fields ([specs.md:815-852](../specs.md#L815)),
`=>` closures, `struct($T)` generics, `callconv(...)`, and the ERR keywords
(`try`/`catch`/`onfail`/`raise`/`error`). `race`/`async`/`await`/`atomic` are **not
reserved words** ([specs.md:168](../specs.md#L168)), so they stay library
types/methods — no keyword additions. One genuinely-new compiler capability is
required (see end).

**Atomics (N1) — generic wrapper type.**
```sx
Ordering :: enum { relaxed; acquire; release; acq_rel; seq_cst; }
Atomic   :: ($T: Type) -> Type #builtin;   // atomicity carried by the type

counter : Atomic(i64) = .init(0);
counter.store(0, .relaxed);
n    := counter.load(.acquire);
prev := counter.fetch_add(1, .seq_cst);            // + fetch_sub/and/or/xor/min/max (no nand)
old  := counter.swap(42, .acq_rel);
got  := counter.compare_exchange(old, new, .acq_rel, .acquire);        // strong → ?T (null = success)
got2 := counter.compare_exchange_weak(old, new, .acq_rel, .acquire);   // may fail spuriously; for retry loops
fence(.seq_cst);
```
- CAS takes **two orderings** (success, failure); failure ordering may not be
  `release`/`acq_rel` nor stronger than success — enforce in the compiler.
- Weak vs strong matters on **aarch64** (LL/SC) — weak in a loop is the idiom;
  both compile identically on x86.
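
A lock-backed Python model makes the `?T` convention and the two-ordering rule concrete. It is a sketch of the semantics only: the lock replaces real atomic instructions, orderings are validated and then ignored, and ranking `acquire` and `release` as equally strong is an assumption of this model.

```python
import threading

# Strength ranking used for the "failure not stronger than success" check.
# Treating acquire and release as equal is an assumption of this sketch.
_RANK = {"relaxed": 0, "acquire": 1, "release": 1, "acq_rel": 2, "seq_cst": 3}


def check_cas_orderings(success, failure):
    if failure in ("release", "acq_rel"):
        raise ValueError("failure ordering may not be release/acq_rel")
    if _RANK[failure] > _RANK[success]:
        raise ValueError("failure ordering may not be stronger than success")


class Atomic:
    """Model of Atomic($T): a lock stands in for the hardware atomics."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self, ordering="seq_cst"):
        with self._lock:
            return self._value

    def compare_exchange(self, expected, new, success, failure):
        check_cas_orderings(success, failure)
        with self._lock:
            if self._value == expected:
                self._value = new
                return None             # null = success
            return self._value          # failure: the value actually observed


def fetch_add(atom, delta):
    """CAS retry loop; returns the previous value."""
    seen = atom.load("relaxed")
    while True:
        observed = atom.compare_exchange(seen, seen + delta, "acq_rel", "acquire")
        if observed is None:
            return seen
        seen = observed                 # retry against the fresher value


counter = Atomic(0)


def bump():
    for _ in range(1000):
        fetch_add(counter, 1)


workers = [threading.Thread(target=bump) for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

Returning the observed value on failure is what lets the retry loop continue without a second load.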

**Channels (N3) — methods only (no `<-`); `recv` returns a tagged union (not `(v, ok)`).**
```sx
RecvResult :: enum($T: Type) { value: T; closed; }        // shown as enum($T) for brevity; lands as a type-fn over reify (§7 step 2), not the race-synthesized union
TryResult  :: enum($T: Type) { value: T; empty; closed; } // non-blocking: 3 states a bool can't express

ch := Channel(i64).make(16);     // capacity; .make() unbuffered
ch.send(v);
if ch.recv() == { case .value: (v) { use(v); }  case .closed: { /* drained */ } }
ch.close();
// ergonomic layer: `for ch (v) { … }` consumes until closed, hiding RecvResult
```
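
The three receive states are easy to pin down in a single-threaded Python model. This is a sketch: `Channel` here is a hypothetical model class, tuples stand in for the tagged union, and the points where a real fiber would suspend raise instead. Send-after-close is left unspecified, as in the design.

```python
from collections import deque


class Channel:
    """Buffered channel model; no fibers, so nothing ever suspends."""

    def __init__(self, capacity):
        self._buf = deque()
        self._capacity = capacity
        self._closed = False

    def send(self, value):
        if len(self._buf) >= self._capacity:
            raise RuntimeError("buffer full (a fiber would suspend here)")
        self._buf.append(value)

    def close(self):
        self._closed = True

    def try_recv(self):
        """Non-blocking receive: ("value", v), ("empty",) or ("closed",)."""
        if self._buf:
            return ("value", self._buf.popleft())
        return ("closed",) if self._closed else ("empty",)


def drain(ch):
    """Model of `for ch (v) { ... }`: yield values until the channel is closed."""
    while True:
        result = ch.try_recv()
        if result[0] == "value":
            yield result[1]
        elif result[0] == "closed":
            return
        else:
            raise RuntimeError("empty but open (a fiber would suspend here)")


ch = Channel(2)
first = ch.try_recv()        # nothing sent yet
ch.send(1)
ch.send(2)
second = ch.try_recv()
ch.close()
rest = list(drain(ch))       # buffered values are still delivered after close
last = ch.try_recv()
```

`closed` is reported only once the buffer has drained, which matches the `/* drained */` comment in the sx example, and `empty` and `closed` stay distinct, which a `(v, ok)` pair cannot express.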

**Fiber-aware locks (N3) — explicit lock + `defer` (no guard sugar).**
```sx
m : Mutex;
m.lock();  defer m.unlock();
```

**Futures & spawn (A1).**
```sx
f := context.io.async(worker, arg);     // Future(R)
r := f.await();                         // suspends this fiber
f.cancel();
d := context.io.timeout(5000);          // a Future too — raceable like any other
```

**Pinning (A5) — spawn attribute, accepts a thread handle.**
```sx
PinTarget :: enum { any; main; on: Thread; }            // default = .any (may migrate)
f := context.io.async(render, pin = .main);
g := context.io.async(worker, pin = .on(some_thread));
```

**`race` (Zig model — over futures, named tuple in → synthesized tagged-union out).**
The input is a **named tuple** (positional also allowed → `.0`/`.1` tags); the
result is an anonymous tagged union whose variants mirror the tuple's labels, each
payload = that field's `Future(T)` projected to `T`. Losers are **cancelled and
joined** before `race` returns (structured).
```sx
fa := context.io.async(read_a, conn);     // Future(A)
fb := context.io.async(read_b, conn);     // Future(B)

winner := context.io.race((a: fa, b: fb));   // RaceResult = enum { a: A; b: B }
if winner == {
    case .a: (v) { handle_a(v); }            // v : A
    case .b: (v) { handle_b(v); }            // v : B
}
// positional form: race((fa, fb)) → tags .0 / .1
```
The Go-style handler-map and the map literal that propped it up are **dropped** —
`race` over futures subsumes select, and cancellation handles the losers.

**Cancellation rides ERR.** A cancelled `io.*` **raises**; the fiber unwinds
through `defer`/`onfail` (`try`/`catch`/`raise` are real keywords). Cancellation is
**cooperative** (observed only at suspend points — every `io.*` is a cancellation
point) and **structured** (`race` joins losers' teardown before returning). No
parallel unwind path — it reuses the error channel.
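
The cancel-and-join contract of `race` can be modelled with Python's `asyncio`. Note the mismatch: `asyncio` is a coloured `async`/`await` model, which sx explicitly rejects, so it is used here only to show the ordering of events. `race`, `reader`, and the delays are hypothetical.

```python
import asyncio


async def race(**futures):
    """Return (tag, value) of the first finished task; cancel and join the rest."""
    tasks = {asyncio.create_task(coro): tag for tag, coro in futures.items()}
    done, pending = await asyncio.wait(list(tasks), return_when=asyncio.FIRST_COMPLETED)
    for loser in pending:
        loser.cancel()                                      # cooperative request
    await asyncio.gather(*pending, return_exceptions=True)  # join losers' teardown
    winner = next(t for t in tasks if t in done)            # first in input order
    return tasks[winner], winner.result()


events = []


async def reader(name, delay, value):
    try:
        await asyncio.sleep(delay)          # suspend point = cancellation point
        return value
    finally:
        events.append(f"{name} cleanup")    # plays the role of defer/onfail


async def main():
    result = await race(a=reader("a", 0.01, "A"), b=reader("b", 5.0, "B"))
    events.append("race returned")
    return result


tag, value = asyncio.run(main())
```

The loser's cleanup runs before `race` returns, which is the structured guarantee. The cancellation itself arrives as an exception at the loser's suspend point, mirroring "cancellation rides ERR".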

**Context switch (A2).**
```sx
swap_context :: (from: *Fiber, to: *Fiber) callconv(.naked) {
    asm { /* save callee-saved + SP into *from; load from *to; ret */ };
}
```
`callconv(.naked)` ≠ `callconv(.c)`: **no prologue/epilogue/frame** — required
because a context switch deliberately makes SP-in ≠ SP-out (a `.c` epilogue would
restore from the wrong stack). Body is a single `asm` block; you emit your own
`ret`. Args arrive in ABI registers, read directly from asm.

**One new compiler capability (gates `race`):** *comptime tuple→tagged-union
synthesis.* Reflection today only **reads** types (`field_count`/`field_name`/
`type_of`); `RaceResult(T)` must **construct** an anonymous `enum` from a tuple's
`(label, payload-type)` pairs. Supporting pieces: a `field_type($T, i) -> Type`
reflection accessor (we have value-level `field_value` + `type_of`, but type-only
field projection is missing) and `Future(T) → T` projection (falls out of
generics). This is the generic "derive a sum from a product" — useful beyond
`race`.

---

## 5. Dev loop / hot-reload

| ID | Piece | Notes | Depends | Size |
|----|-------|-------|---------|------|
| **R1** | **Hot-reload (dylib swap)** — host owns `State`+allocator; reloadable module is a `.dylib` with a fixed `export` interface; watch→rebuild→`dlopen`→rebind→`dlclose`. State survives (host-owned). | leans on `export` (shipped); sidesteps S2; native | — | M |
| **R2** | **Hot-reload (JIT-resident)** — program runs under S1's LLJIT; reloadable calls route through ORC indirection stubs, repointed on change. Finer granularity; same spine. | | S1, S2 | L |
| **R3** | **Incremental compilation** — dependency tracking + recompile-only-changed. Perf enabler; coarse per-file v1 suffices first. | | — | L |

**Core rule:** the data that must survive a reload cannot be owned by the code that
reloads. Code/state separation — the CLAUDE.md owning-allocator discipline, one
level up.

**Residue — state migration on layout change:** body-only changes hot-swap;
layout/signature/global-type changes are **detected** (compare new vs running
`State` layout via `types.zig`) and trigger **rebuild+restart**. Migration hooks
(`on_reload(old)→new`) are a hard later item. Design against *silent* corruption.
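
A minimal Python model of the detect-then-degrade decision, assuming the layout is available as a list of `(name, type, offset)` records. `Field`, `diff_layout`, and `reload_action` are hypothetical names, and the real comparison would go through `types.zig`.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    type_name: str
    offset: int


def diff_layout(running, rebuilt):
    """List the differences between the running State layout and the rebuilt one."""
    old = {f.name: f for f in running}
    new = {f.name: f for f in rebuilt}
    problems = [f"removed: {n}" for n in sorted(old.keys() - new.keys())]
    problems += [f"added: {n}" for n in sorted(new.keys() - old.keys())]
    for n in sorted(old.keys() & new.keys()):
        if old[n] != new[n]:
            problems.append(
                f"changed: {n} ({old[n].type_name}@{old[n].offset}"
                f" -> {new[n].type_name}@{new[n].offset})"
            )
    return problems


def reload_action(running, rebuilt):
    """Hot-swap only when the layout is byte-for-byte compatible."""
    problems = diff_layout(running, rebuilt)
    if problems:
        return ("rebuild+restart", problems)
    return ("hot-swap", [])


v1 = [Field("score", "i64", 0), Field("name", "string", 8)]
v2 = [Field("score", "i64", 0), Field("name", "string", 8)]   # body-only change
v3 = [Field("score", "i32", 0), Field("lives", "i32", 4), Field("name", "string", 8)]
```

Reporting the differences alongside the decision keeps the restart loud instead of silent.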

---

## 6. Cross-platform (mostly landed) — from a macOS laptop

### 6.1 Landed

| Capability | State | Reach from a mac |
|---|---|---|
| `extern`/`export` C linkage | done (replaced `#foreign`) | all targets |
| Bundled-`zig cc` cross-link backend | Phases 0–2 done; packaging pending | **macOS, Linux(-musl/static), Windows(-gnu)** verified |
| sx-side bundler (`.app`/`.apk`) | done | macOS, iOS sim/device, Android |
| JIT `sx run` (ORC LLJIT) | done | host |
| Target shorthands | done | `macos[-arm]`, `linux[-musl[-arm]]`, `windows[-gnu]`, `ios[-arm]`, `ios-sim[-arm/-x86]`, `android[-arm64/-x86_64]`, `wasm` |

### 6.2 Workflows

```sh
# macOS (native): inner loop is JIT; ship is Mach-O / .app
sx run app.sx
sx build app.sx -o app
sx build app.sx --bundle MyApp.app

# Linux (cross, landed killer feature): static, zero-dep ELF
sx build app.sx --target linux-musl -o app      # scp anywhere, runs

# Windows (cross, landed, MinGW path): PE32+
sx build app.sx --target windows-gnu -o app.exe # cf. example 1660 (win32)

# iOS simulator (mac-only host)
sx build app.sx --target ios-sim --bundle App.app

# iOS device — signing threaded via the build program (BuildOptions setters)
#   #run { o := build_options(); o.set_bundle_id(...); o.set_codesign_identity(...);
#          o.set_provisioning_profile(...); }
sx build build.sx --target ios --bundle App.app

# Android (cross + bundle): javac → d8 → aapt2 → zipalign → apksigner, then adb
sx build app.sx --target android --apk app.apk
```

### 6.3 Where the roadmap lights up cross-platform

- **C1** (C4, JIT-the-bundler, is dropped per Decision 6) → the iOS/Android **bundlers** (orchestrate ~a dozen host tools at
  comptime; biggest win; always host-arch so no cross-arch risk).
- **R1/R2 + A1–A5** → the **inner dev loop for non-host targets**: push-a-dylib +
  remote-trigger-reload over an async laptop↔device channel — a capability that
  *doesn't exist today* short of full rebuild+reinstall.
- **A1/A2 colorblind `Io`** → the dev tooling is itself async, and the **same
  networking code runs blocking inside the bundler** (`adb push`) and async in the
  live session — no coloring.
- **Pinning (A5)** → the UI render fiber pins to the main OS thread on every app
  target.

**The single hard constraint the matrix exposes:** cross builds mean target arch ≠
host arch, so **C3's residue bites** — comptime/`#run` code reaching *target-arch*
inline asm can't execute on the mac. Native macOS dev never hits it; every cross
target must gate comptime asm to host-arch (`when host_arch == …`) or get a loud
diagnostic.

---

## 7. Linear build sequence (async-first — no parallel streams)

Single ordered list; deps satisfied at every step. **Async-first** (user-chosen): the
async story needs no JIT spine (syscalls use the existing trampoline FFI; comptime
async = blocking `Io`), so the FFI/JIT cluster comes *after*. C4 is omitted (dropped —
an S1 optimization if ever profiled). Net-new compiler prereqs (per the codebase
grounding) are explicit steps, not buried.

**Foundations — compiler primitives the async story needs (all net-new):**
1. **N1 — Atomics lowering.** IR/inference scaffolding exists; add LLVM
   `atomicrmw`/`cmpxchg`/`fence` emission + orderings. Surface = `Atomic($T)` wrapper.
   Gates channels/N3 + parallel schedulers.
2. ~~**Generic enums** `enum($T)`~~ **DROPPED.** `RecvResult($T)`/`TryResult($T)` are
   **type-fns over `reify`** (step 3), not a new `enum($T)` language feature — and
   type-fns (user `($T)->Type` in type position) **already work** (e.g.
   [`Make`](../examples/0208-generics-value-param-type-function.sx),
   [`Complex`](../examples/0201-generics-generic-struct.sx)). A declarative `enum($T)`
   surface, if ever wanted, is later *sugar* desugaring to a type-fn-over-`reify`.
3. **`type_info` + `reify` + `field_type`** — comptime metaprogramming floor. Gates
   `race` synthesis **and** channel `RecvResult`/`TryResult` (all type-fns over
   `reify`; **generic-enum syntax dropped**). **Validated against the codebase (3
   reviewers): a small extension reusing existing machinery throughout — not net-new
   architecture.** Five contracts:
   1. **Nominal identity via type-fn memoization** — type-fns dedup by mangled
      `(fn,args)` name (generic.zig:1620-1629) + reify `findByName`, so `RecvResult(i64)`
      is one `TypeId` and the body runs once. (NOT structural dedup — enums are
      nominal via `nominal_id`, types.zig:1110.)
   2. **Functional through codegen** — layout / construct / match+exhaustiveness /
      `toLLVMType` / `type_name`+format are **all type-table-driven, zero AST
      coupling**, so a backing-decl-less reify'd enum flows through unmodified.
   3. **Validate loudly** at the single `intern`/`internNominal` choke point
      (types.zig:411-439): reject dup variants / bad backing / unresolved payloads.
   4. **Comptime-only, JIT-free** — a type-table op in the interp; no S1 dependency
      (keeps reify, hence channels + `race`, off the JIT critical path).
   5. **Reference-based self-reference (v1)** — `*Self`/`[]Self` payloads via the
      reserve-placeholder→complete path recursive *source* types already use
      (nominal.zig:86/108/120, types.zig:442); **by-value recursion rejected** (loud,
      infinite size). reify gains a `reify_rec((self) => …)` builder form.
   - **Type-minting precedents (7):** monomorphization, protocol vtables, tuples,
     vector/array, ptr/slice ctors, FFI stubs, **type-fn instantiation** — all
     construct `TypeInfo` programmatically + `intern()`. **Residual = plumbing, not
     capability:** name reify-results by the instantiation's mangled name (done for
     inline-struct bodies — extend to reify-results) + reify input validation.
4. **`callconv(.naked)`** — extend `CallConv {default, c}` (types.zig:169) + skip
   prologue/epilogue lowering. Gates A2.
5. **Repointable-`context` codegen** — lower `context` as a swappable indirection
   (never raw TLS) + per-fiber stack-limit. Compiler obligation; gates A2 *and*
   cross-fiber `context.io` correctness. (Reviewer note: this is a **prerequisite**
   of A2, not a successor.)

**Async runtime — sx lib over the primitives:**
6. **A1 — `Io` interface + `context.io` + `Future` + `cancel()` API.**
7. **A2 — fiber runtime** (naked context-switch asm, bootstrap, `mmap` stacks).
8. **A3 — blocking `Io` → deterministic-sim `Io` (keystone, calibrated) → event-loop `Io`.**
9. **A5·M:1 — single-thread scheduler.**
10. **N3 — fiber-aware sync** (channels/mutex/waitgroup; `recv → RecvResult`).
11. **A6 — Cancellation.** `.canceled` in the `!` channel (model a); per-fiber atomic
    flag (N1); every `io.*` a cancellation point; structured cancel-and-join; **masked
    during cleanup**.
12. **A4 — stdlib I/O rework** (fs/socket/process onto `context.io`).
13. **A5·N×(M:1)** — first parallel (errno-capture + `context`-fiber-local discipline).
14. **A5·M:N** — work-stealing (steal queues + migration + pinning).

**Then comptime / FFI / JIT cluster:**
15. **S1 — persistent JIT spine.**
16. **C1 — real FFI** (LLVM = ABI authority, on S1).
17. **C2 — `#compiler`→`extern`.**
18. **C3 — comptime asm** (S1 + C1; +S2 if TLS/ctors).

**Deferred tail:**
19. **S2 — ORC C++ shim** (highest-risk — see §8; macOS `MachOPlatform`; ELF/COFF
    unplanned).
20. **R1 — dylib reload** (shipped `export`).
21. **R2 — JIT-resident reload** (S1 + S2; **↔ async live-fiber coupling**, §8).
22. **R3 — incremental compilation.**

Hard edges to remember: **C1 depends on S1** (the non-trivial FFI cases); **C3 depends
on C1** (calls through its thunk path); **R1/R2 couple to the async runtime** (can't
hot-swap code with live suspended fibers — runtime + long-lived fibers stay
persistent, only leaf logic reloads).
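
The claim "deps satisfied at every step" is mechanically checkable. The Python sketch below encodes the sequence and only the dependencies this document states outright. The step labels (`reify`, `naked`, `context`, `A5-M1`, and so on) are shorthand invented for the model.

```python
# The linear sequence from this section (step 2, generic enums, is dropped).
ORDER = [
    "N1", "reify", "naked", "context",
    "A1", "A2", "A3", "A5-M1", "N3", "A6", "A4", "A5-NxM1", "A5-MN",
    "S1", "C1", "C2", "C3",
    "S2", "R1", "R2", "R3",
]

# Dependencies stated in this document; nothing inferred beyond them.
DEPENDS = {
    "A2": ["naked", "context"],             # steps 4 and 5 gate A2
    "N3": ["N1", "reify", "A2", "A5-M1"],   # atomic fast path + fiber suspend
    "A6": ["N1"],                           # per-fiber atomic flag
    "A5-NxM1": ["N1"],
    "A5-MN": ["N1", "A2"],
    "C1": ["S1"],
    "C2": ["C1"],
    "C3": ["S1", "C1"],
    "R2": ["S1", "S2"],
}


def first_violation(order, depends):
    """Return (step, missing_dep) for the first step scheduled before a dep, else None."""
    position = {step: i for i, step in enumerate(order)}
    for step in order:
        for dep in depends.get(step, ()):
            if position[dep] > position[step]:
                return (step, dep)
    return None


# A deliberately broken order: C1 moved ahead of S1.
broken = list(ORDER)
i, j = broken.index("S1"), broken.index("C1")
broken[i], broken[j] = broken[j], broken[i]
```

Running `first_violation(ORDER, DEPENDS)` returns `None` for the sequence as written, and the broken variant reports `("C1", "S1")`.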

---

## 8. Irreducible hard problems (detect-and-degrade, don't pretend)

1. **State migration across layout change** (R1/R2) → v1 detects + rebuild/restart;
   migration hooks later.
2. **Cross-arch comptime asm** (C3) → can't run on host; narrows the bail + loud
   diagnostic; gate to host-arch.
3. **M:N migration hazards** (A5) → errno-capture discipline + fiber-local context
   (mandatory), pinning for thread-affine work.

### 8.1 Highest technical risks (from review — ranked, async-first lens)

1. **A2 context-switch correctness** (in the async critical path). Silent stack
   corruption, per-arch, **untestable by the deterministic-`Io` harness** (it tests
   *scheduling*, not the *switch*); a one-register slip is invisible until it crashes
   on the right arch. Couples *library asm* to the *compiler ABI* — ABI drift breaks
   it silently later. → needs a dedicated **switch-stress test** (§10).
2. **`reify` → anonymous-tagged-union → match-codegen** (gates `race` + channels).
   **DE-RISKED by review** (§7 step 3): all enum stages are type-table-driven with
   zero AST coupling, identity is handled by existing type-fn mangled-name memoization,
   and forward-declaration for self-ref already exists. Residual is *plumbing*
   (name reify-results by mangled name + input validation), not new architecture.
3. **Deterministic-`Io` is the test keystone yet itself uncalibrated** — a buggy
   deterministic scheduler yields deterministic-*wrong* stdout that snapshots lock in.
   → calibrate against the blocking `Io` / property-test fixed order (§10).
4. **`context`-fiber-local + errno discipline** (A5 M:N). "Non-negotiable" but
   enforced by manual rule, not the compiler; M:1 can't even exercise migration.
5. **S2 ORC shim** (deferred, but highest-risk when reached): only C++ in the tree,
   **already failed a spike** (`_Thread_local` SIGABRT), `MachOPlatform` is
   macOS-specific — **Linux/Windows JIT-resident reload + non-Mac TLS/ctor JIT have no
   named plan**. One "M" box hides a per-OS effort.
6. **C1 args-buffer layout-vs-ABI** — "LLVM emits the call" covers the *call*, not the
   interpreter's *buffer pack* from `type_info`. Disagreement on edge layouts
   (over-aligned/empty structs, aarch64 small-struct register splitting, `bool`) =
   silent comptime corruption. → adversarial layout cases (§10).
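
Risk 6 suggests an obvious test shape: compute the buffer layout independently, then compare it against a reference that is known to follow the platform ABI. The sketch below does this in Python, with `ctypes` as a stand-in reference (in the real design the authority is LLVM). `pack_layout` and `Sample` are hypothetical.

```python
import ctypes


def pack_layout(field_types):
    """Natural-alignment layout: returns (offsets, total_size, struct_alignment)."""
    offset, max_align, offsets = 0, 1, []
    for ty in field_types:
        align = ctypes.alignment(ty)
        offset = (offset + align - 1) // align * align   # round up to alignment
        offsets.append(offset)
        offset += ctypes.sizeof(ty)
        max_align = max(max_align, align)
    size = (offset + max_align - 1) // max_align * max_align  # tail padding
    return offsets, size, max_align


class Sample(ctypes.Structure):
    # Mixed widths on purpose: bool and u8 next to 8-byte and 4-byte fields.
    _fields_ = [
        ("flag", ctypes.c_bool),
        ("value", ctypes.c_double),
        ("tag", ctypes.c_uint8),
        ("count", ctypes.c_int32),
    ]


offsets, size, align = pack_layout([ty for _, ty in Sample._fields_])
reference_offsets = [getattr(Sample, name).offset for name, _ in Sample._fields_]
```

The interesting cases are the ones where the two sides could disagree, so a real suite would add over-aligned, empty, and nested structs rather than this single well-behaved sample.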

---

## 9. Decisions log (all resolved)

**Sequencing — locked:** **async-first** (§7). The async cluster (steps 1–14)
precedes the FFI/JIT cluster (15–18) because async needs no JIT spine. **Cancellation
(A6) = model (a)** — a `.canceled` variant in the **existing `!` error channel** that
`io.*` already returns (I/O is inherently fallible, so `io.*` is already `!`-typed —
the "keep calls clean" argument for the non-local-`raise` model is moot). Reuses
`!`/`try`/`catch`/`onfail`; no new unwind primitive. **Net-new prereq surfaced by
grounding:** `callconv(.naked)` (only `.default`/`.c` today). **Generic enums dropped**
— `RecvResult($T)`/`TryResult($T)` are **type-fns over `reify`** (type-fns already work
in type position, e.g. `Make`/`Complex`), so no `enum($T)` feature is needed; `reify`
gains the five contracts listed in §7 step 3 (chiefly deterministic identity + functional-enum output).

**Locked (see §4.6 for the grounded surface):**
- **N1 atomics surface = generic wrapper `Atomic($T)`** + `Ordering` enum, `.init`,
  `compare_exchange`/`_weak` returning `?T` (**null = success** — pinned, opposite of
  most priors). (Not `@atomic_*` builtins — `@` is address-of in sx.) **RMW set** =
  `add/sub/and/or/xor/swap` + `fetch_min`/`fetch_max` (free from LLVM); **no `nand`**.
- **`race` = over futures** (Zig model), **single named-tuple in** (`race((a: fa, b:
  fb))`) → synthesized tagged-union out; Go-style handler-map + map literal
  **dropped**. **No `async` spawn-sugar** — always `context.io.async(...)`.
- **Channels** = `send`/`recv` methods (no `<-`); **`recv` returns a tagged union**
  `RecvResult($T){ value; closed }` (not `(v, ok)`), `try_recv` → `{ value; empty;
  closed }`; optional `for ch (v) {…}` iteration sugar. **locks** = `lock()` + `defer
  unlock()` (no guard sugar). `race`/`async`/`await` stay library, not keywords.
- **Comptime type metaprogramming = `type_info` + `reify` builtins only** (Zig
  `@typeInfo`/`@Type` model). **Everything else is sx lib** — `make_enum`,
  `field_type`, `RaceResult`. `reify` coverage starts at **enum/struct/tuple**, grows
  later. `Future($T)` exposes `Value :: T` so `Future(X)→X` is plain member access
  (no `type_arg` builtin).
- **C1 FFI engine = LLVM as single ABI authority** — per-signature JIT calling-thunks
  via S1 (LLVM emits the ABI-correct call, same as runtime codegen); trampoline
  fast-path for trivial calls. **libffi/dyncall + hand-rolled-sx rejected** (2nd/3rd
  ABI impl; hand-rolled needs C3 for its own asm leaf anyway). Promotes **S1 to
  foundational** (shared by C1, C3).

**Scheduler (Decision 5) — locked:** **M:1 → N×(M:1) → M:N**, all **sx std-lib `Io`
vtables** (compiler only provides N1 atomics + the A2 asm context-switch + `extern`
syscalls). M:1 ships first (validates the colorblind stack, covers I/O-bound);
N×(M:1) is the first parallel step; **M:N is last in sequence but committed, not
deferred.** Potential data races under parallelism are expected and handled with
atomics + fiber-aware sync. That *is* parallelism, not a wart; M:1 needing no locks is
just a property of the single-threaded case.

**Deferred, orthogonal additions (Decisions 6–7) — both addable later without
revisiting anything locked:**
- **C4 (Decision 6) — fully orthogonal; not built now.** Pure deferred optimization
  riding S1 (already present for C1/C3): JIT the bundler subgraph instead of
  interpreting it. Zero coupling — same bundler sx, same C1 FFI. Apply only if
  profiling ever shows the bundler's *own logic* is a hotspot (it's I/O-bound, so
  unlikely). Interp+C1 is the shipping bundler.
- **Hot-reload (Decision 7) — deferred; mechanism additive.** Substrate ready: R1
  (dylib-swap) needs only shipped `export`; R2 (JIT-resident) needs S1 + the S2 ORC
  shim. **R1-vs-R2 chosen at pickup.** One coupling (a design constraint, not a
  decision change): you can't hot-swap code while **live suspended fibers** still point
  into the old module. So the async runtime + long-lived fibers stay on the
  *persistent* side and only transient **leaf logic** is reloadable (the alternative
  is to quiesce fibers before the swap).

---

## 10. Testing & gates

Inherits the project cadence (CLAUDE.md): `zig build && zig build test` after every
step; **xfail-then-green or behavior-lock — no commit both adds a test AND makes it
pass**; never regenerate snapshots while red; corpus = `examples/` + `issues/` with
`.exit`/`.stdout`/`.stderr`/`.ir` snapshots. Per-*step* gates live in the eventual
`PLAN-*` streams; this section is the design-level verification strategy that those
streams must implement.

### 10.1 The async test harness = the deterministic-simulation `Io` (the keystone)

Concurrency is nondeterministic (scheduling/readiness order), which **breaks snapshot
testing** outright. So the **deterministic-sim `Io`** (fixed clock, scripted
readiness, deterministic single-stepping scheduler) is not merely a feature — it is
**the test harness for everything async**. Every concurrency example runs under it →
reproducible stdout → snapshottable. Consequence for sequencing: **build the
deterministic `Io` right after the blocking `Io`** (it's the simplest scheduler after
blocking and it *gates the ability to test* fibers/channels/race/schedulers at all).
The 10 patterns in the examples adjacent to §4.6 become corpus tests only because they
run under it.
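
A rough sketch of why such an `Io` makes output snapshottable, written in Python with generators standing in for fibers. Everything here is hypothetical (`DeterministicIo`, `spawn`, `run`, and the "yield a tick count to sleep" protocol are not the sx `Io` interface). It shows only the three ingredients named above: a virtual clock, a fixed tie-break order, and single-stepping each fiber to its next suspend point.

```python
import heapq
from typing import Generator, List, Tuple

Fiber = Generator[int, None, None]   # a fiber yields the number of ticks to sleep


class DeterministicIo:
    """Hypothetical sim scheduler: virtual clock, FIFO tie-break, no wall time."""

    def __init__(self) -> None:
        self.now = 0
        self._seq = 0                               # park order breaks timestamp ties
        self._timers: List[Tuple[int, int, Fiber]] = []

    def spawn(self, fiber: Fiber) -> None:
        self._park(fiber, 0)

    def _park(self, fiber: Fiber, delay: int) -> None:
        heapq.heappush(self._timers, (self.now + delay, self._seq, fiber))
        self._seq += 1

    def run(self) -> None:
        while self._timers:
            wake_at, _, fiber = heapq.heappop(self._timers)
            self.now = wake_at                      # jump the clock, never wait
            try:
                delay = next(fiber)                 # single-step to the next suspend
            except StopIteration:
                continue                            # fiber finished
            self._park(fiber, delay)


def run_once() -> List[str]:
    io = DeterministicIo()
    trace: List[str] = []

    def worker(name: str, period: int, rounds: int) -> Fiber:
        for i in range(rounds):
            trace.append(f"t={io.now} {name}{i}")
            yield period

    io.spawn(worker("a", 3, 2))
    io.spawn(worker("b", 2, 3))
    io.run()
    return trace


print("\n".join(run_once()))
```

Because ordering depends only on `(wake time, park sequence)`, two calls to `run_once()` produce the same trace, which is the property a `.stdout` snapshot relies on.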

### 10.2 What is NOT snapshot-testable

True parallel **data races** (N×(M:1) / M:N) are nondeterministic by construction. Code
exercising them still runs under the deterministic `Io` for *correctness* repro, but
race detection needs a separate **stress harness** (run-N-times / TSan-style), **not**
the corpus. Any such coverage bound must be stated loudly (a `log()`-style note in the
harness), never silently skipped, per the REJECTED-PATTERNS rule against silent gaps.
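
One possible shape for the run-N-times half of such a harness, sketched in Python rather than sx. The names `stress` and `locked_counter_scenario` are made up for illustration, and the scenario is deliberately a correct, lock-protected one. The point is the loud coverage note and the run count, not race detection itself.

```python
import sys
import threading


def stress(scenario, runs: int, log=sys.stderr) -> int:
    """Run `scenario` `runs` times; return how many runs violated its invariant."""
    # Loud coverage bound: N clean runs are evidence, not proof of race-freedom.
    print(f"NOTE: stress harness ran {runs} schedules; "
          "a clean result does not prove absence of races", file=log)
    return sum(0 if scenario() else 1 for _ in range(runs))


def locked_counter_scenario(threads: int = 4, increments: int = 1000) -> bool:
    """Invariant: a lock-protected counter ends at threads * increments."""
    total = 0
    lock = threading.Lock()

    def bump() -> None:
        nonlocal total
        for _ in range(increments):
            with lock:
                total += 1

    workers = [threading.Thread(target=bump) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return total == threads * increments


failures = stress(locked_counter_scenario, runs=20)
print(f"{failures} failing runs out of 20")
```

The note is emitted unconditionally, so a green run still records how little it proves.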

### 10.3 Arch-sensitive lowering — atomics + context-switch

Atomic orderings lower differently per arch (x86 `lock`-prefix / plain MOV vs aarch64
LL/SC / `ldar`/`stlr`), and the A2 context-switch is per-arch asm. Lock both with the
**existing inline-asm cross-arch sibling pattern**: a `.build` `{"target": "…"}`
sidecar runs **ir-only** on a non-matching host (asserts `.ir` + `.exit` + `.stderr`
from `sx ir --target`) and **end-to-end** on a matching CI runner. So `Atomic`
lowering carries **x86_64 + aarch64 `.ir`** snapshots; the context-switch gets
per-arch run tests on matching runners.

### 10.4 New corpus categories

`17xx` atomics · `18xx` concurrency (fibers/channels/race/async, all under the
deterministic `Io`). Comptime metaprogramming (`type_info`/`reify`) + comptime-asm
extend `06xx`; C1 FFI extends `12xx`; the cross-arch comptime-asm **loud bail** and
the cancellation diagnostics are `11xx`.

### 10.5 Per-piece gates (design level)

| Piece | Locks via |
|---|---|
| **N1 atomics** | unit `emit_llvm.test.zig` (LLVM `atomicrmw`/`cmpxchg`/`fence` + ordering emission); corpus `17xx` single-thread (deterministic); arch-gated `.ir` (x86_64 + aarch64) |
| **type_info / reify** | unit (reflect round-trips; reify'd enum has correct layout/match codegen); corpus `06xx` comptime (deterministic) |
| **C1 FFI** | **behavior-lock** existing trampoline cases first; then xfail→green `12xx` comptime extern with floats / structs-by-value / aggregate (`{ptr,len}`) returns; unit for thunk-synth + args-buffer marshal |
| **S1 spine** | infra — exercised transitively via C1/C3 examples; unit for LLJIT lifecycle + thunk cache |
| **C3 comptime asm** | corpus `06xx` host-arch `#run` asm computes a value; `11xx` diagnostic asserts the cross-arch loud bail |
| **A1/A2 fibers** | unit (scheduler step, fiber bootstrap); context-switch arch-gated run tests; corpus `18xx` under deterministic `Io` |
| **A3/A5 schedulers, channels, race, cancel** | corpus `18xx` (the 10 patterns) under deterministic `Io` → deterministic snapshots; cancellation cleanup (`onfail`/`defer`) asserted via stdout ordering |

### 10.6 Cadence example (atomics, N1)

1. **xfail** — add `examples/17xx-atomics-fetch-add.sx` using `Atomic(i64).fetch_add`; seed the `.exit` marker → **red** (codegen missing). *(test added, not yet passing)*
2. **green** — emit LLVM `atomicrmw add` + ordering; example passes; capture `.stdout` + x86_64/aarch64 `.ir` snapshots; review the diff. *(makes it pass, no new test)*

This satisfies "no commit both adds a test and makes it pass," and every other piece
follows the same xfail→green (or behavior-lock→extend) shape.

### 10.7 Review-surfaced gaps (the high-corruption-risk pieces need *correctness*, not existence, tests)

The §10.5 gates prove things *run*; the §8.1 risks are silent-corruption modes a
run/snapshot test won't catch. Each needs an explicit adversarial gate:

- **A2 context-switch — switch-stress test.** Scribble *every* callee-saved register
  + a stack-canary before suspend; deep/recursive fiber chains; verify all survive
  post-resume. Run/snapshot tests don't prove register preservation. (The single
  highest-corruption-risk piece, §8.1.1.)
- **Deterministic-`Io` — calibrate the oracle.** Cross-check a handful of cases
  against the blocking `Io` and property-test that scheduling order is actually fixed,
  *before* trusting it to gate everything async (a deterministic-but-wrong scheduler
  snapshots garbage).
- **`context`-fiber-local invariant: named test at the N×(M:1)/M:N step.** M:1 can't
  exercise migration; add a test that forces a fiber to migrate between threads and
  asserts it reads *its own* `context`/`errno`, not the new thread's.
- **N1 ordering *semantics* are out of snapshot scope — state it loudly.** `.ir`
  snapshots prove the *keyword emitted*, not weak-memory correctness (e.g. `relaxed`
  where `acquire` was needed ships green). Declare this out-of-scope parallel to
  §10.2's race carve-out; lock-free structures need the stress harness.
- **C1 args-buffer — adversarial layout cases.** Over-aligned structs, empty structs,
  aarch64 small-struct register splitting, `bool` — a wrong layout that happens to
  print right passes a stdout test. Call these out explicitly, not just
  "structs-by-value."
- **S2 — has no gate today despite a prior spike failure.** When reached, add a TLS +
  C-constructor JIT test (the exact `_Thread_local` SIGABRT case), per host OS.
- **Hot-reload — no row today.** When picked up: state-survival test + the
  live-suspended-fiber-into-stale-module hazard (R1/R2).
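
To make the C1 args-buffer point above concrete, here is a small Python illustration using `ctypes` and `struct`. It is not the C1 marshaller; `Tagged`, `marshal_packed`, and `marshal_aligned` are hypothetical names. It shows how a marshaller that ignores alignment can still get the first field right, so a test that prints only that field stays green.

```python
import ctypes
import struct


class Tagged(ctypes.Structure):
    # C layout: u8 tag, padding, then a naturally aligned u64.
    _fields_ = [("tag", ctypes.c_uint8), ("value", ctypes.c_uint64)]


def marshal_packed(tag: int, value: int) -> bytes:
    """Buggy marshaller: ignores alignment, then zero-pads to the struct size."""
    raw = struct.pack("=BQ", tag, value)        # "=" inserts no padding
    return raw.ljust(ctypes.sizeof(Tagged), b"\0")


def marshal_aligned(tag: int, value: int) -> bytes:
    """Correct marshaller: lets ctypes place each field at its ABI offset."""
    return bytes(Tagged(tag, value))


wrong = Tagged.from_buffer_copy(marshal_packed(1, 42))
right = Tagged.from_buffer_copy(marshal_aligned(1, 42))

# The first field survives the bug, so printing only `tag` would look correct.
print(wrong.tag, right.tag)
# The second field exposes it once the ABI requires padding before `value`.
print(wrong.value, right.value, Tagged.value.offset)
```

An adversarial gate would therefore assert every field (and the total size) against the platform ABI, not just whichever value the example happens to print.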