feat(asm): Phase C.1 + D — inline asm codegen (runs end-to-end)

lowerAsmExpr stops bailing and builds the inline_asm op: it resolves each operand's effective name (§II.5: explicit [name], else the {reg} pin), interns template/constraints/clobbers, lowers input Refs, and derives the result TypeId (0→void, 1→T). It also adds the last deferred validation (every %[name] must name an operand). Multi-output (N>1) bails with a named "Phase E" diagnostic.

emitInlineAsm (backend/llvm/ops.zig) ports Zig's airAssembly: it assembles the LLVM constraint string (outputs → inputs → ~{clobber}; a ',' inside a constraint → '|'), rewrites the template (%[name]→${N}, %%→%, $→$$, %=→${:uid}), then calls LLVMGetInlineAsm + LLVMBuildCall2 (AT&T dialect). Dispatch is wired in emit_llvm.zig, replacing the C.0 @panic tripwire.

inferType gains an .asm_expr arm (expr_typer.zig) so a bare `x := asm {…-> T}` binding types correctly; without it the binding inferred .unresolved and silently produced 0.

llvm_shim.c: adds LLVMInitializeNativeAsmParser(), because the JIT must assemble inline asm at run time.

Verified end-to-end on the aarch64 host: `add` with register-class inputs and a value output runs (exit 42), `mov` with a value output runs (exit 99), and `nop volatile` runs (exit 0). IR is textbook: `call i64 asm "add ${0}, ${1}, ${2}", "=r,r,r"(…)`.

Locked with 1645 (aarch64 add, runs; ir-only on non-aarch64) + 1646 (:= binding). Updated 1640 (now Phase-E bail) + 1642 (now runs). zig build test green (654 corpus, 446 unit).
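
For readers who want the template rewrite in isolation, here is a minimal C sketch of the same four rules (`$`→`$$`, `%%`→`%`, `%=`→`${:uid}`, `%[name]`/`%[name:mod]`→`${N}`/`${N:mod}`). It is not the code in this commit (that is the Zig `renderAsmTemplate` in ops.zig); the function names `rewrite_template` and `operand_index` are hypothetical, and the sketch assumes the caller passes operand names already in LLVM order (outputs first, then inputs) and has already validated the template.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: position of `name` (len bytes) in `names`, or -1.
 * `names` is assumed to be in LLVM operand order: outputs, then inputs. */
int operand_index(const char *const *names, size_t n_names,
                  const char *name, size_t len) {
    for (size_t i = 0; i < n_names; i++) {
        if (strlen(names[i]) == len && memcmp(names[i], name, len) == 0)
            return (int)i;
    }
    return -1;
}

/* Rewrite `tmpl` into LLVM inline-asm template syntax in `out`.
 * Returns 0 on success, -1 on an unknown name, an unterminated `%[`,
 * or insufficient space. */
int rewrite_template(const char *tmpl, const char *const *names,
                     size_t n_names, char *out, size_t cap) {
    assert(tmpl != NULL && out != NULL && cap > 0);
    size_t len = strlen(tmpl), o = 0, i = 0;
    while (i < len) {
        char piece[64];
        size_t adv = 2; /* input bytes consumed by this step */
        int w;          /* output bytes produced by this step */
        if (tmpl[i] == '$') {
            w = snprintf(piece, sizeof piece, "$$");
            adv = 1;
        } else if (tmpl[i] == '%' && tmpl[i + 1] == '%') {
            w = snprintf(piece, sizeof piece, "%%");
        } else if (tmpl[i] == '%' && tmpl[i + 1] == '=') {
            w = snprintf(piece, sizeof piece, "${:uid}");
        } else if (tmpl[i] == '%' && tmpl[i + 1] == '[') {
            const char *name = tmpl + i + 2;
            const char *close = strchr(name, ']');
            if (close == NULL) return -1;
            size_t name_len = (size_t)(close - name);
            const char *colon = memchr(name, ':', name_len);
            size_t mod_len = 0;
            if (colon != NULL) {
                mod_len = (size_t)(close - colon); /* includes the ':' */
                name_len = (size_t)(colon - name);
            }
            int idx = operand_index(names, n_names, name, name_len);
            if (idx < 0) return -1;
            w = snprintf(piece, sizeof piece, "${%d%.*s}", idx,
                         (int)mod_len, colon != NULL ? colon : "");
            adv = (size_t)(close - (tmpl + i)) + 1;
        } else {
            /* Ordinary byte, including a lone '%' not starting an escape. */
            piece[0] = tmpl[i];
            piece[1] = '\0';
            w = 1;
            adv = 1;
        }
        if (w < 0 || (size_t)w >= sizeof piece || o + (size_t)w >= cap)
            return -1;
        memcpy(out + o, piece, (size_t)w);
        o += (size_t)w;
        i += adv;
    }
    out[o] = '\0';
    return 0;
}
```

Called with the names `{"out", "a", "b"}` and the 1645 template `add %[out], %[a], %[b]`, the sketch yields the same template string that appears in the 1645 `.ir` snapshot. Unlike the Zig code, it reports an unknown name with an error return rather than relying on lowering to have ruled it out.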
2026-06-15 21:39:54 +03:00
parent 6c08de8ec1
commit 5a5e04c6d5
23 changed files with 395 additions and 50 deletions
--- a/current/CHECKPOINT-ASM.md
+++ b/current/CHECKPOINT-ASM.md
@@ -6,7 +6,31 @@ commit, one step at a time per the cadence rule (no commit may both add a test
 and make it pass).

 ## Last completed step
-**C.0** — IR op `inline_asm` (lock; no behavior change). Added `inline_asm:
+**C.1 + D** — inline asm CODEGEN (lowering builds the op + LLVM emit). **Inline
+assembly now runs end-to-end.** `lowerAsmExpr` (`src/ir/lower/expr.zig`) stops
+bailing: it resolves each operand's effective name (§II.5 auto-naming), interns
+template/constraints/clobbers, lowers input `Ref`s, derives the result `TypeId`
+(0→void, 1→T), and builds the `inline_asm` op. Added a `%[name]`-references-a-
+real-operand check (the last deferred validation). Multi-output (N>1) still bails
+loudly ("Phase E"). `emitInlineAsm` (`src/backend/llvm/ops.zig`, port of Zig's
+`airAssembly`): assembles the LLVM constraint string (outputs→inputs→`~{clobber}`,
+`,`→`|`), rewrites the template (`%[name]`→`${N}`, `%%`→`%`, `$`→`$$`, `%=`→
+`${:uid}`), then `LLVMGetInlineAsm` + `LLVMBuildCall2` (AT&T). Dispatch wired
+(`emit_llvm.zig`, replacing the C.0 `@panic`). **`llvm_shim.c`**: added
+`LLVMInitializeNativeAsmParser()` — the JIT must assemble inline asm at run time.
+Verified end-to-end: aarch64 `add`/`mov` run on the host (exit 42/99), `nop
+volatile` runs (1642 now exit 0), IR is textbook (`call i64 asm "add ${0}, ${1},
+${2}", "=r,r,r"(…)`). Locked with `examples/1645-platform-asm-aarch64-add.sx` (runs on
+aarch64, ir-only elsewhere via `.build` + `.ir`). Also added the `inferType`
+`.asm_expr` arm (`src/ir/expr_typer.zig`, 0→void / 1→T) — without it a bare
+`x := asm {…-> T}` binding inferred `.unresolved` and silently produced 0;
+regression-locked with `examples/1646-platform-asm-value-binding.sx`. Updated
+1640 (now Phase-E bail) + 1642 (now runs). `zig build test` green (654 corpus,
+446 unit). Files: `src/ir/lower/expr.zig`, `src/backend/llvm/ops.zig`,
+`src/ir/emit_llvm.zig`, `src/ir/expr_typer.zig`, `llvm_shim.c`,
+`examples/164{0,2,5,6}-*`.
+
+Prior: **C.0** — IR op `inline_asm` (lock; no behavior change). Added `inline_asm:
 InlineAsm` to the IR `Op` union + the `InlineAsm` struct (`template: StringId`,
 `operands: []const AsmOperand` {role/name/constraint/operand}, `clobbers:
 []const StringId`, `has_side_effects`) in `src/ir/inst.zig` — all strings
@@ -88,40 +112,40 @@ guards fire: corrupting the `.ir` → IR mismatch; deleting it → the require-f
 `src/corpus_run.test.zig`, `examples/1639-*`.

 ## Current state
-Phase A underway: `asm { … }` lexes (A.0) and **parses** into `AsmExpr` (A.1);
-lowering bails LOUD + named (no IR op / emit yet). Result-type derivation, the
-operand auto-naming rule, and the validation checklist are **Phase B** (not yet
-implemented — any asm reaching lowering errors out). The adopted **operand
-auto-naming rule** (design §II.5, decided this session): name auto-derived from a
-`{reg}` pin; explicit `[name]` only when it differs or for register-class (`=r`)
-operands; echo form `[eax] "={eax}"` rejected. Parser stores `name: ?[]const u8`;
-the rule is a Phase-B (typing) concern, so the parser needs no change for it.
+**Inline assembly works end-to-end for 0/1 value outputs.** Pipeline complete:
+lex (A.0) → parse (A.1) → validate (B.0/B.1 + the `%[name]` check) → IR op (C.0)
+→ lower-builds-op + LLVM emit + JIT asm-parser init (C.1/D). Single-value-output
+and no-output `volatile` asm assemble and execute on the host JIT; the auto-naming
+rule (§II.5) is live (effective name = explicit `[name]` else `{reg}`). **Phase E
+(multi-output tuples) is the remaining feature gap** — N>1 value outputs bail with
+a named "Phase E" diagnostic (1640). `-> @place` write-through outputs are still
+rejected at parse (Phase 2). Global asm (Phase F) not started.

 Known orthogonal bug: **issue 0137** — `sx run` on a program with no `main`
 segfaults (`src/target.zig:256-273`, unguarded JIT entry lookup). Pre-existing,
 asm-independent; does NOT block the ASM stream (every example has a `main`).

-Phase B–E feasibility already confirmed against the live tree
+Phase E–F feasibility already confirmed against the live tree
 (`LLVMGetInlineAsm` / `LLVMBuildCall2` / `LLVMAppendModuleInlineAsm` in LLVM@19
 `Core.h`; ERR-stream `extractvalue`→tuple in `emit_llvm.zig:726-927`; lib-less
 `extern`, 60 sites; `--target` a global CLI flag).

 ## Next step
-**C.1 + D together** (must land as one green step) — wire `lowerAsmExpr` to BUILD
-the `inline_asm` op (intern template + constraints + clobber names; resolve each
-operand's effective name via the §II.5 auto-naming rule; lower input `Ref`s;
-compute the result `TypeId` from the `out_value` operands — 0→void, 1→T, N→tuple,
-named) AND implement `emitInlineAsm` in `src/ir/emit_llvm.zig` (replacing the
-`@panic` tripwire) — the port of Zig's `airAssembly`: assemble the LLVM constraint
-string (outputs `=`/`+`, inputs, `clobbers`→`~{name}`), rewrite `%[name]`→`${N}` /
-`%%` / `%=`, `LLVMGetInlineAsm` + `LLVMBuildCall2`, AT&T dialect. They land
-together because the moment lowering stops bailing, emit is reached — a half-step
-would hit the tripwire. First target: the single-value-output syscall on
-`x86_64-linux` (ir-only via a `.build` `{ "target": "x86_64-linux" }` + `.ir`
-snapshot, since the host is aarch64). Result-type derivation for `expr_typer.zig`
-(`inferType` `.asm_expr` arm) also lands here — now observable. Then E (multi-
-return tuples) + remaining validation (`%[name]` references a real operand). See
-`PLAN-ASM.md` Phases C–E + design §II.6.
+**Phase E** (multi-output tuples) — replace the N>1 "Phase E" bail in
+`lowerAsmExpr`: build a tuple `TypeId` from the `out_value` types (named via the
+effective-name rule), set it as the op result, and in `emitInlineAsm` make the
+LLVM return type an anonymous struct `{T1,…,Tn}`, then `extractvalue i` per
+`out_value` → assemble the sx tuple. Lock with `divmod`→`(quot,rem)` (reuse 1640's
+shape, now running) + `cpuid`→4-tuple, arch-pinned. See `PLAN-ASM.md` Phase E +
+design §II.6 (multi-return). Also worth adding: the x86_64-linux syscall-write
+example (ir-only on this host via `.build { "target": "x86_64-linux" }` + `.ir`)
+to lock the cross-target lowering, per the plan's D verification.
+
+Then Phase 2 (`-> @place` write-through / read-write / indirect-memory) and Phase
+F (global asm + `extern` call into asm symbols). Result-type derivation for the
+0/1 cases now lives in BOTH `lowerAsmExpr` (the op's `Inst.ty`) and
+`expr_typer.zig`'s `inferType` (for `:=`/value-position typing); Phase E extends
+both to the tuple case.

 ## Log
 - (init) Plan + design doc written; ASM stream opened.
@@ -151,6 +175,12 @@ return tuples) + remaining validation (`%[name]` references a real operand). See
 - (C.0) IR op `inline_asm: InlineAsm` + interp `bailDetail` + print arm + emit
  `@panic` tripwire (Phase D). No behavior change (lowering still bails). Unit
  test `inline_asm op shape`. `zig build test` green (652 corpus, 446 unit).
+- (C.1+D) CODEGEN — `lowerAsmExpr` builds the op (effective names, interned
+  strings, input Refs, 0/1 result type) + `%[name]` validation; `emitInlineAsm`
+  (constraint string + template rewrite + `LLVMGetInlineAsm`/`BuildCall2`, AT&T);
+  `inferType` arm; `LLVMInitializeNativeAsmParser` for the JIT. **Inline asm runs
+  end-to-end.** N>1 bails (Phase E). Locked with 1645 (aarch64 add, runs) + 1646
+  (`:=` binding); updated 1640/1642. `zig build test` green (654 corpus, 446 unit).

 ## Known issues
 - **0137** — `sx run` on a program with no `main` segfaults (unguarded JIT entry
--- a/examples/1640-platform-asm-parse.sx
+++ b/examples/1640-platform-asm-parse.sx
@@ -1,10 +1,10 @@
-// ASM stream Phase A.1 — `asm { … }` PARSES into an AsmExpr: template, named
-// value outputs (`[quot] "={rax}" -> u64`), register-pinned inputs, and a
-// `clobbers(.…)` clause are all accepted with no parse error. Codegen is not
-// implemented yet (the IR op + LLVM emit land in Phases C–E), so lowering bails
-// LOUD + named. This example pins that intermediate diagnostic; a later phase
-// turns it into a running multi-return example. Called from `main` so lowering
-// actually reaches the asm body (lazy lowering skips uncalled functions).
+// ASM stream — `asm { … }` parses + validates the full rich shape: named value
+// outputs (`[quot] "={rax}" -> u64`), register-pinned inputs, and a
+// `clobbers(.…)` clause, all accepted. This is a MULTI-output (tuple-returning)
+// asm, which is deferred to Phase E — so lowering bails LOUD + named with the
+// specific "Phase E" diagnostic (single-output asm already runs; see 1645).
+// Called from `main` so lowering reaches the asm body (lazy lowering skips
+// uncalled functions).
 divmod :: (n: u64, d: u64) -> (quot: u64, rem: u64) {
    return asm {
        "divq %[d]",
--- a/examples/1642-platform-asm-nop-volatile.sx
+++ b/examples/1642-platform-asm-nop-volatile.sx
@@ -1,5 +1,5 @@
-// ASM stream Phase B — the no-output form IS accepted when `volatile` is
-// present: validation passes, and lowering then bails on the not-yet-
-// implemented codegen (Phases C–E). Confirms the volatile rule's positive side.
+// ASM stream — the no-output `volatile` form runs end-to-end: a bare `nop`
+// (no operands, no result) assembles and executes cleanly (exit 0). Confirms
+// the no-output⇒volatile rule's positive side AND the zero-operand emit path.
 nop :: () { asm volatile { "nop" }; }
 main :: () { nop(); }
--- a/examples/1645-platform-asm-aarch64-add.sx
+++ b/examples/1645-platform-asm-aarch64-add.sx
@@ -0,0 +1,10 @@
+// ASM stream Phase D — inline assembly that RUNS end-to-end. An aarch64 `add`
+// with two register-class inputs (`%[a]`, `%[b]`) and a value output (`%[out]`)
+// returned from the function. The `.build` pins aarch64-macOS: on a matching
+// host the runner executes it (exit 42); elsewhere it falls to ir-only mode and
+// asserts the `.ir` snapshot (the inline_asm op + LLVM `call asm` are target-
+// independent in the IR text). Regression for the full lower→emit→JIT path.
+add_asm :: (a: i64, b: i64) -> i64 {
+    return asm { "add %[out], %[a], %[b]", [out] "=r" -> i64, [a] "r" = a, [b] "r" = b };
+}
+main :: () -> i64 { return add_asm(40, 2); }
--- a/examples/1646-platform-asm-value-binding.sx
+++ b/examples/1646-platform-asm-value-binding.sx
@@ -0,0 +1,9 @@
+// ASM stream Phase D — a bare `x := asm { … -> T }` binding (not a direct
+// `return asm`) types correctly: the value output flows through the local and
+// out as the exit code. Regression for the `inferType` `.asm_expr` arm (without
+// it the binding inferred `.unresolved` and silently produced 0). aarch64-pinned
+// via `.build` → runs on a matching host, ir-only elsewhere.
+main :: () -> i64 {
+    x := asm { "mov %[out], #99", [out] "=r" -> i64 };
+    return x;
+}
--- a/examples/expected/1640-platform-asm-parse.stderr
+++ b/examples/expected/1640-platform-asm-parse.stderr
@@ -1,4 +1,4 @@
-error: inline assembly codegen is not yet implemented (ASM stream: lowering + emit land in Phases C–E)
+error: multi-output (tuple-returning) inline assembly is not yet implemented (ASM stream Phase E)
  --> examples/1640-platform-asm-parse.sx:9:12
   |
 9 |     return asm {
--- a/examples/expected/1642-platform-asm-nop-volatile.exit
+++ b/examples/expected/1642-platform-asm-nop-volatile.exit
@@ -1 +1 @@
-1
+0
--- a/examples/expected/1642-platform-asm-nop-volatile.stderr
+++ b/examples/expected/1642-platform-asm-nop-volatile.stderr
@@ -1,5 +1 @@
-error: inline assembly codegen is not yet implemented (ASM stream: lowering + emit land in Phases C–E)
-  --> examples/1642-platform-asm-nop-volatile.sx:4:13
-   |
- 4 | nop :: () { asm volatile { "nop" }; }
-   |             ^^^^^^^^^^^^^^^^^^^^^^
+
--- a/examples/expected/1645-platform-asm-aarch64-add.build
+++ b/examples/expected/1645-platform-asm-aarch64-add.build
@@ -0,0 +1 @@
+{ "target": "macos" }
--- a/examples/expected/1645-platform-asm-aarch64-add.exit
+++ b/examples/expected/1645-platform-asm-aarch64-add.exit
@@ -0,0 +1 @@
+42
--- a/examples/expected/1645-platform-asm-aarch64-add.ir
+++ b/examples/expected/1645-platform-asm-aarch64-add.ir
@@ -0,0 +1,21 @@
+
+; Function Attrs: nounwind
+define internal i64 @add_asm(i64 %0, i64 %1) #0 {
+entry:
+  %alloca = alloca i64, align 8
+  store i64 %0, ptr %alloca, align 8
+  %allocaN = alloca i64, align 8
+  store i64 %1, ptr %allocaN, align 8
+  %load = load i64, ptr %alloca, align 8
+  %loadN = load i64, ptr %allocaN, align 8
+  %asm = call i64 asm "add ${0}, ${1}, ${2}", "=r,r,r"(i64 %load, i64 %loadN)
+  ret i64 %asm
+}
+
+; Function Attrs: nounwind
+define i32 @main() #0 {
+entry:
+  %call = call i64 @add_asm(i64 40, i64 2)
+  %ca.tr = trunc i64 %call to i32
+  ret i32 %ca.tr
+}
--- a/examples/expected/1645-platform-asm-aarch64-add.stderr
+++ b/examples/expected/1645-platform-asm-aarch64-add.stderr
@@ -0,0 +1 @@
+
--- a/examples/expected/1645-platform-asm-aarch64-add.stdout
+++ b/examples/expected/1645-platform-asm-aarch64-add.stdout
@@ -0,0 +1 @@
+
--- a/examples/expected/1646-platform-asm-value-binding.build
+++ b/examples/expected/1646-platform-asm-value-binding.build
@@ -0,0 +1 @@
+{ "target": "macos" }
--- a/examples/expected/1646-platform-asm-value-binding.exit
+++ b/examples/expected/1646-platform-asm-value-binding.exit
@@ -0,0 +1 @@
+99
--- a/examples/expected/1646-platform-asm-value-binding.ir
+++ b/examples/expected/1646-platform-asm-value-binding.ir
@@ -0,0 +1,11 @@
+
+; Function Attrs: nounwind
+define i32 @main() #0 {
+entry:
+  %asm = call i64 asm "mov ${0}, #99", "=r"()
+  %alloca = alloca i64, align 8
+  store i64 %asm, ptr %alloca, align 8
+  %load = load i64, ptr %alloca, align 8
+  %ca.tr = trunc i64 %load to i32
+  ret i32 %ca.tr
+}
--- a/examples/expected/1646-platform-asm-value-binding.stderr
+++ b/examples/expected/1646-platform-asm-value-binding.stderr
@@ -0,0 +1 @@
+
--- a/examples/expected/1646-platform-asm-value-binding.stdout
+++ b/examples/expected/1646-platform-asm-value-binding.stdout
@@ -0,0 +1 @@
+
--- a/llvm_shim.c
+++ b/llvm_shim.c
@@ -14,4 +14,7 @@ void sx_llvm_init_all_targets(void) {
 void sx_llvm_init_native_target(void) {
    LLVMInitializeNativeTarget();
    LLVMInitializeNativeAsmPrinter();
+    // Required for inline assembly: the JIT must assemble the asm template at
+    // run time, which needs the target's asm parser (ASM stream Phase D).
+    LLVMInitializeNativeAsmParser();
 }
--- a/src/backend/llvm/ops.zig
+++ b/src/backend/llvm/ops.zig
@@ -24,6 +24,7 @@ const Call = ir_inst.Call;
 const CallIndirect = ir_inst.CallIndirect;
 const ObjcMsgSend = ir_inst.ObjcMsgSend;
 const JniMsgSend = ir_inst.JniMsgSend;
+const InlineAsm = ir_inst.InlineAsm;
 const BuiltinCall = ir_inst.BuiltinCall;
 const TriOp = ir_inst.TriOp;
 const Branch = ir_inst.Branch;
@@ -774,6 +775,161 @@ pub const Ops = struct {
        self.e.mapRef(result);
    }

+    /// Inline assembly (ASM stream Phase D) — the port of Zig's `airAssembly`.
+    /// Handles 0 value outputs (void) and 1 (scalar); multi-output tuples are
+    /// Phase E (lowering bails before reaching here). Builds the LLVM constraint
+    /// string, rewrites the `%[name]` template, then `LLVMGetInlineAsm` +
+    /// `LLVMBuildCall2`.
+    pub fn emitInlineAsm(self: Ops, instruction: *const Inst, a: InlineAsm) void {
+        const e = self.e;
+        const alloc = e.alloc;
+
+        var n_inputs: usize = 0;
+        for (a.operands) |op| {
+            if (op.role == .input) n_inputs += 1;
+        }
+
+        // Result LLVM type: void (no value output) or the single scalar.
+        const ret_ty = if (instruction.ty == .void) e.cached_void else e.toLLVMType(instruction.ty);
+
+        // One LLVM call param per input operand, in source order.
+        const param_types = alloc.alloc(c.LLVMTypeRef, n_inputs) catch unreachable;
+        defer alloc.free(param_types);
+        const call_args = alloc.alloc(c.LLVMValueRef, n_inputs) catch unreachable;
+        defer alloc.free(call_args);
+        {
+            var i: usize = 0;
+            for (a.operands) |op| {
+                if (op.role != .input) continue;
+                const raw_ty = e.argIRTypeOrFail(op.operand);
+                const llvm_ty = e.toLLVMType(raw_ty);
+                param_types[i] = llvm_ty;
+                call_args[i] = e.coerceArg(e.resolveRef(op.operand), llvm_ty);
+                i += 1;
+            }
+        }
+
+        // ── Constraint string: outputs first, then inputs, then ~{clobber}. ──
+        var cons: std.ArrayList(u8) = .empty;
+        defer cons.deinit(alloc);
+        self.appendAsmConstraints(&cons, a, false); // outputs (out_value / out_place)
+        self.appendAsmConstraints(&cons, a, true); // inputs
+        for (a.clobbers) |cl| {
+            if (cons.items.len != 0) cons.append(alloc, ',') catch unreachable;
+            cons.appendSlice(alloc, "~{") catch unreachable;
+            cons.appendSlice(alloc, e.ir_mod.types.getString(cl)) catch unreachable;
+            cons.append(alloc, '}') catch unreachable;
+        }
+
+        // ── Template rewrite: %[name]->${N}, %%->%, $->$$, %=->${:uid}. ──
+        var rendered: std.ArrayList(u8) = .empty;
+        defer rendered.deinit(alloc);
+        self.renderAsmTemplate(&rendered, a);
+
+        const fn_ty = c.LLVMFunctionType(ret_ty, param_types.ptr, @intCast(n_inputs), 0);
+        const asm_val = c.LLVMGetInlineAsm(
+            fn_ty,
+            rendered.items.ptr,
+            rendered.items.len,
+            cons.items.ptr,
+            cons.items.len,
+            @intFromBool(a.has_side_effects),
+            0, // IsAlignStack
+            c.LLVMInlineAsmDialectATT,
+            0, // CanThrow
+        );
+        const label: [*:0]const u8 = if (instruction.ty == .void) "" else "asm";
+        const result = c.LLVMBuildCall2(e.builder, fn_ty, asm_val, call_args.ptr, @intCast(n_inputs), label);
+        // Always mapRef — the IR Ref counter advances regardless of result type.
+        e.mapRef(result);
+    }
+
+    /// Append the constraint fragments for one role group (outputs or inputs),
+    /// comma-separated, with each operand's `,` rewritten to LLVM's `|`
+    /// (alternative-constraint separator). Mirrors `FuncGen.airAssembly`.
+    fn appendAsmConstraints(self: Ops, cons: *std.ArrayList(u8), a: InlineAsm, inputs: bool) void {
+        const e = self.e;
+        const alloc = e.alloc;
+        for (a.operands) |op| {
+            const is_input = op.role == .input;
+            if (is_input != inputs) continue;
+            if (cons.items.len != 0) cons.append(alloc, ',') catch unreachable;
+            const s = e.ir_mod.types.getString(op.constraint);
+            for (s) |ch| cons.append(alloc, if (ch == ',') '|' else ch) catch unreachable;
+        }
+    }
+
+    /// The positional index of a named operand in the LLVM operand list
+    /// (outputs first, then inputs) — the `N` in `%[name]` → `${N}`. Lowering
+    /// guarantees every `%[name]` names an operand, so callers can assume a hit.
+    fn asmOperandIndex(self: Ops, a: InlineAsm, name: []const u8) ?usize {
+        const e = self.e;
+        var idx: usize = 0;
+        for ([_]bool{ false, true }) |inputs| {
+            for (a.operands) |op| {
+                const is_input = op.role == .input;
+                if (is_input != inputs) continue;
+                if (op.name != .empty and std.mem.eql(u8, e.ir_mod.types.getString(op.name), name)) return idx;
+                idx += 1;
+            }
+        }
+        return null;
+    }
+
+    /// Rewrite the asm template into LLVM form. State machine over the bytes:
+    /// `$`→`$$`, `%%`→`%`, `%=`→`${:uid}`, `%[name]`→`${N}`, `%[name:mod]`→
+    /// `${N:mod}`. Port of `FuncGen.zig`'s template rewriter.
+    fn renderAsmTemplate(self: Ops, out: *std.ArrayList(u8), a: InlineAsm) void {
+        const e = self.e;
+        const alloc = e.alloc;
+        const tmpl = e.ir_mod.types.getString(a.template);
+        var i: usize = 0;
+        while (i < tmpl.len) {
+            const ch = tmpl[i];
+            if (ch == '$') {
+                out.appendSlice(alloc, "$$") catch unreachable;
+                i += 1;
+                continue;
+            }
+            if (ch == '%' and i + 1 < tmpl.len) {
+                const nxt = tmpl[i + 1];
+                if (nxt == '%') {
+                    out.append(alloc, '%') catch unreachable;
+                    i += 2;
+                    continue;
+                }
+                if (nxt == '=') {
+                    out.appendSlice(alloc, "${:uid}") catch unreachable;
+                    i += 2;
+                    continue;
+                }
+                if (nxt == '[') {
+                    const close = std.mem.indexOfScalarPos(u8, tmpl, i + 2, ']').?; // lowering validated
+                    var name = tmpl[i + 2 .. close];
+                    var modifier: ?[]const u8 = null;
+                    if (std.mem.indexOfScalar(u8, name, ':')) |colon| {
+                        modifier = name[colon + 1 ..];
+                        name = name[0..colon];
+                    }
+                    const idx = self.asmOperandIndex(a, name).?; // lowering validated
+                    var buf: [16]u8 = undefined;
+                    const ds = std.fmt.bufPrint(&buf, "{d}", .{idx}) catch unreachable;
+                    out.appendSlice(alloc, "${") catch unreachable;
+                    out.appendSlice(alloc, ds) catch unreachable;
+                    if (modifier) |m| {
+                        out.append(alloc, ':') catch unreachable;
+                        out.appendSlice(alloc, m) catch unreachable;
+                    }
+                    out.append(alloc, '}') catch unreachable;
+                    i = close + 1;
+                    continue;
+                }
+            }
+            out.append(alloc, ch) catch unreachable;
+            i += 1;
+        }
+    }
+
    pub fn emitCall(self: Ops, instruction: *const Inst, call_op: Call) void {
        // Evaluate comptime functions at compile time
        const callee_func = &self.e.ir_mod.functions.items[call_op.callee.index()];
--- a/src/ir/emit_llvm.zig
+++ b/src/ir/emit_llvm.zig
@@ -1563,11 +1563,7 @@ pub const LLVMEmitter = struct {
            // ── Calls ─────────────────────────────────────────────
            .objc_msg_send => |msg| self.ops().emitObjcMsgSend(instruction, msg),
            .jni_msg_send => |msg| self.ops().emitJniMsgSend(instruction, msg),
-            // Tripwire (ASM stream): the IR op exists (Phase C.0) but emit lands
-            // in Phase D. Until then `lowerAsmExpr` still bails, so no inline_asm
-            // op is ever created — reaching here means lowering switched over
-            // before emit was ready. Crash loudly rather than miscompile.
-            .inline_asm => @panic("inline_asm reached LLVM emit before Phase D — lowering must still bail until emitInlineAsm lands"),
+            .inline_asm => |a| self.ops().emitInlineAsm(instruction, a),
            .call => |call_op| self.ops().emitCall(instruction, call_op),
            .call_indirect => |call_op| self.ops().emitCallIndirect(instruction, call_op),

--- a/src/ir/expr_typer.zig
+++ b/src/ir/expr_typer.zig
@@ -398,6 +398,22 @@ pub const ExprTyper = struct {
                }
                break :blk self.l.inferExprType(nc.rhs);
            },
+            // Inline asm result type from the `out_value` operands: 0 → void,
+            // 1 → that operand's type. N>1 (tuple) is Phase E → `.unresolved`
+            // here (lowering bails on it anyway). Mirrors `lowerAsmExpr`, so a
+            // bare `x := asm {…-> T}` binding types correctly.
+            .asm_expr => |ae| blk: {
+                var n_out: usize = 0;
+                var first_out: ?*Node = null;
+                for (ae.operands) |op| {
+                    if (op.role != .out_value) continue;
+                    n_out += 1;
+                    if (first_out == null) first_out = op.payload;
+                }
+                if (n_out == 0) break :blk .void;
+                if (n_out == 1) break :blk self.l.resolveTypeWithBindings(first_out.?);
+                break :blk .unresolved;
+            },
            // Statements don't produce values (`.return_stmt` is handled above
            // as `.noreturn` — it diverges rather than yielding `void`).
            .assignment, .var_decl, .const_decl, .fn_decl,
--- a/src/ir/lower/expr.zig
+++ b/src/ir/lower/expr.zig
@@ -2261,9 +2261,98 @@ pub fn lowerAsmExpr(self: *Lowering, ae: *const ast.AsmExpr, span: ast.Span) Ref
        return self.emitPlaceholder("inline_asm");
    }

-    // Shape is valid — codegen just isn't implemented yet (Phases C–E).
-    diags.addFmt(.err, span, "inline assembly codegen is not yet implemented (ASM stream: lowering + emit land in Phases C–E)", .{});
+    // (4) Every `%[name]` in the template must name an operand (effective name:
+    // explicit `[name]` or auto-derived register). Caught here so emit's
+    // template rewriter never sees an unknown reference. §II.6.
+    {
+        const tmpl = ae.template.data.string_literal.raw;
+        var i: usize = 0;
+        while (i < tmpl.len) : (i += 1) {
+            if (tmpl[i] != '%' or i + 1 >= tmpl.len) continue;
+            const nxt = tmpl[i + 1];
+            if (nxt == '%' or nxt == '=') {
+                i += 1;
+                continue;
+            }
+            if (nxt != '[') continue;
+            const close = std.mem.indexOfScalarPos(u8, tmpl, i + 2, ']') orelse {
+                diags.addFmt(.err, span, "unterminated `%[` in asm template", .{});
                return self.emitPlaceholder("inline_asm");
+            };
+            var ref_name = tmpl[i + 2 .. close];
+            if (std.mem.indexOfScalar(u8, ref_name, ':')) |colon| ref_name = ref_name[0..colon];
+            var found = false;
+            for (ae.operands) |op| {
+                const eff = op.name orelse (pinnedRegister(op.constraint) orelse "");
+                if (eff.len != 0 and std.mem.eql(u8, eff, ref_name)) {
+                    found = true;
+                    break;
+                }
+            }
+            if (!found) {
+                diags.addFmt(.err, span, "asm template references `%[{s}]` but no operand is named `{s}`", .{ ref_name, ref_name });
+                return self.emitPlaceholder("inline_asm");
+            }
+            i = close;
+        }
+    }
+
+    // ── Build the IR op (C.1). D emits 0 or 1 value output; N>1 (tuple result)
+    // is Phase E — bail loudly until then. ──
+    var n_value_outputs: usize = 0;
+    for (ae.operands) |op| {
+        if (op.role == .out_value) n_value_outputs += 1;
+    }
+    if (n_value_outputs > 1) {
+        diags.addFmt(.err, span, "multi-output (tuple-returning) inline assembly is not yet implemented (ASM stream Phase E)", .{});
+        return self.emitPlaceholder("inline_asm");
+    }
+
+    // Result type: 0 outputs → void; 1 → that operand's resolved type. (The
+    // resolver diagnoses an unresolvable type and returns `.unresolved`.)
+    var result_ty: TypeId = .void;
+    for (ae.operands) |op| {
+        if (op.role == .out_value) {
+            result_ty = self.resolveTypeWithBindings(op.payload);
+            break;
+        }
+    }
+    if (result_ty == .unresolved) return self.emitPlaceholder("inline_asm");
+
+    // IR operands, in source order (emit regroups them outputs-first for `${N}`).
+    const ir_ops = self.alloc.alloc(inst_mod.InlineAsm.AsmOperand, ae.operands.len) catch unreachable;
+    for (ae.operands, 0..) |op, i| {
+        // Effective name (design §II.5): explicit `[name]`, else auto-derived
+        // from a `{reg}` pin, else anonymous (`.empty`).
+        const eff_name: []const u8 = op.name orelse (pinnedRegister(op.constraint) orelse "");
+        ir_ops[i] = .{
+            .role = switch (op.role) {
+                .out_value => .out_value,
+                .out_place => .out_place,
+                .input => .input,
+            },
+            .name = if (eff_name.len == 0) types.StringId.empty else self.module.types.internString(eff_name),
+            .constraint = self.module.types.internString(op.constraint),
+            // input → lowered value Ref; output → `Ref.none` (value is the op result).
+            .operand = if (op.role == .input) self.lowerExpr(op.payload) else Ref.none,
+        };
+    }
+
+    const ir_clobbers = self.alloc.alloc(types.StringId, ae.clobbers.len) catch unreachable;
+    for (ae.clobbers, 0..) |cl, i| {
+        ir_clobbers[i] = self.module.types.internString(cl);
+    }
+
+    // Template text RAW — no sx escape processing (matches `#string` literal
+    // bytes; the `%[name]`/`%%`/`$` rewrite happens at emit). §II.11.
+    const template_text = ae.template.data.string_literal.raw;
+
+    return self.builder.emit(.{ .inline_asm = .{
+        .template = self.module.types.internString(template_text),
+        .operands = ir_ops,
+        .clobbers = ir_clobbers,
+        .has_side_effects = ae.is_volatile,
+    } }, result_ty);
 }

 /// If `node` names a `for xs: (*x)` by-ref capture (an `*elem`), returns