fix(0109): hoist all per-instruction allocas to the function entry block

An alloca built at its use site re-executes on every pass through that
block, and LLVM reclaims allocas only at ret — so loop-body locals,
nested-loop index slots, and emitter spill temps (ig.tmp, sret slots, ABI
coercion temps, byval materialization) grew the stack per iteration and
long loops segfaulted on stack exhaustion.

New LLVMEmitter.buildEntryAlloca inserts after existing entry-block
allocas and restores the builder position; every LLVMBuildAlloca site
reachable during instruction emission now routes through it.
Initialization stores stay at the use site (per-iteration re-init is
unchanged), and entry slots become mem2reg-promotable. The 35 .ir
snapshot diffs are pure alloca position moves (type multisets verified
identical per file).

Regression: examples/0047-basic-loop-local-stack-reuse.sx (segfaulted
pre-fix on both the 1M-iteration body-local loop and the 3M-iteration
nested loop).
agra
2026-06-10 17:27:11 +03:00
parent e81780e32e
commit 878c4226a6
43 changed files with 1661 additions and 1468 deletions

@@ -0,0 +1,136 @@
# RESOLVED — 0109: allocas inside loop bodies accumulate stack per iteration
**Root cause:** `emitAlloca` (and ~18 sibling `LLVMBuildAlloca` temp sites in the
LLVM backend) built allocas at the builder's current position. An alloca inside a
loop body re-executes per iteration and LLVM reclaims allocas only at `ret`, so
the frame grew with the trip count: body locals, nested-loop index slots, and
spill temps (`ig.tmp` etc.) all accumulated until long loops segfaulted on stack
exhaustion.

**Fix:** new `LLVMEmitter.buildEntryAlloca` (`src/ir/emit_llvm.zig`) builds every
per-instruction alloca in the function's entry block (after existing entry
allocas, builder position restored); all `LLVMBuildAlloca` sites reachable
during instruction emission in `src/backend/llvm/ops.zig`,
`src/backend/llvm/abi.zig`, and `src/ir/emit_llvm.zig` route through it.
Initialization stores stay at the use site, so per-iteration re-init semantics
are unchanged; entry-block slots are also mem2reg-promotable. ~35 `.ir`
snapshots churned (pure alloca position moves, verified type-multiset-identical
per file).

**Regression test:** `examples/0047-basic-loop-local-stack-reuse.sx` (1M-iteration
body-local loop prints `sum=499999500000`; 3M-iteration nested loop prints
`n=3000000`; both segfaulted pre-fix).

---
# 0109 — allocas inside loop bodies accumulate stack per iteration → segfault on long loops
**Symptom.** Any `alloca` that lands inside a loop's body block executes anew
on every iteration, and LLVM stack allocas are only reclaimed at function
return — so the frame grows monotonically with the trip count. Observed: a
1M-iteration loop with a body-local array segfaults (stack overflow, fault
address at the guard page); so does a 3M-iteration nested loop with **no user
locals at all** (the inner loop's hidden index slot is itself a body-block
alloca of the outer loop). Expected: loop-local storage is reused across
iterations; stack usage is static per frame regardless of trip count.

This hits three shapes, all confirmed:

1. user locals declared in a loop body (`buf : [128]s64 = ---;`),
2. nested loops (inner `for`'s `idx_slot` alloca sits in the outer body),
3. compiler temporaries spilled in the body (e.g. `index_get`'s `ig.tmp`;
   see issue 0110 for the for-over-array case specifically).
## Reproduction
Repro A — body local (`issues/0109-loop-body-alloca-stack-growth.sx`):
```sx
#import "modules/std.sx";
main :: () -> s32 {
    sum := 0;
    for 0..1000000: (i) {
        buf : [128]s64 = ---;
        buf[0] = i;
        sum += buf[0];
    }
    print("sum={}\n", sum);
    0
}
```
- **Observed**: `Segmentation fault at address 0x16e70ffd0` (guard page).
With `0..1000` instead it prints `sum=499500` and exits 0 — the program is
correct, only the stack accumulation kills it.
- **Expected**: prints `sum=499999500000`, exit 0, at any trip count.

Repro B, pure nested loops with zero user locals:
```sx
#import "modules/std.sx";
main :: () -> s32 {
    n := 0;
    for 0..3000000: (i) {
        for 0..1: (j) { n += 1; }
    }
    print("n={}\n", n);
    0
}
```
- **Observed**: segfault. **Expected**: `n=3000000`, exit 0.

The emitted IR shows the cause directly (`sx ir`, body of repro A):
```llvm
for.body.1:
  %alloca2 = alloca [128 x i64], align 8   ; fresh 1KB every iteration
  ...
  %ig.tmp = alloca [128 x i64], align 8    ; plus a 1KB spill temp
```
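
The IR above shows at least two 1 KB slots allocated per iteration, so the 1M-iteration loop in repro A asks for roughly 2 GB of stack. The same accumulation can be observed outside the compiler with C's `alloca`, which is likewise released only when the function returns. The sketch below is an analogy, not project code: the function names are hypothetical, and it assumes a POSIX-like toolchain that provides `<alloca.h>`. It counts how many distinct addresses a loop sees when the allocation sits in the body versus when it is hoisted ahead of the loop.

```c
#include <alloca.h>   /* non-standard header; assumed available (glibc, macOS) */
#include <assert.h>
#include <stdint.h>

enum { MAX_TRIPS = 64, SLOT_BYTES = 1024 };

/* Record addr in seen[] unless it is already present; return the new count. */
static int record_address(uintptr_t *seen, int count, uintptr_t addr) {
    for (int k = 0; k < count; k++) {
        if (seen[k] == addr) return count;
    }
    seen[count] = addr;
    return count + 1;
}

/* Body-block shape: the allocation executes on every iteration and nothing
   is released before the function returns, so each trip takes a new slot. */
static int distinct_slots_in_body(int trips) {
    assert(trips >= 0 && trips <= MAX_TRIPS);
    uintptr_t seen[MAX_TRIPS];
    int distinct = 0;
    for (int i = 0; i < trips; i++) {
        volatile char *buf = alloca(SLOT_BYTES);
        buf[0] = (char)i;  /* touch the slot so it is really used */
        distinct = record_address(seen, distinct, (uintptr_t)buf);
    }
    return distinct;
}

/* Entry-block shape: one slot reserved once, re-initialized on every trip. */
static int distinct_slots_hoisted(int trips) {
    assert(trips >= 0 && trips <= MAX_TRIPS);
    uintptr_t seen[MAX_TRIPS];
    int distinct = 0;
    volatile char *buf = alloca(SLOT_BYTES);  /* hoisted ahead of the loop */
    for (int i = 0; i < trips; i++) {
        buf[0] = (char)i;  /* the per-iteration store stays in the body */
        distinct = record_address(seen, distinct, (uintptr_t)buf);
    }
    return distinct;
}
```

Calling `distinct_slots_in_body(8)` reports 8 distinct addresses, while `distinct_slots_hoisted(8)` reports 1. That is the difference between a frame that grows with the trip count and one whose size is fixed.
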
## Root cause (suspected area)
`Builder.alloca` (`src/ir/module.zig` ~474) emits the `.alloca` instruction
into the current block, and the LLVM emitter (`src/backend/llvm/ops.zig`
`emitAlloca` ~327) builds `LLVMBuildAlloca` at the current insertion point —
so loop-body allocas are *executed* per iteration. LLVM only treats
entry-block allocas as static frame slots (and mem2reg/SROA only promote
those); a non-entry alloca re-executes and grows the stack each time, until
`ret`.

The standard fix (what clang does): emit **all** static allocas into the
function's entry block. Least-invasive locus is the emitter — in
`emitAlloca`, save the current insertion point, position the builder at the
entry block's first non-alloca instruction (or end of entry if empty), build
the alloca there, restore the position, `mapRef` as before. The IR shape and
the interpreter are untouched. All sx allocas are statically sized (TypeId),
so every one is hoistable.
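
The save / reposition / build / restore sequence described above can be sketched with a toy model. This is not the project's Zig code and not the LLVM-C API: `Block`, `Builder`, `build_entry_alloca`, and the other names are hypothetical stand-ins, with a block reduced to an array of opcodes. The comments note which LLVM-C call each step would roughly correspond to in a real emitter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { OP_ALLOCA, OP_STORE, OP_BR, OP_OTHER } Opcode;

enum { BLOCK_CAP = 32 };

/* Toy basic block: just an ordered list of opcodes. */
typedef struct {
    Opcode ops[BLOCK_CAP];
    size_t len;
} Block;

/* Toy builder: remembers only which block it currently appends to. */
typedef struct {
    Block *block;
} Builder;

/* Insert op at index `at`, shifting later instructions down by one. */
static void block_insert(Block *b, size_t at, Opcode op) {
    assert(b->len < BLOCK_CAP && at <= b->len);
    memmove(&b->ops[at + 1], &b->ops[at], (b->len - at) * sizeof b->ops[0]);
    b->ops[at] = op;
    b->len++;
}

/* Walk past the leading allocas; returns entry->len if the block is empty or
   all allocas (LLVM-C: LLVMGetFirstInstruction / LLVMGetNextInstruction,
   checking LLVMGetInstructionOpcode against LLVMAlloca). */
static size_t first_non_alloca(const Block *entry) {
    size_t i = 0;
    while (i < entry->len && entry->ops[i] == OP_ALLOCA) i++;
    return i;
}

/* Build an alloca in the entry block and leave the builder where it was.
   Returns the index of the new slot within the entry block. */
static size_t build_entry_alloca(Builder *bld, Block *entry) {
    Block *saved = bld->block;            /* LLVM-C: LLVMGetInsertBlock */
    size_t at = first_non_alloca(entry);
    bld->block = entry;                   /* LLVM-C: LLVMPositionBuilderBefore */
    block_insert(entry, at, OP_ALLOCA);   /* LLVM-C: LLVMBuildAlloca */
    bld->block = saved;                   /* LLVM-C: LLVMPositionBuilderAtEnd */
    return at;
}

/* The initialization store is appended at the use site, not in entry. */
static void build_store(Builder *bld) {
    block_insert(bld->block, bld->block->len, OP_STORE);
}
```

In this model, an emitter working inside a loop body calls `build_entry_alloca` for the slot and then `build_store` for its initializer. The slot lands after the existing entry allocas, while the store still lands in the body block, which is the property the per-iteration re-init semantics rely on.
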
## Investigation prompt (paste into a fresh session)
> Fix issue 0109: loop-body allocas grow the stack per iteration and long
> loops segfault. In `src/backend/llvm/ops.zig` `emitAlloca` (~327), hoist the
> alloca to the current function's entry block: get the function via the
> current insert block's parent, position the builder before the entry
> block's first non-alloca instruction (`LLVMGetEntryBasicBlock` +
> `LLVMGetFirstInstruction` walk past `LLVMAlloca` opcodes — same positioning
> pattern as `injectCtorIntoMain` in `src/ir/emit_llvm.zig` ~466), build the
> alloca + `mapRef`, then restore the previous insertion point
> (`LLVMGetInsertBlock` before / `LLVMPositionBuilderAtEnd` after). Audit the
> other in-place `LLVMBuildAlloca` temporaries in `src/ir/emit_llvm.zig`
> (`ba.tmp`, `abi.tmp`, `ig.tmp`, etc. — grep `BuildAlloca`) and route the
> ones reachable inside loops through the same hoist helper.
>
> Semantics note: per-iteration re-zeroing must not regress — initialization
> stores (e.g. `store undef` / `= .{...}` inits) stay where the decl was, in
> the body block; only the `alloca` itself moves to entry.
>
> Verify: both repros in `issues/0109-loop-body-alloca-stack-growth.md` (A is
> `issues/0109-loop-body-alloca-stack-growth.sx`) now print
> `sum=499999500000` / `n=3000000` and exit 0; `sx ir` on repro A shows no
> `alloca` inside `for.body.*`. Then `zig build && zig build test && bash
> tests/run_examples.sh` — any `.ir` snapshot churn from alloca placement must
> be reviewed (`git diff examples/expected/`) before `--update`. Promote a
> trip-count-bounded variant (e.g. 200k iterations, small buf) to
> `examples/00xx-basic-loop-local-stack-reuse.sx` as the pinned regression.