F2.2: std/json reader — explicit-alloc parse with error surfacing

Add the JSON reader (parser) to library/modules/std/json.sx, the inverse
of the F2.1 writer over the same value model: insertion-ordered objects,
arrays, strings (full unescaping incl. \uXXXX + surrogate pairs), s64
integers, bool, null.

Heap discipline (binding): exactly two allocation kinds, both through the
EXPLICIT `alloc` parameter, never the implicit context allocator —
composite backing stores (Array/Object.items via add/put) and decoded
escaped-string buffers (bounded by the raw span). String values that
contain no escapes are zero-copy VIEWS into the input buffer (valid only
while it lives); scalars carry no heap.
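
The view-vs-decoded split can be modelled outside sx. What follows is a minimal Python sketch, not the module's code: `read_string_body` and `SIMPLE_ESCAPES` are hypothetical names, a `memoryview` stands in for a zero-copy slice of the input, and `\uXXXX` escapes are left out to keep it short.

```python
# Single-character escapes only; \uXXXX is deliberately omitted from this sketch.
SIMPLE_ESCAPES = {
    0x22: 0x22, 0x5C: 0x5C, 0x2F: 0x2F,   # \"  \\  \/
    0x62: 0x08, 0x66: 0x0C, 0x6E: 0x0A,   # \b  \f  \n
    0x72: 0x0D, 0x74: 0x09,               # \r  \t
}


def read_string_body(src: bytes, start: int):
    """Read a string whose first body byte is src[start].

    Returns (value, next_pos). The value is a memoryview into src when the
    body has no escapes, otherwise a freshly built bytes object.
    """
    # Pass 1: find the closing quote and note whether any escape occurs.
    i = start
    has_escape = False
    while i < len(src) and src[i] != 0x22:
        if src[i] == 0x5C:
            has_escape = True
            i += 1                      # skip the escaped byte
        i += 1
    if i >= len(src):
        raise ValueError("unterminated string")
    raw = memoryview(src)[start:i]
    if not has_escape:
        return raw, i + 1               # zero-copy view, no new buffer

    # Pass 2: decode into a buffer sized to the raw span (escapes only shrink).
    out = bytearray(len(raw))
    di = 0
    k = 0
    while k < len(raw):
        b = raw[k]
        if b == 0x5C:
            k += 1                      # pass 1 guarantees this is in range
            esc = raw[k]
            if esc not in SIMPLE_ESCAPES:
                raise ValueError("escape not handled in this sketch")
            b = SIMPLE_ESCAPES[esc]
        out[di] = b
        di += 1
        k += 1
    return bytes(out[:di]), i + 1
```

In this model the check `value.obj is src` plays the role of the pointer-containment test: it holds for the view and cannot hold for the decoded copy.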

Failure surfacing (hard contract): malformed input raises a meaningful
JsonParseError variant (UnexpectedToken / UnexpectedEnd / BadEscape /
BadNumber / TrailingGarbage) on the error channel, never a bogus value.
Trailing non-whitespace is TrailingGarbage; fractions/exponents,
out-of-s64 magnitudes, and leading zeros are BadNumber. Number
accumulation runs in negative space so s64 MIN parses exactly.
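
The negative-space accumulation can be illustrated with a small model. This is a Python sketch under stated assumptions, not the sx implementation: `parse_s64` and `BadNumber` are hypothetical names, the s64 bounds are explicit constants because Python integers never overflow, and any trailing character is rejected here (the reader itself only rejects `.`, `e`, and `E` at this point and leaves other trailing bytes to its caller).

```python
S64_MIN = -(2**63)
# S64_MIN / 10 truncated toward zero; the remainder is -8.
MIN_DIV10 = -((2**63) // 10)


class BadNumber(ValueError):
    """Hypothetical stand-in for the reader's BadNumber variant."""


def parse_s64(text: str) -> int:
    pos = 1 if text.startswith("-") else 0
    neg = pos == 1
    start = pos
    val = 0                                   # always <= 0 while accumulating
    while pos < len(text) and "0" <= text[pos] <= "9":
        d = ord(text[pos]) - ord("0")
        # Check BEFORE multiplying: would val * 10 - d fall below S64_MIN?
        if val < MIN_DIV10 or (val == MIN_DIV10 and d > 8):
            raise BadNumber("outside s64")
        val = val * 10 - d
        pos += 1
    digits = pos - start
    if digits == 0:
        raise BadNumber("no digits")
    if text[start] == "0" and digits > 1:
        raise BadNumber("leading zero")
    if pos < len(text):
        raise BadNumber("fraction, exponent, or other trailing characters")
    if not neg:
        if val == S64_MIN:
            raise BadNumber("|MIN| is not a positive s64")
        val = -val
    return val
```

Accumulating toward the negative side matters because the negative range of s64 is one larger than the positive range: the digits of MIN fit as a negative running value but would overflow as a positive one.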

examples/0714-modules-json-reader.sx asserts the parsed structure
(insertion order, every kind), proves the view-vs-decoded heap split by
pointer containment, round-trips back through the writer byte-for-byte,
decodes a surrogate-pair into 4 UTF-8 bytes, and checks every malformed
variant.
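
The surrogate-pair case can be checked by hand with a small model. This Python sketch mirrors the arithmetic described for the reader; `combine_surrogates` is a hypothetical helper name, and `encode_utf8` here returns `bytes` rather than writing into a caller-supplied buffer.

```python
def combine_surrogates(hi: int, lo: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= hi <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= lo <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def encode_utf8(cp: int) -> bytes:
    """Encode a non-surrogate code point (0..0x10FFFF) as 1 to 4 UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# \uD83D\uDE00 combines to U+1F600, which encodes as four bytes.
cp = combine_surrogates(0xD83D, 0xDE00)
encoded = encode_utf8(cp)
```

The 12 raw bytes of the escaped pair become 4 decoded bytes, consistent with the claim that the decode buffer is bounded by the raw span.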

Filed issues/0078: a string `==` (or any sub-CFG operand) used in a
short-circuit `and`/`or` emits invalid LLVM IR (stale PHI predecessor),
hit while writing the example's assertions and worked around there by not
combining comparisons with `and`/`or`. src/ untouched.
agra
2026-06-04 01:41:33 +03:00
parent 295d95d51a
commit 88be541778
6 changed files with 615 additions and 4 deletions


@@ -1,12 +1,13 @@
 // =====================================================================
-// json.sx — JSON value model + writer (stable key order), pure sx.
+// json.sx — JSON value model + writer + reader (stable key order), pure sx.
 //
-// This module delivers the JSON VALUE MODEL and the WRITER. The reader
-// (parser) lands separately; this file never reads JSON text.
+// This module delivers the JSON VALUE MODEL, the WRITER, and the READER
+// (parser). The model is built once and shared by both directions.
 //
 // NUMBERS ARE INTEGERS ONLY (s64) for this milestone — there is no
 // fraction or exponent. A JSON value is one of: null, bool, integer,
-// string, array, object.
+// string, array, object. The reader REJECTS a fraction or exponent
+// (`error.BadNumber`) rather than silently truncating it.
 //
 // STABLE KEY ORDER: an object is NOT a hash map. It is an ORDERED list
 // of (key, value) pairs that preserves INSERTION ORDER. Keys are never
@@ -333,3 +334,325 @@ write_to_file :: (v: Value, file: *File, staging: []u8) -> !JsonError {
    try sink.flush();
    return;
}
// ── Reader (parser) ───────────────────────────────────────────────────
//
// `parse(src, alloc)` turns a JSON document in `src` into the value model
// above. It is the inverse of the writer for the v0 scope: objects (in
// INSERTION ORDER), arrays, strings (with full unescaping incl. \uXXXX
// and surrogate pairs), s64 integers, bool, null.
//
// FAILURE SURFACING (hard contract): every malformed input raises on the
// error channel (`!JsonParseError`) — never a bogus or default value.
// Trailing non-whitespace after a complete value is `TrailingGarbage`.
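// `pos` (the parser cursor) marks where the failure was detected, except
// for string errors, where it stays at the start of the string body
// because the scan and decode passes advance a local index.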
//
// NOT SUPPORTED (rejected, not silently accepted): a fraction or exponent
// in a number (`1.5`, `1e9`) → `BadNumber`; a number outside s64 →
// `BadNumber`; a leading-zero integer (`01`) → `BadNumber`.
//
// LENIENCY: unescaped raw control bytes (< 0x20) inside a string are
// passed through verbatim rather than rejected. This is a minimal-reader
// shortcut that the manifest / db.json inputs never exercise.
//
// HEAP DISCIPLINE (binding, see heap-discipline.md). Exactly two kinds of
// allocation happen, both through the EXPLICIT `alloc` parameter, never
// the implicit context allocator:
// 1. Composite backing stores — `Array.items` / `Object.items` grow via
// `arr.add(.., alloc)` / `obj.put(.., alloc)` (genuinely unbounded
// children; mirrors `List`).
// 2. DECODED strings — a string containing escapes must be un-escaped
// into fresh storage; that buffer is `alloc`-ed (bounded by the raw
// span, since every escape shrinks). A string with NO escapes is a
// zero-copy VIEW into `src`; scalars carry no heap.
//
// OWNERSHIP / LIFETIME: string values that contain no escapes are SLICES
// into `src` and are valid only while `src` lives. Everything else (nodes,
// decoded strings) is owned by `alloc`; free it all by dropping that
// allocator (e.g. an Arena `deinit`). A typical caller parses under an
// Arena and keeps `src` alive for as long as the tree is used.
//
// gpa := GPA.init();
// arena := Arena.init(xx gpa, 4096);
// defer arena.deinit();
// root := parse(src, xx arena)!; // composites + decoded strings in arena
// The reader's failure contract. Meaningful variants so a caller can tell
// a truncated document from a bad escape from trailing junk.
JsonParseError :: error { UnexpectedToken, UnexpectedEnd, BadEscape, BadNumber, TrailingGarbage }
// Lowercase/uppercase hex nibble value (0..15) of an ASCII byte; a non-hex
// byte in a `\uXXXX` escape is a `BadEscape`.
hex_value :: (c: u8) -> (s64, !JsonParseError) {
    if c >= 48 and c <= 57 { return (cast(s64) c) - 48; } // '0'..'9'
    if c >= 97 and c <= 102 { return (cast(s64) c) - 97 + 10; } // 'a'..'f'
    if c >= 65 and c <= 70 { return (cast(s64) c) - 65 + 10; } // 'A'..'F'
    raise error.BadEscape;
}
// Encode code point `cp` (already validated 0..0x10FFFF, non-surrogate) as
// UTF-8 into `out`, returning the byte count (1..4). No bounds check: the
// decode buffer is sized to the raw escaped span, which always dominates.
encode_utf8 :: (cp: s64, out: [*]u8) -> s64 {
    if cp < 0x80 {
        out[0] = xx cp;
        return 1;
    }
    if cp < 0x800 {
        out[0] = xx (0xC0 | (cp >> 6));
        out[1] = xx (0x80 | (cp & 0x3F));
        return 2;
    }
    if cp < 0x10000 {
        out[0] = xx (0xE0 | (cp >> 12));
        out[1] = xx (0x80 | ((cp >> 6) & 0x3F));
        out[2] = xx (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = xx (0xF0 | (cp >> 18));
    out[1] = xx (0x80 | ((cp >> 12) & 0x3F));
    out[2] = xx (0x80 | ((cp >> 6) & 0x3F));
    out[3] = xx (0x80 | (cp & 0x3F));
    return 4;
}
// The cursor over the input. `src` is borrowed (never written); `pos` is
// the running offset and doubles as the failure position; `alloc` is the
// EXPLICIT allocator for composites + decoded strings.
Parser :: struct {
    src: string;
    pos: s64 = 0;
    alloc: Allocator;
    // Advance past JSON whitespace (space / tab / LF / CR).
    skip_ws :: (self: *Parser) {
        while self.pos < self.src.len {
            c := self.src[self.pos];
            if c == 32 or c == 9 or c == 10 or c == 13 { self.pos += 1; }
            else { break; }
        }
    }
    // Consume an exact literal (`true` / `false` / `null`) or fail.
    expect_lit :: (self: *Parser, lit: string) -> !JsonParseError {
        if self.pos + lit.len > self.src.len { raise error.UnexpectedEnd; }
        i := 0;
        while i < lit.len {
            if self.src[self.pos + i] != lit[i] { raise error.UnexpectedToken; }
            i += 1;
        }
        self.pos += lit.len;
        return;
    }
    // Read 4 hex digits at `i` (which must lie within [.., end)); returns
    // the 16-bit value. Fewer than 4 digits before `end` is a BadEscape.
    read_hex4 :: (self: *Parser, i: s64, end: s64) -> (s64, !JsonParseError) {
        if i + 4 > end { raise error.BadEscape; }
        v := 0;
        k := 0;
        while k < 4 {
            v = v * 16 + (try hex_value(self.src[i + k]));
            k += 1;
        }
        return v;
    }
    // Decode the escaped string body in [start, end) into `out`, returning
    // the decoded byte length. Pass 1 (in parse_string) guarantees there is
    // no dangling backslash, so the byte after every `\` is in range.
    decode_into :: (self: *Parser, start: s64, end: s64, out: [*]u8) -> (s64, !JsonParseError) {
        di := 0;
        i := start;
        while i < end {
            c := self.src[i];
            if c == 92 { // backslash
                i += 1;
                e := self.src[i];
                if e == 34 { out[di] = 34; di += 1; i += 1; } // \"
                else if e == 92 { out[di] = 92; di += 1; i += 1; } // \\
                else if e == 47 { out[di] = 47; di += 1; i += 1; } // \/
                else if e == 98 { out[di] = 8; di += 1; i += 1; } // \b
                else if e == 102 { out[di] = 12; di += 1; i += 1; } // \f
                else if e == 110 { out[di] = 10; di += 1; i += 1; } // \n
                else if e == 114 { out[di] = 13; di += 1; i += 1; } // \r
                else if e == 116 { out[di] = 9; di += 1; i += 1; } // \t
                else if e == 117 { // \uXXXX
                    hpos := i + 1;
                    u := try self.read_hex4(hpos, end);
                    if u >= 0xD800 and u <= 0xDBFF {
                        // high surrogate: require a following \uYYYY low surrogate
                        lpos := hpos + 4;
                        if lpos + 2 > end { raise error.BadEscape; }
                        if self.src[lpos] != 92 or self.src[lpos + 1] != 117 { raise error.BadEscape; }
                        lo := try self.read_hex4(lpos + 2, end);
                        if lo < 0xDC00 or lo > 0xDFFF { raise error.BadEscape; }
                        cp := 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                        di += encode_utf8(cp, @out[di]);
                        i = lpos + 6;
                    } else {
                        if u >= 0xDC00 and u <= 0xDFFF { raise error.BadEscape; } // lone low surrogate
                        di += encode_utf8(u, @out[di]);
                        i = hpos + 4;
                    }
                }
                else { raise error.BadEscape; }
            } else {
                out[di] = c; di += 1; i += 1;
            }
        }
        return di;
    }
    // Parse a string starting at the opening quote (current `pos`). Returns
    // a zero-copy VIEW into `src` when the body has no escapes; otherwise
    // decodes into an `alloc`-ed buffer (bounded by the raw span). `pos`
    // ends just past the closing quote.
    parse_string :: (self: *Parser) -> (string, !JsonParseError) {
        self.pos += 1; // consume opening quote
        start := self.pos;
        has_escape := false;
        i := start;
        while i < self.src.len {
            c := self.src[i];
            if c == 34 { break; } // closing quote
            if c == 92 { // backslash escapes the next byte
                has_escape = true;
                i += 1;
                if i >= self.src.len { raise error.UnexpectedEnd; }
            }
            i += 1;
        }
        if i >= self.src.len { raise error.UnexpectedEnd; } // unterminated
        end := i;
        if !has_escape {
            self.pos = end + 1;
            return string.{ ptr = @self.src[start], len = end - start };
        }
        raw_len := end - start; // decoded length <= raw_len (escapes shrink)
        out : [*]u8 = xx self.alloc.alloc(raw_len);
        dlen := try self.decode_into(start, end, out);
        self.pos = end + 1;
        return string.{ ptr = out, len = dlen };
    }
    // Parse an s64 integer (optional '-', then digits). Rejects leading
    // zeros, a fraction/exponent tail, and any value outside s64, all as
    // `BadNumber`. Accumulates in NEGATIVE space so s64 MIN parses exactly.
    parse_number :: (self: *Parser) -> (s64, !JsonParseError) {
        // s64 bounds, built by subtraction because |MIN| is not a
        // representable positive s64 literal. `min_div10` is `MIN / 10`
        // truncated toward zero (remainder -8) and serves as the digit
        // loop's overflow threshold.
        s64_min := 0 - 9223372036854775807 - 1;
        min_div10 := 0 - 922337203685477580;
        neg := false;
        if self.src[self.pos] == 45 { neg = true; self.pos += 1; } // '-'
        if self.pos >= self.src.len { raise error.BadNumber; } // '-' with no digit
        dstart := self.pos;
        c0 := self.src[self.pos];
        if c0 < 48 or c0 > 57 { raise error.BadNumber; }
        val : s64 = 0;
        digits := 0;
        while self.pos < self.src.len {
            c := self.src[self.pos];
            if c < 48 or c > 57 { break; }
            d := (cast(s64) c) - 48;
            // Reject before multiplying if val * 10 - d would pass below MIN.
            if val < min_div10 { raise error.BadNumber; }
            if val == min_div10 and d > 8 { raise error.BadNumber; }
            val = val * 10 - d;
            digits += 1;
            self.pos += 1;
        }
        if self.src[dstart] == 48 and digits > 1 { raise error.BadNumber; } // no leading zeros
        if self.pos < self.src.len {
            nc := self.src[self.pos];
            if nc == 46 or nc == 101 or nc == 69 { raise error.BadNumber; } // '.' / 'e' / 'E' (ints only)
        }
        if !neg {
            if val == s64_min { raise error.BadNumber; } // |MIN| not representable as +s64
            val = 0 - val;
        }
        return val;
    }
    // Parse an array starting at '['. Builds an `Array` through `alloc`.
    parse_array :: (self: *Parser) -> (Value, !JsonParseError) {
        self.pos += 1; // consume '['
        arr : Array = .{};
        self.skip_ws();
        if self.pos < self.src.len and self.src[self.pos] == 93 { // empty ']'
            self.pos += 1;
            return Value.array(arr);
        }
        loop := true;
        while loop {
            v := try self.parse_value();
            arr.add(v, self.alloc);
            self.skip_ws();
            if self.pos >= self.src.len { raise error.UnexpectedEnd; }
            c := self.src[self.pos];
            if c == 44 { self.pos += 1; } // ',' more
            else if c == 93 { self.pos += 1; loop = false; } // ']' done
            else { raise error.UnexpectedToken; }
        }
        return Value.array(arr);
    }
    // Parse an object starting at '{'. Keys must be strings; insertion
    // order is preserved (duplicate keys are kept, never merged).
    parse_object :: (self: *Parser) -> (Value, !JsonParseError) {
        self.pos += 1; // consume '{'
        obj : Object = .{};
        self.skip_ws();
        if self.pos < self.src.len and self.src[self.pos] == 125 { // empty '}'
            self.pos += 1;
            return Value.object(obj);
        }
        loop := true;
        while loop {
            self.skip_ws();
            if self.pos >= self.src.len { raise error.UnexpectedEnd; }
            if self.src[self.pos] != 34 { raise error.UnexpectedToken; } // key must be a string
            key := try self.parse_string();
            self.skip_ws();
            if self.pos >= self.src.len { raise error.UnexpectedEnd; }
            if self.src[self.pos] != 58 { raise error.UnexpectedToken; } // ':'
            self.pos += 1;
            v := try self.parse_value();
            obj.put(key, v, self.alloc);
            self.skip_ws();
            if self.pos >= self.src.len { raise error.UnexpectedEnd; }
            c := self.src[self.pos];
            if c == 44 { self.pos += 1; } // ',' more
            else if c == 125 { self.pos += 1; loop = false; } // '}' done
            else { raise error.UnexpectedToken; }
        }
        return Value.object(obj);
    }
    // Parse any single value (after skipping leading whitespace).
    parse_value :: (self: *Parser) -> (Value, !JsonParseError) {
        self.skip_ws();
        if self.pos >= self.src.len { raise error.UnexpectedEnd; }
        c := self.src[self.pos];
        if c == 123 { return try self.parse_object(); } // '{'
        if c == 91 { return try self.parse_array(); } // '['
        if c == 34 { s := try self.parse_string(); return Value.str(s); } // '"'
        if c == 116 { try self.expect_lit("true"); return Value.bool_(true); } // 't'
        if c == 102 { try self.expect_lit("false"); return Value.bool_(false); } // 'f'
        if c == 110 { try self.expect_lit("null"); nv : Value = .null_; return nv; } // 'n'
        if c == 45 or (c >= 48 and c <= 57) { n := try self.parse_number(); return Value.int_(n); } // '-' / digit
        raise error.UnexpectedToken;
    }
}
// Parse a complete JSON document from `src` into the value model, using
// `alloc` for composite nodes and decoded (escaped) strings. String values
// that contain no escapes are VIEWS into `src` and are valid only while
// `src` lives. Trailing non-whitespace after the value raises
// `error.TrailingGarbage`.
parse :: (src: string, alloc: Allocator) -> (Value, !JsonParseError) {
    p := Parser.{ src = src, alloc = alloc };
    v := try p.parse_value();
    p.skip_ws();
    if p.pos != p.src.len { raise error.TrailingGarbage; }
    return v;
}