P2.2: fix put_file content-addressing — hash the published bytes (single source read)

put_file hashed the file at the source path, then copied the source again:
two reads. A source mutated between them would publish bytes whose digest
!= the returned key, breaking the content-addressed invariant. put_file now
copies the source once into a provisional staging file, derives the key
from the SHA-256 of that staged file (the exact bytes published), then
either dedups against an existing object or atomically renames the staged
file into place. This guarantees key == digest(published object) with a
single source read.

Extends the acceptance test: re-hashes the stored object and asserts it equals
the returned key (and std.hash / shasum of the fixture), asserts cross-path
dedup (put_file and put_bytes of identical content share one object), and
asserts the staging temp is cleaned up on both the success and dedup paths.
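
The copy-once, hash-the-staged-copy flow can be sketched in Python as follows. This is an illustrative stand-in, not the store's actual code: the `Store` class, `sha256_of_file` helper, and directory layout are assumptions that only mirror the names used in this commit.

```python
import hashlib
import itertools
import os
import shutil
import tempfile


def sha256_of_file(path):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


class Store:
    """Hypothetical Python stand-in for the store touched by this commit."""

    def __init__(self, root):
        self.objects = os.path.join(root, "objects")
        self.staging = os.path.join(root, "staging")
        os.makedirs(self.objects, exist_ok=True)
        os.makedirs(self.staging, exist_ok=True)
        self._seq = itertools.count(1)  # names provisional staging files

    def put_file(self, src):
        # The only read of the source: copy it to staging/incoming-<n>.
        staged = os.path.join(self.staging, f"incoming-{next(self._seq)}")
        shutil.copyfile(src, staged)
        # Derive the key from the staged copy, i.e. the bytes to be published.
        key = sha256_of_file(staged)
        final = os.path.join(self.objects, key)
        if os.path.exists(final):
            os.remove(staged)  # dedup path: drop the staged copy
        else:
            os.replace(staged, final)  # rename within one filesystem
        return key


# Usage: the returned key matches the digest of the stored object.
root = tempfile.mkdtemp()
fixture = os.path.join(root, "fixture.bin")
with open(fixture, "wb") as f:
    f.write(b"example bytes")
store = Store(os.path.join(root, "store"))
key = store.put_file(fixture)
assert key == sha256_of_file(os.path.join(store.objects, key))
```

Because the key is computed from the staged copy, a later change to the source cannot make the published bytes disagree with the returned key. The sketch leaves out error handling and concurrent puts, which the real store has to deal with.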
agra
2026-06-06 00:47:45 +03:00
parent 68c002ab06
commit 3bc019c736
2 changed files with 78 additions and 19 deletions

@@ -6,13 +6,20 @@
 // `<root>/objects/<digest>`. This key is what populates an
 // Artifact.sha256 / Artifact.storage_key at the domain boundary.
 //
-// Publish is a two-phase write: bytes are first written to
-// `<root>/staging/<key>`, then atomically renamed into
-// `<root>/objects/<key>`. The rename is the only operation that makes an
-// object visible at its final path, so an interrupted or failed write
-// never leaves a torn object — a half-written staging file is not
-// reachable as `objects/<key>`. Staging and objects share `<root>` (one
-// filesystem), so the rename is atomic.
+// Publish is a two-phase write: bytes are first written under
+// `<root>/staging/`, then atomically renamed into `<root>/objects/<key>`.
+// The rename is the only operation that makes an object visible at its
+// final path, so an interrupted or failed write never leaves a torn
+// object — a half-written staging file is not reachable as
+// `objects/<key>`. Staging and objects share `<root>` (one filesystem),
+// so the rename is atomic.
+//
+// `put_bytes` stages the in-memory bytes at `staging/<key>` (the key is
+// known up front). `put_file` reads its source exactly once: it copies
+// the source into a provisional `staging/incoming-<n>`, then derives the
+// key from the SHA-256 of THAT staged file — the exact bytes that get
+// published. So `key == digest(published object)` holds even if the
+// source is mutated after the copy; the source is never read twice.
 //
 // Dedup: identical bytes hash to the same key, so a put whose object
 // already exists returns immediately without re-staging or rewriting.
@@ -56,9 +63,12 @@ digest_of_file :: (path: string) -> (string, !StoreErr) {
 Store :: struct {
     root: string;
+    // Monotonic per-store counter naming `put_file`'s provisional staging
+    // files, so concurrent file puts don't clobber each other's temp copy.
+    seq: s64;

     init :: (root: string) -> Store {
-        return Store.{ root = root };
+        return Store.{ root = root, seq = 0 };
     }

     objects_dir :: (self: *Store) -> string { return path_join(self.root, "objects"); }
@@ -80,10 +90,14 @@ Store :: struct {
         return sp;
     }

-    // Phase 1 (file source): copy `src`'s bytes into `staging/<key>`.
-    stage_copy :: (self: *Store, key: string, src: string) -> (string, !StoreErr) {
+    // Phase 1 (file source): copy `src` once into a provisional staging
+    // file `staging/incoming-<n>`. The key isn't known until these staged
+    // bytes are hashed, so the name is a per-put sequence — never
+    // `objects/<key>`, so an interrupted copy is never a published object.
+    stage_temp_copy :: (self: *Store, src: string) -> (string, !StoreErr) {
         if !fs.create_dir_all(self.staging_dir()) { raise error.Stage; }
-        sp := self.staging_path(key);
+        self.seq += 1;
+        sp := self.staging_path(concat("incoming-", int_to_string(self.seq)));
         if !fs.copy_file(src, sp) { raise error.Stage; }
         return sp;
     }
@@ -106,11 +120,18 @@ Store :: struct {
         return key;
     }

-    // Store a file's bytes and return their storage key. Dedup as above.
+    // Store a file's bytes and return their storage key. The source is
+    // read exactly once — copied into staging, then hashed there — so the
+    // returned key is the SHA-256 of the bytes actually published, not of a
+    // separate read that could disagree. Dedup: if the object already
+    // exists, the staged copy is dropped and the existing key returned.
     put_file :: (self: *Store, path: string) -> (string, !StoreErr) {
-        key := try digest_of_file(path);
-        if self.has(key) { return key; }
-        sp := try self.stage_copy(key, path);
+        sp := try self.stage_temp_copy(path);
+        key := try digest_of_file(sp);
+        if self.has(key) {
+            fs.delete_file(sp);
+            return key;
+        }
         try self.publish(sp, key);
         return key;
     }