niklas/amduat-api

Carl Niklas Rydberg c2000cb6d7 Refine index specs for variable digests and visibility

2026-01-17 07:05:11 +01:00

ENC-ASL-CORE-INDEX

Encoding Specification for ASL Core Index

1. Purpose

This document defines the exact encoding of ASL index segments and records for storage and interoperability.

It translates the semantic model of ASL/1-CORE-INDEX and store contracts of ASL-STORE-INDEX into a deterministic bytes-on-disk layout. Variable-length digest requirements are defined in ASL/1-CORE-INDEX (tier1/asl-core-index.md).

It is intended for:

C libraries
Tools
API frontends
Memory-mapped access

It does not define:

Index semantics (see ASL/1-CORE-INDEX)
Store lifecycle behavior (see ASL-STORE-INDEX)
Acceleration semantics (see ASL/INDEX-ACCEL/1)

2. Encoding Principles

Little-endian representation
Fixed-width fields for deterministic access
No pointers or references; all offsets are file-relative
Packed structures; no compiler-introduced padding
Forward compatibility via version field
CRC or checksum protection for corruption detection

All multi-byte integers are little-endian unless explicitly noted.
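
The little-endian rule can be honoured on any host by assembling integers byte by byte rather than casting pointers. The following is a minimal sketch; the helper names `load_le16`, `load_le32`, and `load_le64` are illustrative and not part of this specification.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* The encoding is defined in terms of 8-bit bytes. */
static_assert(CHAR_BIT == 8, "encoding assumes 8-bit bytes");

/* Read a little-endian 16-bit integer from p (least significant byte first). */
uint16_t load_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a little-endian 32-bit integer from p. */
uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read a little-endian 64-bit integer from p (low 32 bits come first). */
uint64_t load_le64(const uint8_t *p)
{
    return (uint64_t)load_le32(p) | ((uint64_t)load_le32(p + 4) << 32);
}
```

Because these helpers never dereference a wider pointer, they are also safe on unaligned offsets inside a memory-mapped segment.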

3. Segment Layout

Each index segment file is laid out as follows:

+------------------+
| SegmentHeader    |
+------------------+
| BloomFilter[]    | (optional, opaque to semantics)
+------------------+
| IndexRecord[]    |
+------------------+
| DigestBytes[]    |
+------------------+
| ExtentRecord[]   |
+------------------+
| SegmentFooter    |
+------------------+

SegmentHeader: fixed-size, mandatory
BloomFilter: optional, opaque, segment-local
IndexRecord[]: array of index entries
DigestBytes[]: concatenated digest bytes referenced by IndexRecord
ExtentRecord[]: concatenated extent lists referenced by IndexRecord
SegmentFooter: fixed-size, mandatory

Offsets in the header locate the Bloom filter, the IndexRecord array, the DigestBytes array, and the ExtentRecord array.

4. SegmentHeader

#pragma pack(push,1)
typedef struct {
    uint64_t magic;           // Unique magic number identifying segment file type
    uint16_t version;         // Encoding version
    uint16_t shard_id;        // Optional shard identifier
    uint32_t header_size;     // Total size of header including fields below

    uint64_t snapshot_min;    // Minimum snapshot ID (reserved in version 3; see notes)
    uint64_t snapshot_max;    // Maximum snapshot ID (reserved in version 3; see notes)

    uint64_t record_count;    // Number of index entries
    uint64_t records_offset;  // File offset of IndexRecord array

    uint64_t bloom_offset;    // File offset of bloom filter (0 if none)
    uint64_t bloom_size;      // Size of bloom filter (0 if none)

    uint64_t digests_offset;  // File offset of DigestBytes array
    uint64_t digests_size;    // Total size in bytes of DigestBytes

    uint64_t extents_offset;  // File offset of ExtentRecord array
    uint64_t extent_count;    // Total number of ExtentRecord entries

    uint64_t flags;           // Reserved for future use
} SegmentHeader;
#pragma pack(pop)

Notes:

magic lets a reader verify the segment file type before parsing anything else.
version allows forward-compatible extension.
snapshot_min / snapshot_max are reserved for future use and carry no visibility semantics in version 3.
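
A reader would typically validate the header before trusting any offset in it. The sketch below is one possible check, under several assumptions: `EXAMPLE_MAGIC` is a placeholder (this document does not state the magic value), the reader accepts only version 3, the host is little-endian, and the function name and error codes are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#pragma pack(push,1)
typedef struct {
    uint64_t magic;
    uint16_t version;
    uint16_t shard_id;
    uint32_t header_size;
    uint64_t snapshot_min, snapshot_max;
    uint64_t record_count, records_offset;
    uint64_t bloom_offset, bloom_size;
    uint64_t digests_offset, digests_size;
    uint64_t extents_offset, extent_count;
    uint64_t flags;
} SegmentHeader;
#pragma pack(pop)

static_assert(sizeof(SegmentHeader) == 104, "SegmentHeader must pack to 104 bytes");

#define EXAMPLE_MAGIC      0x0123456789ABCDEFULL /* placeholder, NOT the real magic */
#define SUPPORTED_VERSION  3u                    /* assumption: version-3-only reader */
#define INDEX_RECORD_SIZE  40u                   /* packed size of IndexRecord */
#define EXTENT_RECORD_SIZE 16u                   /* packed size of ExtentRecord */

/* True if [off, off + size) lies inside the file; written to avoid overflow. */
static int region_in_file(uint64_t off, uint64_t size, uint64_t file_size)
{
    return off <= file_size && size <= file_size - off;
}

/* Copies the header out of the mapped file and checks it.
 * Returns 0 on success or a negative (hypothetical) error code. */
int segment_header_check(const uint8_t *base, uint64_t file_size, SegmentHeader *out)
{
    if (file_size < sizeof *out) return -1;           /* truncated file */
    memcpy(out, base, sizeof *out);                   /* little-endian host assumed */
    if (out->magic != EXAMPLE_MAGIC) return -2;       /* wrong file type */
    if (out->version != SUPPORTED_VERSION) return -3; /* unsupported version */
    if (out->header_size < sizeof *out || out->header_size > file_size) return -4;
    if (out->record_count > UINT64_MAX / INDEX_RECORD_SIZE ||
        out->extent_count > UINT64_MAX / EXTENT_RECORD_SIZE) return -5;
    if (!region_in_file(out->records_offset, out->record_count * INDEX_RECORD_SIZE, file_size) ||
        !region_in_file(out->digests_offset, out->digests_size, file_size) ||
        !region_in_file(out->extents_offset, out->extent_count * EXTENT_RECORD_SIZE, file_size))
        return -5;                                    /* region outside the file */
    if (out->bloom_offset != 0 &&
        !region_in_file(out->bloom_offset, out->bloom_size, file_size)) return -5;
    return 0;
}
```

The sketch checks only that each region fits inside the file; it does not verify region ordering or the footer.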

5. IndexRecord

#pragma pack(push,1)
typedef struct {
    uint32_t hash_id;         // Hash algorithm identifier
    uint16_t digest_len;      // Digest length in bytes
    uint16_t reserved0;       // Reserved for alignment/future use
    uint64_t digest_offset;   // File offset of digest bytes for this entry

    uint64_t extents_offset;  // File offset of first ExtentRecord for this entry
    uint32_t extent_count;    // Number of ExtentRecord entries for this artifact
    uint32_t total_length;    // Total artifact length in bytes

    uint32_t flags;           // Optional flags (tombstone, reserved, etc.)
    uint32_t reserved;        // Reserved for alignment/future use
} IndexRecord;
#pragma pack(pop)

Notes:

hash_id + digest_len + digest_offset store the artifact key deterministically.
digest_len MUST be explicit in the encoding and MUST match the length implied by hash_id and StoreConfig.
extents_offset references the first ExtentRecord for this entry.
extent_count defines how many extents to read (may be 0 for tombstones; see ASL/1-CORE-INDEX in tier1/asl-core-index.md).
total_length is the exact artifact size in bytes.
Flags may indicate tombstone or other special status.
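
On hosts where a direct struct overlay is not appropriate (for example big-endian machines), a record can be decoded field by field from its 40 packed bytes. This is a sketch of that approach; `index_record_decode` and the `le*` helpers are illustrative names, and the byte offsets follow from the packed struct above.

```c
#include <assert.h>
#include <stdint.h>

#pragma pack(push,1)
typedef struct {
    uint32_t hash_id;
    uint16_t digest_len;
    uint16_t reserved0;
    uint64_t digest_offset;
    uint64_t extents_offset;
    uint32_t extent_count;
    uint32_t total_length;
    uint32_t flags;
    uint32_t reserved;
} IndexRecord;
#pragma pack(pop)

static_assert(sizeof(IndexRecord) == 40, "IndexRecord must pack to 40 bytes");

static uint16_t le16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
static uint64_t le64(const uint8_t *p)
{
    return (uint64_t)le32(p) | ((uint64_t)le32(p + 4) << 32);
}

/* Decode one on-disk IndexRecord (40 bytes at buf) into host byte order. */
void index_record_decode(const uint8_t *buf, IndexRecord *out)
{
    out->hash_id        = le32(buf + 0);
    out->digest_len     = le16(buf + 4);
    out->reserved0      = le16(buf + 6);
    out->digest_offset  = le64(buf + 8);
    out->extents_offset = le64(buf + 16);
    out->extent_count   = le32(buf + 24);
    out->total_length   = le32(buf + 28);
    out->flags          = le32(buf + 32);
    out->reserved       = le32(buf + 36);
}
```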

6. ExtentRecord

#pragma pack(push,1)
typedef struct {
    uint64_t block_id;    // ASL block identifier
    uint32_t offset;      // Offset within block
    uint32_t length;      // Length of this extent
} ExtentRecord;
#pragma pack(pop)

Notes:

Extents are concatenated in order to produce artifact bytes.
extent_count MUST be > 0 for visible (non-tombstone) entries.
total_length MUST equal the sum of length across the extents.
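
The two MUST rules above can be checked mechanically when a segment is read or written. A minimal sketch follows; the function name is hypothetical, and because this document does not define the tombstone bit in `flags`, the caller passes tombstone status as a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#pragma pack(push,1)
typedef struct {
    uint64_t block_id;
    uint32_t offset;
    uint32_t length;
} ExtentRecord;
#pragma pack(pop)

static_assert(sizeof(ExtentRecord) == 16, "ExtentRecord must pack to 16 bytes");

/* Returns true if the extent list is consistent with the record fields:
 *  - visible (non-tombstone) entries have at least one extent;
 *  - total_length equals the sum of the extent lengths. */
bool extents_consistent(const ExtentRecord *ext, uint32_t extent_count,
                        uint32_t total_length, bool is_tombstone)
{
    if (extent_count == 0 && !is_tombstone)
        return false;

    /* A 64-bit accumulator cannot overflow: at most 2^32 - 1 extents,
     * each at most 2^32 - 1 bytes long. */
    uint64_t sum = 0;
    for (uint32_t i = 0; i < extent_count; i++)
        sum += ext[i].length;

    return sum == total_length;
}
```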

7. SegmentFooter

#pragma pack(push,1)
typedef struct {
    uint64_t crc64;          // CRC over header + bloom filter + index records + digest bytes + extents
    uint64_t seal_snapshot;  // Snapshot ID when segment was sealed
    uint64_t seal_time_ns;   // High-resolution seal timestamp
} SegmentFooter;
#pragma pack(pop)

Notes:

CRC ensures corruption detection during reads, covering all segment contents except the footer.
Seal information allows deterministic reconstruction of CURRENT state.
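
This document does not name the CRC-64 variant, so the sketch below assumes CRC-64/XZ (reflected ECMA-182 polynomial, all-ones initial value and final XOR) purely for illustration. It also assumes the footer occupies the last 24 bytes of the file, that the CRC covers every byte before it, and that the host is little-endian. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#pragma pack(push,1)
typedef struct {
    uint64_t crc64;
    uint64_t seal_snapshot;
    uint64_t seal_time_ns;
} SegmentFooter;
#pragma pack(pop)

static_assert(sizeof(SegmentFooter) == 24, "SegmentFooter must pack to 24 bytes");

/* ASSUMPTION: CRC-64/XZ parameters; the real variant is not given here.
 * Bit-at-a-time implementation, chosen for clarity over speed. */
uint64_t crc64_xz(const uint8_t *data, size_t len)
{
    const uint64_t poly = 0xC96C5795D7870F42ULL; /* reflected polynomial */
    uint64_t crc = ~0ULL;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ poly : (crc >> 1);
    }
    return ~crc;
}

/* Recompute the CRC over everything before the footer and compare it
 * with the stored value. */
bool segment_crc_ok(const uint8_t *base, size_t file_size)
{
    if (file_size < sizeof(SegmentFooter))
        return false;
    size_t body_len = file_size - sizeof(SegmentFooter);
    SegmentFooter footer;
    memcpy(&footer, base + body_len, sizeof footer); /* little-endian host assumed */
    return crc64_xz(base, body_len) == footer.crc64;
}
```

A production reader would more likely use a table-driven or hardware-assisted CRC; the bitwise loop is only meant to make the coverage rule concrete.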

8. DigestBytes

Digest bytes are concatenated in a single byte array.
Each IndexRecord references its digest via digest_offset and digest_len.
The digest bytes MUST be immutable once the segment is sealed.
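
A lookup resolves a record's digest by checking that the referenced range falls inside the DigestBytes region and then comparing bytes. The sketch below illustrates this; the function names are hypothetical, and it assumes the caller has already verified that the region `[digests_offset, digests_offset + digests_size)` lies inside the mapped file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns a pointer to the digest bytes of a record, or NULL if the range
 * [digest_offset, digest_offset + digest_len) is not fully contained in the
 * DigestBytes region described by the segment header. */
const uint8_t *record_digest(const uint8_t *base,
                             uint64_t digests_offset, uint64_t digests_size,
                             uint64_t digest_offset, uint16_t digest_len)
{
    assert(base != NULL);
    if (digest_offset < digests_offset)
        return NULL;
    uint64_t rel = digest_offset - digests_offset; /* offset within the region */
    if (rel > digests_size || digest_len > digests_size - rel)
        return NULL;
    return base + (size_t)digest_offset;
}

/* True if the record's stored digest equals key (same length, same bytes). */
bool digest_matches(const uint8_t *base,
                    uint64_t digests_offset, uint64_t digests_size,
                    uint64_t digest_offset, uint16_t digest_len,
                    const uint8_t *key, size_t key_len)
{
    if (key_len != digest_len)
        return false;
    const uint8_t *d = record_digest(base, digests_offset, digests_size,
                                     digest_offset, digest_len);
    return d != NULL && memcmp(d, key, digest_len) == 0;
}
```

A full key comparison would also compare `hash_id`, which is omitted here for brevity.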

9. Bloom Filter

The bloom filter is optional and opaque to semantics.
Its purpose is lookup acceleration.
Must be deterministic: same entries → same bloom representation.
Segment-local only; no global assumptions.

10. Versioning and Compatibility

version field in header defines encoding.
Readers must reject unsupported versions.
New fields may be added in future versions only via version bump.
Existing fields must never change meaning.
Version 1 implies single-extent layout (legacy).
Version 2 introduces ExtentRecord lists and extents_offset / extent_count.
Version 3 introduces variable-length digest bytes with hash_id and digest_offset.
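
The version history above maps naturally onto a small set of feature checks. The following sketch shows one way a reader might gate on the version field; the enum and function names are illustrative, and `reader_accepts` models a hypothetical reader that implements only version 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    ENC_V1_SINGLE_EXTENT    = 1, /* legacy single-extent layout */
    ENC_V2_EXTENT_LISTS     = 2, /* adds ExtentRecord lists */
    ENC_V3_VARIABLE_DIGESTS = 3, /* adds hash_id and digest_offset */
    ENC_LATEST              = ENC_V3_VARIABLE_DIGESTS
};

static_assert(ENC_LATEST == 3, "revisit the feature checks when adding a version");

/* True for versions described in this document. */
bool version_is_known(uint16_t v)
{
    return v >= ENC_V1_SINGLE_EXTENT && v <= ENC_LATEST;
}

/* True if records carry extents_offset / extent_count (version 2 and later). */
bool version_has_extent_lists(uint16_t v)
{
    return version_is_known(v) && v >= ENC_V2_EXTENT_LISTS;
}

/* True if digests are stored out of line in DigestBytes (version 3). */
bool version_has_variable_digests(uint16_t v)
{
    return version_is_known(v) && v >= ENC_V3_VARIABLE_DIGESTS;
}

/* Hypothetical version-3-only reader: everything else is rejected. */
bool reader_accepts(uint16_t v)
{
    return v == ENC_LATEST;
}
```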

11. Alignment and Packing

All structures are packed (no compiler padding)
Multi-byte integers are little-endian
Memory-mapped readers can directly index IndexRecord[] using records_offset.
Extents are accessed via IndexRecord.extents_offset relative to the file base.
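
Because the structures are packed and fixed-width, record i sits at `records_offset + i * 40` from the file base. The sketch below shows a bounds-checked fetch from a mapped file; the function name is hypothetical, a little-endian host is assumed, and `memcpy` is used so the access stays valid even when the record is not naturally aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#pragma pack(push,1)
typedef struct {
    uint32_t hash_id;
    uint16_t digest_len;
    uint16_t reserved0;
    uint64_t digest_offset;
    uint64_t extents_offset;
    uint32_t extent_count;
    uint32_t total_length;
    uint32_t flags;
    uint32_t reserved;
} IndexRecord;
#pragma pack(pop)

static_assert(sizeof(IndexRecord) == 40, "IndexRecord must pack to 40 bytes");

/* Copy IndexRecord[i] out of a mapped segment. Returns false if i is out of
 * range or the record would extend past the end of the file. */
bool index_record_at(const uint8_t *base, uint64_t file_size,
                     uint64_t records_offset, uint64_t record_count,
                     uint64_t i, IndexRecord *out)
{
    if (i >= record_count || records_offset > file_size)
        return false;

    /* Number of whole records that fit between records_offset and EOF;
     * comparing against it avoids overflow in i * sizeof(IndexRecord). */
    uint64_t fit = (file_size - records_offset) / sizeof(IndexRecord);
    if (i >= fit)
        return false;

    memcpy(out, base + (size_t)(records_offset + i * sizeof(IndexRecord)),
           sizeof *out);
    return true;
}
```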

12. Summary of Encoding Guarantees

The ENC-ASL-CORE-INDEX specification ensures:

Deterministic layout across platforms
Direct mapping from semantic model (ArtifactKey → ArtifactLocation)
Immutability of sealed segments
Integrity validation via CRC
Forward-compatible extensibility

13. Relationship to Other Layers

Layer	Responsibility
ASL/1-CORE-INDEX	Defines semantic meaning of artifact → location mapping
ASL-STORE-INDEX	Defines lifecycle, visibility, and replay contracts
ASL/INDEX-ACCEL/1	Defines routing, filters, sharding (observationally inert)
ENC-ASL-CORE-INDEX	Defines exact bytes-on-disk format for segment persistence

This completes the stack: semantics → store behavior → encoding.
