amduat/tier1/enc-asl-core-index-1.md
2026-01-17 11:46:57 +01:00

# ENC/ASL-CORE-INDEX/1 — Encoding Specification for ASL Core Index
Status: Draft
Owner: Niklas Rydberg
Version: 0.1.0
SoT: No
Last Updated: 2025-11-16
Linked Phase Pack: N/A
Tags: [encoding, index, deterministic]
<!-- Source: /amduat-api/tier1/enc-asl-core-index.md | Canonical: /amduat/tier1/enc-asl-core-index-1.md -->
**Document ID:** `ENC/ASL-CORE-INDEX/1`
**Layer:** Index Encoding Profile (on top of ASL/1-CORE-INDEX + ASL/STORE-INDEX/1)
**Depends on (normative):**
* `ASL/1-CORE-INDEX` — semantic index model
* `ASL/STORE-INDEX/1` — store lifecycle and replay contracts
**Informative references:**
* `ASL/LOG/1` — append-only log semantics
© 2025 Niklas Rydberg.
## License
Except where otherwise noted, this document (text and diagrams) is licensed under
the Creative Commons Attribution 4.0 International License (CC BY 4.0).
The identifier registries and mapping tables (e.g. TypeTag IDs, HashId
assignments, EdgeTypeId tables) are additionally made available under CC0 1.0
Universal (CC0) to enable unrestricted reuse in implementations and derivative
specifications.
Code examples in this document are provided under the Apache License 2.0 unless
explicitly stated otherwise. Test vectors, where present, are dedicated to the
public domain under CC0 1.0.
---
## 1. Purpose
This document defines the **exact encoding of ASL index segments** and records for storage and interoperability.
It translates the **semantic model of ASL/1-CORE-INDEX** and the **store contracts of ASL/STORE-INDEX/1** into a deterministic **bytes-on-disk layout**.
Variable-length digest requirements are defined in ASL/1-CORE-INDEX (`tier1/asl-core-index.md`).
This document incorporates the federation encoding addendum.
It is intended for:
* C libraries
* Tools
* API frontends
* Memory-mapped access
It does **not** define:
* Index semantics (see ASL/1-CORE-INDEX)
* Store lifecycle behavior (see ASL/STORE-INDEX/1)
* Acceleration semantics (see ASL/INDEX-ACCEL/1)
* TGK edge semantics or encodings (see `TGK/1` and `TGK/1-CORE`)
* Federation semantics (see federation/domain policy layers)
---
## 2. Encoding Principles
1. **Little-endian** representation
2. **Fixed-width fields** for deterministic access
3. **No pointers or references**; all offsets are file-relative
4. **Packed structures**; no compiler-introduced padding
5. **Forward compatibility** via version field
6. **CRC or checksum protection** for corruption detection
7. **Federation metadata** embedded in index records for deterministic cross-domain replay
All multi-byte integers are little-endian unless explicitly noted.
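
As a minimal sketch of principle 1, the helpers below decode little-endian integers from a raw byte buffer without depending on host byte order or alignment. The names `rd_le16`, `rd_le32`, and `rd_le64` are hypothetical and not part of this specification.

```c
#include <stdint.h>

/* Hypothetical helpers (not defined by this spec): decode little-endian
 * integers byte by byte, so the result is the same on any host. */
static inline uint16_t rd_le16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static inline uint32_t rd_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static inline uint64_t rd_le64(const uint8_t *p) {
    return (uint64_t)rd_le32(p) | ((uint64_t)rd_le32(p + 4) << 32);
}
```

Decoding field by field like this is slower than casting a pointer to a packed struct, but it stays correct on big-endian hosts and on platforms that fault on unaligned loads.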
---
## 3. Segment Layout
Each index segment file is laid out as follows:
```
+------------------+
| SegmentHeader |
+------------------+
| BloomFilter[] | (optional, opaque to semantics)
+------------------+
| IndexRecord[] |
+------------------+
| DigestBytes[] |
+------------------+
| ExtentRecord[] |
+------------------+
| SegmentFooter |
+------------------+
```
* **SegmentHeader**: fixed-size, mandatory
* **BloomFilter**: optional, opaque, segment-local
* **IndexRecord[]**: array of index entries
* **DigestBytes[]**: concatenated digest bytes referenced by IndexRecord
* **ExtentRecord[]**: concatenated extent lists referenced by IndexRecord
* **SegmentFooter**: fixed-size, mandatory
Offsets in the header give the file-relative locations of the Bloom filter, index records, digest bytes, and extent records.
### 3.1 Fixed Constants and Sizes
**Magic bytes (SegmentHeader.magic):** `ASLIDX03`
* ASCII bytes: `0x41 0x53 0x4c 0x49 0x44 0x58 0x30 0x33`
* Little-endian uint64 value: `0x33305844494c5341`
**Current encoding version:** `3`
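
The relationship between the ASCII magic and its little-endian integer form can be checked with a short sketch. The macro and function names below are hypothetical; only the byte values and the version number come from this document.

```c
#include <stdint.h>
#include <string.h>

#define ASLIDX_MAGIC_U64 0x33305844494c5341ULL /* "ASLIDX03" read as LE uint64 */
#define ASLIDX_VERSION   3u                    /* current encoding version */

/* Hypothetical helper: returns 1 if the first 8 bytes of a segment file
 * are the ASCII magic "ASLIDX03", else 0. */
static int aslidx_magic_ok(const uint8_t *file_bytes) {
    return memcmp(file_bytes, "ASLIDX03", 8) == 0;
}

/* Assemble the first 8 bytes as a little-endian uint64 (byte 0 is least
 * significant), independent of host byte order. */
static uint64_t aslidx_magic_as_u64(const uint8_t *file_bytes) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | file_bytes[i];
    return v;
}
```

Comparing raw bytes with `memcmp` avoids any byte-order concern; the integer form is convenient once the header has already been decoded into a struct.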
**Fixed struct sizes (bytes):**
* `SegmentHeader`: 112
* `IndexRecord`: 48
* `ExtentRecord`: 16
* `SegmentFooter`: 24
**Section packing (no gaps):**
* `records_offset = header_size + bloom_size`
* `digests_offset = records_offset + (record_count * sizeof(IndexRecord))`
* `extents_offset = digests_offset + digests_size`
* `SegmentFooter` starts at `extents_offset + (extent_count * sizeof(ExtentRecord))`
All section offsets in `SegmentHeader` MUST be file-relative, 8-byte aligned, and point to their respective arrays exactly as above.
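
The packing rules above can be sketched as a small layout calculator. `SegmentLayout` and `segment_layout` are hypothetical names, overflow checks are omitted for brevity, and the alignment check assumes the 8-byte rule applies to the three section offsets stored in the header.

```c
#include <stdint.h>

enum {
    HEADER_SIZE        = 112,
    INDEX_RECORD_SIZE  = 48,
    EXTENT_RECORD_SIZE = 16,
    FOOTER_SIZE        = 24
};

/* Hypothetical helper type: where each section starts in the file. */
typedef struct {
    uint64_t records_offset;
    uint64_t digests_offset;
    uint64_t extents_offset;
    uint64_t footer_offset;
    uint64_t file_size;
} SegmentLayout;

/* Apply the no-gap packing formulas. Returns 0 on success, or -1 if a
 * section offset would not be 8-byte aligned. No overflow checking. */
static int segment_layout(uint64_t bloom_size, uint64_t record_count,
                          uint64_t digests_size, uint64_t extent_count,
                          SegmentLayout *out) {
    out->records_offset = HEADER_SIZE + bloom_size;
    out->digests_offset = out->records_offset + record_count * INDEX_RECORD_SIZE;
    out->extents_offset = out->digests_offset + digests_size;
    out->footer_offset  = out->extents_offset + extent_count * EXTENT_RECORD_SIZE;
    out->file_size      = out->footer_offset + FOOTER_SIZE;
    if ((out->records_offset | out->digests_offset | out->extents_offset) & 7u)
        return -1;
    return 0;
}
```

For example, a segment with no bloom filter, 2 records, 64 digest bytes, and 3 extents places records at 112, digests at 208, extents at 272, and the footer at 320. Because the sections are packed without gaps, the alignment check only passes when `bloom_size` and `digests_size` are multiples of 8; the fixed struct sizes already are.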
### 3.2 Federation Defaults
This encoding integrates federation metadata into segments and records.
Legacy segments without federation fields MUST be treated as:
* `segment_domain_id = local`
* `segment_visibility = internal`
* `domain_id = local`
* `visibility = internal`
* `has_cross_domain_source = 0`
* `cross_domain_source = 0`
**Handling rules:**
* Encoders for version 3 MUST write explicit federation fields in both
`SegmentHeader` and `IndexRecord`; these fields are not optional in v3.
* Decoders MUST accept older versions that omit federation fields and apply the
defaults above.
* Decoders MUST reject v3 segments if federation fields are missing, malformed,
or contain out-of-range values (e.g., `visibility` not in {0,1} or
`has_cross_domain_source` not in {0,1}).
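
A decoder's handling of these rules might look like the sketch below. `FederationFields`, `federation_resolve`, and the `local_domain_id` parameter are hypothetical; this document does not assign a numeric value to `local`, so the caller supplies it.

```c
#include <stdint.h>

/* Hypothetical decoded view of the per-record federation fields. */
typedef struct {
    uint32_t domain_id;
    uint8_t  visibility;              /* 0 = internal, 1 = published */
    uint8_t  has_cross_domain_source; /* 0 or 1 */
    uint32_t cross_domain_source;
} FederationFields;

/* For versions before 3, ignore `raw` and apply the legacy defaults.
 * For version 3, reject out-of-range values. Returns 0 on success,
 * -1 on rejection. */
static int federation_resolve(uint16_t version, uint32_t local_domain_id,
                              const FederationFields *raw,
                              FederationFields *out) {
    if (version < 3) {
        out->domain_id = local_domain_id; /* domain_id = local */
        out->visibility = 0;              /* visibility = internal */
        out->has_cross_domain_source = 0;
        out->cross_domain_source = 0;
        return 0;
    }
    if (raw->visibility > 1 || raw->has_cross_domain_source > 1)
        return -1;
    if (raw->has_cross_domain_source == 0 && raw->cross_domain_source != 0)
        return -1;
    *out = *raw;
    return 0;
}
```

Resolving defaults at decode time means the rest of a reader can treat every record as if it carried explicit federation fields.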
---
## 4. SegmentHeader
```c
#include <stdint.h>

#pragma pack(push,1)
typedef struct {
    uint64_t magic;              // Magic number identifying the segment file type ("ASLIDX03")
    uint16_t version;            // Encoding version
    uint16_t shard_id;           // Optional shard identifier
    uint32_t header_size;        // Total size of this header in bytes (MUST be 112)
    uint64_t snapshot_min;       // Minimum snapshot ID (reserved; no visibility semantics in version 3)
    uint64_t snapshot_max;       // Maximum snapshot ID (reserved; no visibility semantics in version 3)
    uint64_t record_count;       // Number of index entries
    uint64_t records_offset;     // File offset of IndexRecord array
    uint64_t bloom_offset;       // File offset of bloom filter (0 if none)
    uint64_t bloom_size;         // Size of bloom filter (0 if none)
    uint64_t digests_offset;     // File offset of DigestBytes array
    uint64_t digests_size;       // Total size in bytes of DigestBytes
    uint64_t extents_offset;     // File offset of ExtentRecord array
    uint64_t extent_count;       // Total number of ExtentRecord entries
    uint32_t segment_domain_id;  // Domain owning this segment
    uint8_t  segment_visibility; // 0 = internal, 1 = published
    uint8_t  federation_version; // 0 if unused
    uint16_t reserved0;          // Reserved (must be 0)
    uint64_t flags;              // Segment flags (must be 0 in version 3)
} SegmentHeader;
#pragma pack(pop)
```
**Notes:**
* `magic` ensures the reader validates the segment type.
* `version` allows forward-compatible extension.
* `snapshot_min` / `snapshot_max` are reserved for future use and carry no visibility semantics in version 3.
* `segment_domain_id` identifies the owning domain for all records in this segment.
* `segment_visibility` MUST be the maximum visibility of all records in the segment.
* `federation_version` MUST be `0` unless a future federation encoding version is defined.
* `reserved0` MUST be `0`.
* `header_size` MUST be `112`.
* `flags` MUST be `0`. Readers MUST reject non-zero values.
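
The header rules above can be sketched as a validation routine. The function name and return codes are hypothetical, and the sketch assumes a little-endian host so that struct fields can be compared directly; a portable reader would decode each field byte by byte instead.

```c
#include <assert.h>
#include <stdint.h>

#pragma pack(push,1)
typedef struct {
    uint64_t magic;
    uint16_t version;
    uint16_t shard_id;
    uint32_t header_size;
    uint64_t snapshot_min;
    uint64_t snapshot_max;
    uint64_t record_count;
    uint64_t records_offset;
    uint64_t bloom_offset;
    uint64_t bloom_size;
    uint64_t digests_offset;
    uint64_t digests_size;
    uint64_t extents_offset;
    uint64_t extent_count;
    uint32_t segment_domain_id;
    uint8_t  segment_visibility;
    uint8_t  federation_version;
    uint16_t reserved0;
    uint64_t flags;
} SegmentHeader;
#pragma pack(pop)

/* Catch accidental padding at compile time. */
static_assert(sizeof(SegmentHeader) == 112, "SegmentHeader must be 112 bytes");

/* Hypothetical check for a version 3 header. Returns 0 if all rules pass,
 * or a negative code naming the first failed rule. Assumes LE host. */
static int segment_header_check(const SegmentHeader *h) {
    if (h->magic != 0x33305844494c5341ULL) return -1; /* "ASLIDX03" */
    if (h->version != 3)                   return -2;
    if (h->header_size != 112)             return -3;
    if (h->segment_visibility > 1)         return -4;
    if (h->federation_version != 0 || h->reserved0 != 0) return -5;
    if (h->flags != 0)                     return -6;
    return 0;
}
```

The `static_assert` is a cheap guard: if a compiler ignores the packing pragma, the build fails rather than producing a reader that silently misparses segments.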
---
## 5. IndexRecord
```c
#pragma pack(push,1)
typedef struct {
    uint32_t hash_id;                 // Hash algorithm identifier
    uint16_t digest_len;              // Digest length in bytes
    uint16_t reserved0;               // Reserved for alignment/future use
    uint64_t digest_offset;           // File offset of digest bytes for this entry
    uint64_t extents_offset;          // File offset of first ExtentRecord for this entry
    uint32_t extent_count;            // Number of ExtentRecord entries for this artifact
    uint32_t total_length;            // Total artifact length in bytes
    uint32_t domain_id;               // Domain identifier for this artifact
    uint8_t  visibility;              // 0 = internal, 1 = published
    uint8_t  has_cross_domain_source; // 0 or 1
    uint16_t reserved1;               // Reserved (must be 0)
    uint32_t cross_domain_source;     // Source domain if imported (valid if has_cross_domain_source=1)
    uint32_t flags;                   // Optional flags (tombstone, reserved, etc.)
} IndexRecord;
#pragma pack(pop)
```
**Notes:**
* `hash_id` + `digest_len` + `digest_offset` store the artifact key deterministically.
* `digest_len` MUST be explicit in the encoding and MUST match the length implied by `hash_id` and StoreConfig.
* `digest_offset` MUST be within `[digests_offset, digests_offset + digests_size)`.
* `extents_offset` references the first ExtentRecord for this entry.
* `extent_count` defines how many extents to read (MUST be `0` for tombstones; see 5.1 and ASL/1-CORE-INDEX in `tier1/asl-core-index.md`).
* `total_length` is the exact artifact size in bytes.
* `flags` marks tombstones; 5.1 lists the defined bits.
* `domain_id` MUST be present and stable across replay.
* `visibility` MUST be `0` or `1`.
* `has_cross_domain_source` MUST be `0` or `1`.
* `cross_domain_source` MUST be `0` when `has_cross_domain_source=0`.
* `reserved0` and `reserved1` MUST be `0`.
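
A per-record check of the notes above might look like this sketch. The function name and return codes are hypothetical, and the final bounds check (that the whole digest fits inside the digest section) is an extra sanity check assumed here, not a rule stated in this document.

```c
#include <assert.h>
#include <stdint.h>

#pragma pack(push,1)
typedef struct {
    uint32_t hash_id;
    uint16_t digest_len;
    uint16_t reserved0;
    uint64_t digest_offset;
    uint64_t extents_offset;
    uint32_t extent_count;
    uint32_t total_length;
    uint32_t domain_id;
    uint8_t  visibility;
    uint8_t  has_cross_domain_source;
    uint16_t reserved1;
    uint32_t cross_domain_source;
    uint32_t flags;
} IndexRecord;
#pragma pack(pop)

static_assert(sizeof(IndexRecord) == 48, "IndexRecord must be 48 bytes");

/* Hypothetical check; digests_offset and digests_size come from the
 * segment header. Returns 0 if the record passes, negative otherwise. */
static int index_record_check(const IndexRecord *r,
                              uint64_t digests_offset, uint64_t digests_size) {
    uint64_t end = digests_offset + digests_size;
    if (r->reserved0 != 0 || r->reserved1 != 0) return -1;
    if (r->visibility > 1 || r->has_cross_domain_source > 1) return -2;
    if (!r->has_cross_domain_source && r->cross_domain_source != 0) return -3;
    if (r->digest_offset < digests_offset || r->digest_offset >= end) return -4;
    /* Assumption beyond the spec text: the full digest must also fit. */
    if (r->digest_len > end - r->digest_offset) return -5;
    return 0;
}
```

Checking `digest_len` against `StoreConfig` is left out because that mapping is defined elsewhere.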
### 5.1 IndexRecord Flags
```
IDX_FLAG_TOMBSTONE = 0x00000001
```
* If `IDX_FLAG_TOMBSTONE` is set, then `extent_count`, `total_length`, and `extents_offset` MUST be `0`.
* All other bits are reserved and MUST be `0`. Readers MUST reject unknown flag bits.
* Tombstones MUST retain valid `domain_id` and `visibility` to ensure domain-local shadowing.
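
The flag rules can be sketched as a standalone check that takes the relevant record fields as plain arguments. `IDX_FLAG_KNOWN`, the function name, and the return codes are hypothetical; the non-tombstone branch applies the `extent_count > 0` rule from section 6.

```c
#include <stdint.h>

#define IDX_FLAG_TOMBSTONE 0x00000001u
#define IDX_FLAG_KNOWN     IDX_FLAG_TOMBSTONE /* hypothetical mask of defined bits */

/* Returns 0 if the flags are consistent with the other fields,
 * negative otherwise. */
static int index_flags_check(uint32_t flags, uint32_t extent_count,
                             uint32_t total_length, uint64_t extents_offset) {
    if (flags & ~IDX_FLAG_KNOWN)
        return -1; /* reserved bit set: reject */
    if (flags & IDX_FLAG_TOMBSTONE) {
        if (extent_count != 0 || total_length != 0 || extents_offset != 0)
            return -2; /* tombstones carry no extents */
    } else if (extent_count == 0) {
        return -3; /* non-tombstone entries need at least one extent */
    }
    return 0;
}
```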
---
## 6. ExtentRecord
```c
#pragma pack(push,1)
typedef struct {
    uint64_t block_id; // ASL block identifier
    uint32_t offset;   // Offset within block
    uint32_t length;   // Length of this extent
} ExtentRecord;
#pragma pack(pop)
```
**Notes:**
* Extents are concatenated in order to produce artifact bytes.
* `extent_count` MUST be > 0 for visible (non-tombstone) entries.
* `total_length` MUST equal the sum of `length` across the extents.
* `offset` and `length` MUST describe a contiguous slice within the referenced block.
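
The `total_length` rule can be checked with a short sketch. The function name is hypothetical; the sum is accumulated in 64 bits so that many 32-bit lengths cannot wrap around.

```c
#include <assert.h>
#include <stdint.h>

#pragma pack(push,1)
typedef struct {
    uint64_t block_id;
    uint32_t offset;
    uint32_t length;
} ExtentRecord;
#pragma pack(pop)

static_assert(sizeof(ExtentRecord) == 16, "ExtentRecord must be 16 bytes");

/* Hypothetical helper: returns 1 if the extent lengths add up to
 * total_length, else 0. Uses a 64-bit accumulator to avoid overflow. */
static int extents_match_length(const ExtentRecord *ext, uint32_t extent_count,
                                uint32_t total_length) {
    uint64_t sum = 0;
    for (uint32_t i = 0; i < extent_count; i++)
        sum += ext[i].length;
    return sum == total_length;
}
```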
---
## 7. SegmentFooter
```c
#pragma pack(push,1)
typedef struct {
    uint64_t crc64;         // CRC over header + bloom filter + index records + digest bytes + extents
    uint64_t seal_snapshot; // Snapshot ID when segment was sealed
    uint64_t seal_time_ns;  // High-resolution seal timestamp
} SegmentFooter;
#pragma pack(pop)
```
**Notes:**
* CRC ensures corruption detection during reads, covering all segment contents except the footer.
* Seal information allows deterministic reconstruction of CURRENT state.
**Implementation note:** The segment file bytes are hashed for log sealing as
defined in `ENC/ASL-LOG/1`. The hash covers the footer as written, so sealing
must occur after the footer is finalized.
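
Since the CRC covers everything except the footer, a reader first needs the covered byte range and the stored value. The sketch below extracts both; it deliberately stops short of computing the CRC because this document does not name the CRC-64 variant. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { SEGMENT_HEADER_SIZE = 112, SEGMENT_FOOTER_SIZE = 24 };

/* Hypothetical helper: given a whole segment file in memory, report how
 * many leading bytes the CRC covers and the crc64 value stored in the
 * footer. Returns 0 on success, -1 if the file is too small to hold a
 * header and a footer. */
static int segment_crc_input(const uint8_t *file, size_t file_size,
                             size_t *covered_len, uint64_t *stored_crc) {
    if (file_size < SEGMENT_HEADER_SIZE + SEGMENT_FOOTER_SIZE)
        return -1;
    *covered_len = file_size - SEGMENT_FOOTER_SIZE;

    /* crc64 is the first footer field, stored little-endian. */
    const uint8_t *f = file + *covered_len;
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | f[i];
    *stored_crc = v;
    return 0;
}
```

A caller would then run its CRC-64 implementation over `file[0 .. covered_len)` and compare the result with `stored_crc`.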
---
## 8. DigestBytes
* Digest bytes are concatenated in a single byte array.
* Each IndexRecord references its digest via `digest_offset` and `digest_len`.
* The digest bytes MUST be immutable once the segment is sealed.
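
A lookup compares a candidate key against the digest bytes a record points to. The sketch below is one way to do that; the function name and parameter order are hypothetical, and it assumes `digest_offset` and `digest_len` were already validated against the digest section bounds.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the record's key (hash_id plus digest
 * bytes at digest_offset) equals the candidate key, else 0.
 * Assumes digest_offset/digest_len were bounds-checked beforehand. */
static int digest_matches(const uint8_t *file_base,
                          uint32_t rec_hash_id, uint64_t digest_offset,
                          uint16_t digest_len,
                          uint32_t key_hash_id, const uint8_t *key,
                          size_t key_len) {
    if (rec_hash_id != key_hash_id || key_len != digest_len)
        return 0;
    return memcmp(file_base + digest_offset, key, key_len) == 0;
}
```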
---
## 9. Bloom Filter
* The bloom filter is **optional** and opaque to semantics.
* Its purpose is **lookup acceleration**.
* It must be deterministic: the same set of entries always yields the same bloom representation.
* It is segment-local only; it carries no global assumptions.
---
## 10. Versioning and Compatibility
* `version` field in header defines encoding.
* Readers must **reject unsupported versions**.
* New fields may be added in future versions only via version bump.
* Existing fields must **never change meaning**.
* Version `1` implies single-extent layout (legacy).
* Version `2` introduces `ExtentRecord` lists and `extents_offset` / `extent_count`.
* Version `3` introduces variable-length digest bytes with `hash_id` and `digest_offset`.
* Version `3` also integrates federation metadata in segment headers and index records.
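
The version history above can be summarized as a feature lookup. The enum and function names are hypothetical; a real reader would reject any version it does not implement, which may be a narrower set than the three listed here.

```c
#include <stdint.h>

/* Hypothetical feature bits, one per capability named in the list above. */
enum {
    FEAT_EXTENT_LISTS   = 1 << 0, /* ExtentRecord lists (version 2+) */
    FEAT_VARLEN_DIGESTS = 1 << 1, /* variable-length digests (version 3) */
    FEAT_FEDERATION     = 1 << 2  /* federation metadata (version 3) */
};

/* Returns the feature bits for a known encoding version, or -1 for a
 * version this sketch does not know and would therefore reject. */
static int aslidx_version_features(uint16_t version) {
    switch (version) {
    case 1:  return 0; /* legacy single-extent layout */
    case 2:  return FEAT_EXTENT_LISTS;
    case 3:  return FEAT_EXTENT_LISTS | FEAT_VARLEN_DIGESTS | FEAT_FEDERATION;
    default: return -1;
    }
}
```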
### 10.1 Federation Compatibility Rules
* Legacy segments without federation fields are treated as local/internal (see 3.2).
* Tombstones MUST NOT shadow artifacts from other domains; domain matching is required.
---
## 11. Alignment and Packing
* All structures are **packed** (no compiler padding).
* Multi-byte integers are **little-endian**.
* Memory-mapped readers can directly index `IndexRecord[]` using `records_offset`.
* Extents are accessed via `IndexRecord.extents_offset` relative to the file base.
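
Direct indexing into a mapped segment can be sketched as pointer arithmetic on the file base. The function names are hypothetical; `base` stands for the start of the mapped file, however it was obtained. The field read decodes bytes individually rather than casting to a struct, which avoids assuming host byte order.

```c
#include <stddef.h>
#include <stdint.h>

enum { INDEX_RECORD_SIZE = 48 };

/* Hypothetical helper: pointer to the first byte of record i, or NULL if
 * i is out of range. records_offset and record_count come from the header. */
static const uint8_t *index_record_at(const uint8_t *base,
                                      uint64_t records_offset,
                                      uint64_t record_count, uint64_t i) {
    if (i >= record_count)
        return NULL;
    return base + records_offset + i * INDEX_RECORD_SIZE;
}

/* Read digest_len, which occupies bytes 4..5 of an IndexRecord
 * (after the 4-byte hash_id), stored little-endian. */
static uint16_t index_record_digest_len(const uint8_t *rec) {
    return (uint16_t)(rec[4] | (rec[5] << 8));
}
```

Because every `IndexRecord` is exactly 48 bytes, record `i` is found in constant time without scanning earlier records.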
---
## 12. Summary of Encoding Guarantees
The ENC/ASL-CORE-INDEX/1 specification ensures:
1. **Deterministic layout** across platforms
2. **Direct mapping from semantic model** (ArtifactKey → ArtifactLocation)
3. **Immutability of sealed segments**
4. **Integrity validation** via CRC
5. **Forward-compatible extensibility**
---
## 13. Relationship to Other Layers
| Layer                | Responsibility                                             |
| -------------------- | ---------------------------------------------------------- |
| ASL/1-CORE-INDEX     | Defines semantic meaning of artifact → location mapping    |
| ASL/STORE-INDEX/1    | Defines lifecycle, visibility, and replay contracts        |
| ASL/INDEX-ACCEL/1    | Defines routing, filters, sharding (observationally inert) |
| ENC/ASL-CORE-INDEX/1 | Defines exact bytes-on-disk format for segment persistence |
This completes the stack: **semantics → store behavior → encoding**.