# ASL/STORE-INDEX/1 — Store Semantics and Contracts for ASL Core Index
Status: Draft
Owner: Niklas Rydberg
Version: 0.1.0
SoT: No
Last Updated: 2025-11-16
Linked Phase Pack: N/A
Tags: [deterministic, index, log, storage]
<!-- Source: /amduat-api/tier1/asl-store-index.md | Canonical: /amduat/tier1/asl-store-index-1.md -->
**Document ID:** `ASL/STORE-INDEX/1`
**Layer:** L1 — Store lifecycle and replay contracts (no encoding)
**Depends on (normative):**
* `ASL/1-CORE-INDEX` — semantic index model
* `ASL/LOG/1` — append-only log semantics
**Informative references:**
* `ENC/ASL-CORE-INDEX/1` — index segment encoding
* `ASL/SYSTEM/1` — unified system view (PEL/TGK/federation alignment)
* `TGK/1` — TGK semantics and visibility alignment
* `TGK/1-CORE` — EdgeBody and EdgeTypeId definitions
© 2025 Niklas Rydberg.
## License
Except where otherwise noted, this document (text and diagrams) is licensed under
the Creative Commons Attribution 4.0 International License (CC BY 4.0).
The identifier registries and mapping tables (e.g. TypeTag IDs, HashId
assignments, EdgeTypeId tables) are additionally made available under CC0 1.0
Universal (CC0) to enable unrestricted reuse in implementations and derivative
specifications.
Code examples in this document are provided under the Apache License 2.0 unless
explicitly stated otherwise. Test vectors, where present, are dedicated to the
public domain under CC0 1.0.
---
## 1. Purpose
This document defines the **operational and store-level semantics** required to implement ASL-CORE-INDEX.
It specifies:
* **Block lifecycle**: creation, sealing, retention, GC
* **Index segment lifecycle**: creation, append, seal, visibility
* **Snapshot identity and log positions** for deterministic replay
* **Append-only log semantics**
* **Lookup, visibility, and crash recovery rules**
* **Small vs large block handling**
It **does not define encoding** (see `ENC/ASL-CORE-INDEX/1`) or semantic mapping (see `ASL/1-CORE-INDEX`).
**Implementation note:** A degenerate store that skips segments/log replay (for
example, simple filesystem backends) is non-conformant to ASL/STORE-INDEX/1 and
is intended only for quickstart or legacy use.
---
## 2. Scope
Covers:
* Lifecycle of **blocks** and **index entries**
* Snapshot and CURRENT consistency guarantees
* Deterministic replay and recovery
* GC and tombstone semantics
* Packing policy for small vs large artifacts
Excludes:
* Disk-level encoding
* Sharding or acceleration strategies (see `ASL/INDEX-ACCEL/1`)
* Memory residency or caching
* Federation, PEL, or TGK semantics (see `TGK/1` and `TGK/1-CORE`)
---
## 3. Core Concepts
### 3.1 Block
* **Definition:** Immutable storage unit containing artifact bytes.
* **Identifier:** BlockID (opaque, unique).
* **Properties:**
  * Once sealed, contents never change.
  * Can be referenced by multiple artifacts.
  * May be pinned by snapshots for retention.
  * Allocation method is implementation-defined (e.g., hash or sequence).
### 3.2 Index Segment
Segments group index entries and provide **persistence and recovery units**.
* **Open segment:** accepting new index entries, not visible for lookup.
* **Sealed segment:** closed for append, log-visible, snapshot-pinnable.
* **Segment components:** header, optional bloom filter, index records, footer.
* **Segment visibility:** only after seal and log append.
### 3.3 Append-Only Log
All store-visible mutations are recorded in a **strictly ordered, append-only log**:
* Entries include:
  * Index additions
  * Tombstones
  * Segment seals
* Log is replayable to reconstruct CURRENT.
* Log semantics are defined in `ASL/LOG/1`.
### 3.4 Snapshot Identity and Log Position
To make CURRENT referenceable and replayable, ASL-STORE-INDEX defines:
* **SnapshotID**: opaque, immutable identifier for a snapshot.
* **LogPosition**: monotonic integer position in the append-only log.
* **IndexState**: `(SnapshotID, LogPosition)`.
Deterministic replay is defined as:
```
Index(SnapshotID, LogPosition) = Snapshot[SnapshotID] + replay(log[0:LogPosition])
```
Snapshots and log positions are required for checkpointing, federation, and deterministic recovery.
**Implementation note (determinism):** This repository interprets `LogPosition`
as the inclusive `logseq` upper bound defined by `ASL/LOG/1`, not a byte offset
into the log file. Snapshot anchors use their record `logseq` as the snapshot's
log position.
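The replay equation can be illustrated with a minimal sketch. Everything named here is hypothetical: `LogRecord`, `Snapshot`, and `index_at` are not defined by this specification, a `location` of `None` stands in for a tombstone, and segment sealing is ignored. The sketch assumes that records at or below the snapshot's anchor `logseq` are already reflected in the snapshot, and treats `log_position` as an inclusive upper bound, following the implementation note above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LogRecord:
    logseq: int              # position in the append-only log
    key: str                 # hypothetical stand-in for ArtifactKey
    location: Optional[str]  # None models a tombstone in this sketch


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    logseq: int              # anchor: records <= logseq are already in `entries`
    entries: Dict[str, str]


def index_at(snapshot: Snapshot, log: List[LogRecord], log_position: int) -> Dict[str, str]:
    """Rebuild the index view for IndexState (snapshot_id, log_position)."""
    state = dict(snapshot.entries)  # copy so the snapshot itself is never mutated
    for rec in sorted(log, key=lambda r: r.logseq):
        if rec.logseq <= snapshot.logseq or rec.logseq > log_position:
            continue  # already captured by the snapshot, or beyond the bound
        if rec.location is None:
            state.pop(rec.key, None)  # tombstone hides the mapping
        else:
            state[rec.key] = rec.location
    return state


snap = Snapshot("snap-1", logseq=2, entries={"a": "blk1"})
log = [LogRecord(1, "a", "blk1"), LogRecord(3, "b", "blk2"), LogRecord(4, "a", None)]
```

Calling `index_at(snap, log, 3)` and `index_at(snap, log, 4)` yields different views from the same snapshot, which is why both halves of `IndexState` are needed to name a state.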
### 3.5 Artifact Location
* **ArtifactExtent**: `(BlockID, offset, length)` identifying a byte slice within a block.
* **ArtifactLocation**: ordered list of `ArtifactExtent` values that, when concatenated, produce the artifact bytes.
* Multi-extent locations allow a single artifact to be striped across multiple blocks.
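As an illustration of how a multi-extent location resolves to bytes, the following sketch concatenates extents in order. The `blocks` mapping and the `read_location` helper are assumptions made for this example only; a real store would read from sealed block files.

```python
from typing import Dict, List, NamedTuple


class ArtifactExtent(NamedTuple):
    block_id: str
    offset: int
    length: int


def read_location(blocks: Dict[str, bytes], location: List[ArtifactExtent]) -> bytes:
    """Concatenate the byte slices named by each extent, in list order."""
    out = bytearray()
    for ext in location:
        block = blocks[ext.block_id]
        end = ext.offset + ext.length
        if ext.offset < 0 or ext.length < 0 or end > len(block):
            raise ValueError(f"extent out of range for block {ext.block_id}")
        out += block[ext.offset:end]
    return bytes(out)


# Hypothetical in-memory blocks; the artifact is striped across both.
blocks = {"b1": b"hello world", "b2": b"--ASL--"}
location = [ArtifactExtent("b1", 0, 5), ArtifactExtent("b2", 2, 3)]
```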
---
## 4. PUT/GET Contract (Normative)
### 4.1 PUT Signature
```
put(artifact) -> (ArtifactKey, IndexState)
```
* `ArtifactKey` is the content identity (ASL/1-CORE-INDEX).
* `IndexState = (SnapshotID, LogPosition)` after the PUT is admitted.
### 4.2 PUT Semantics
1. **Structural registration (if applicable)**: if a structural index (SID -> DAG) exists, it MUST register the artifact and reuse existing SID entries.
2. **Materialization (if applicable)**: if the artifact is lazy, materialize deterministically to derive `ArtifactKey`.
3. **Deduplication**: lookup `ArtifactKey` at CURRENT. If present, PUT MUST succeed without writing bytes or adding a new index entry.
4. **Storage**: if absent, write bytes to one or more blocks, seal those blocks, and produce `ArtifactLocation`.
5. **Index mutation**: append an index entry mapping `ArtifactKey -> ArtifactLocation` and record visibility via log order.
### 4.3 PUT Guarantees
* PUT is idempotent for identical artifacts.
* No visible index entry points to mutable or missing bytes.
* Visibility follows log order and seal rules defined in this document.
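A minimal in-memory sketch of the deduplication, storage, and index-mutation steps follows. It is not a conformant store: `ToyStore` is hypothetical, the SHA-256 key derivation is an assumption (the actual `ArtifactKey` definition lives in `ASL/1-CORE-INDEX`), structural registration and materialization are skipped, and each artifact gets its own block that is treated as sealed once written.

```python
import hashlib
from typing import Dict, List, Tuple

Extent = Tuple[str, int, int]  # (BlockID, offset, length)


class ToyStore:
    """In-memory stand-in: fixed SnapshotID, one block per artifact."""

    def __init__(self) -> None:
        self.snapshot_id = "snap-0"
        self.blocks: Dict[str, bytes] = {}        # BlockID -> sealed bytes
        self.log: List[Tuple[str, Extent]] = []   # ordered (key, extent) records
        self.index: Dict[str, Extent] = {}        # CURRENT view

    def put(self, artifact: bytes) -> Tuple[str, Tuple[str, int]]:
        key = hashlib.sha256(artifact).hexdigest()  # assumed key derivation
        if key not in self.index:                   # deduplicate against CURRENT
            block_id = f"blk-{len(self.blocks)}"
            self.blocks[block_id] = bytes(artifact)  # write, then treat as sealed
            extent = (block_id, 0, len(artifact))
            self.log.append((key, extent))           # log order defines visibility
            self.index[key] = extent
        # LogPosition here is the count of admitted records (1-based logseq).
        return key, (self.snapshot_id, len(self.log))


store = ToyStore()
key_a, state_a = store.put(b"abc")
key_b, state_b = store.put(b"abc")  # identical artifact: no new bytes, no new entry
```

The second `put` returns the same key and the same `IndexState` as the first, which is the idempotence guarantee in miniature.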
### 4.4 GET Signature
```
get(ArtifactKey, IndexState?) -> bytes | NOT_FOUND
```
* `IndexState` defaults to CURRENT when omitted.
### 4.5 GET Semantics
1. Resolve `ArtifactKey -> ArtifactLocation` using `Index(SnapshotID, LogPosition)`.
2. If no entry exists, return `NOT_FOUND`.
3. Otherwise, read exactly the bytes referenced by each `(BlockID, offset, length)` extent of the `ArtifactLocation`, in order, and return their concatenation verbatim.
GET MUST NOT mutate state or trigger materialization.
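The GET steps can be sketched as a pure function over a log and a set of sealed blocks. The representation is an assumption for illustration: the snapshot is taken to be empty, the log is a list of `(key, location)` records whose 1-based list position plays the role of `logseq`, and omitting `log_position` means CURRENT.

```python
from typing import Dict, List, Optional, Tuple

Extent = Tuple[str, int, int]              # (BlockID, offset, length)
Record = Tuple[str, List[Extent]]          # (ArtifactKey, ArtifactLocation)
NOT_FOUND = None


def get(blocks: Dict[str, bytes], log: List[Record], key: str,
        log_position: Optional[int] = None) -> Optional[bytes]:
    """Resolve `key` at the given log position without mutating anything."""
    bound = len(log) if log_position is None else log_position
    location: Optional[List[Extent]] = None
    for logseq, (rec_key, rec_location) in enumerate(log, start=1):
        if logseq > bound:
            break                          # records above the bound are invisible
        if rec_key == key:
            location = rec_location        # later records shadow earlier ones
    if location is None:
        return NOT_FOUND
    return b"".join(blocks[b][off:off + n] for b, off, n in location)


blocks = {"b0": b"abcdef"}
log = [("k1", [("b0", 0, 3)]), ("k2", [("b0", 3, 3)])]
```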
### 4.6 Failure Semantics
* Partial writes MUST NOT become visible.
* Replay of snapshot + log after crash MUST reconstruct a valid CURRENT.
* Implementations MAY use caching, but MUST preserve determinism.
---
## 5. Block Lifecycle Semantics
| Event | Description | Semantic Guarantees |
| ------------------ | ------------------------------------- | ------------------------------------------------------------- |
| Creation | Block allocated; bytes may be written | Not visible to index until sealed |
| Sealing | Block is finalized and immutable | Sealed blocks are stable and safe to reference from index |
| Retention | Block remains accessible | Blocks referenced by snapshots or CURRENT must not be removed |
| Garbage Collection | Block may be deleted | Only unpinned, unreachable blocks may be removed |
Notes:
* Sealing ensures any index entry referencing the block is immutable.
* Retention is driven by snapshot and log visibility rules.
* GC must **never violate CURRENT reconstruction guarantees**.
---
## 6. Segment Lifecycle Semantics
### 6.1 Creation
* Open segment is allocated.
* Index entries appended in log order.
* Entries are invisible until segment seal and log append.
### 6.2 Seal
* Segment is closed to append.
* Seal record is written to append-only log.
* Segment becomes visible for lookup.
* Sealed segment may be snapshot-pinned.
### 6.3 Snapshot Interaction
* Snapshots capture sealed segments.
* Open segments need not survive a snapshot.
* Segments at or below the snapshot's log position serve as replay anchors.
### 6.3.1 Segment State Machine (Informative)
```
OPEN -> SEALED -> VISIBLE -> GC_ELIGIBLE
```
* **OPEN:** accepting new index records; not visible.
* **SEALED:** immutable on disk; not visible until log-admitted.
* **VISIBLE:** seal record admitted by log replay; visible for lookup.
* **GC_ELIGIBLE:** no snapshots/log positions reference the segment.
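One way to encode this state machine is an enum with a forward-only transition table. This is a sketch only: `SegmentState`, `advance`, and `is_lookup_visible` are hypothetical names, and the table assumes the linear progression shown in the diagram with no backward transitions.

```python
from enum import Enum


class SegmentState(Enum):
    OPEN = "OPEN"
    SEALED = "SEALED"
    VISIBLE = "VISIBLE"
    GC_ELIGIBLE = "GC_ELIGIBLE"


# Forward-only transitions, mirroring OPEN -> SEALED -> VISIBLE -> GC_ELIGIBLE.
_NEXT = {
    SegmentState.OPEN: SegmentState.SEALED,
    SegmentState.SEALED: SegmentState.VISIBLE,
    SegmentState.VISIBLE: SegmentState.GC_ELIGIBLE,
}


def advance(state: SegmentState) -> SegmentState:
    """Return the successor state, or raise if the state is terminal."""
    if state not in _NEXT:
        raise ValueError(f"{state.name} has no successor")
    return _NEXT[state]


def is_lookup_visible(state: SegmentState) -> bool:
    """Only log-admitted segments participate in lookup."""
    return state is SegmentState.VISIBLE
```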
### 6.4 Index/Log Bootstrap Flow (Informative)
1. **Initialize store**: load latest snapshot anchor (if any); otherwise start
with an empty index.
2. **Load sealed segments**: from snapshot metadata, locate segment files and
verify their hashes before admitting them.
3. **Replay log**: scan records with `logseq > snapshot.logseq` in order and
apply `SEGMENT_SEAL`, tombstones, and lifts.
4. **Compute CURRENT**: resolve visibility and shadowing to produce the
effective index view for queries.
This flow is deterministic and idempotent; re-running it yields the same
CURRENT state for a fixed `(SnapshotID, LogPosition)`.
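The bootstrap flow can be sketched at the segment level as follows. The data shapes are assumptions: snapshot metadata is modeled as a mapping from segment id to an expected SHA-256 digest (the specification does not name the hash), the log is a list of `(logseq, record_type, segment_id)` tuples, and tombstones and lifts are left out to keep the example short.

```python
import hashlib
from typing import Dict, List, Tuple


def bootstrap(snapshot_logseq: int,
              snapshot_segments: Dict[str, str],
              segment_files: Dict[str, bytes],
              log: List[Tuple[int, str, str]]) -> List[str]:
    """Return visible segment ids in admission order."""
    visible: List[str] = []
    # Step 2: admit snapshot segments only after their hashes verify.
    for seg_id, expected in snapshot_segments.items():
        actual = hashlib.sha256(segment_files[seg_id]).hexdigest()
        if actual != expected:
            raise ValueError(f"hash mismatch for segment {seg_id}")
        visible.append(seg_id)
    # Step 3: replay only records strictly after the snapshot anchor.
    for logseq, record_type, seg_id in sorted(log):
        if logseq <= snapshot_logseq:
            continue
        if record_type == "SEGMENT_SEAL" and seg_id not in visible:
            visible.append(seg_id)
    return visible


files = {"seg-a": b"A", "seg-b": b"B"}
snapshot_segments = {"seg-a": hashlib.sha256(b"A").hexdigest()}
log = [(7, "SEGMENT_SEAL", "seg-b"), (3, "SEGMENT_SEAL", "seg-a")]
```

Because the function reads its inputs without modifying them, running it twice over the same inputs returns the same list, matching the idempotence claim above.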
---
## 7. Visibility and Lookup Semantics
### 7.1 Visibility Rules
* Entry visible **iff**:
  * The block is sealed.
  * Log record exists at position ≤ CURRENT.
  * Segment seal recorded in log.
* Entries above CURRENT or referencing unsealed blocks are invisible.
### 7.2 Lookup Semantics
To resolve an `ArtifactKey`:
1. Identify all visible segments ≤ CURRENT.
2. Search segments in **reverse seal-log order** (highest seal log position first).
3. Return first matching entry.
4. Respect tombstones to shadow prior entries.
Determinism:
* Lookup results are identical across platforms given the same snapshot and log position.
* Accelerations (bloom filters, sharding, SIMD) **do not alter correctness**.
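The lookup steps can be sketched as a scan over segments ordered by seal position. The representation is assumed for illustration: each segment is a `(seal_logseq, entries)` pair, and a tombstone is modeled as a sentinel value stored in a segment's entries, which is one possible reading of the rule that tombstones follow the same visibility rules as index entries.

```python
from typing import Dict, List, Optional, Tuple

TOMBSTONE = object()  # sentinel marking a shadowing record in this sketch

Segment = Tuple[int, Dict[str, object]]  # (seal_logseq, {key: location or TOMBSTONE})


def lookup(segments: List[Segment], key: str, current: int) -> Optional[object]:
    """Return the newest visible mapping for `key`, or None if absent or shadowed."""
    visible = [seg for seg in segments if seg[0] <= current]
    # Highest seal log position first, so the first hit is the effective one.
    for _, entries in sorted(visible, key=lambda seg: seg[0], reverse=True):
        if key in entries:
            hit = entries[key]
            return None if hit is TOMBSTONE else hit
    return None


segments = [
    (1, {"a": "loc1"}),
    (2, {"a": TOMBSTONE, "b": "loc2"}),
    (5, {"a": "loc3"}),
]
```

A bloom filter or shard map would only change which segments are probed, not which entry wins, so the result stays the same.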
---
## 8. Snapshot Interaction
* Snapshots capture the set of **sealed blocks** and **sealed index segments** at a point in time.
* Blocks referenced by a snapshot are **pinned** and cannot be garbage-collected until snapshot expiration.
* CURRENT is reconstructed as:
```
CURRENT = snapshot_state + replay(log)
```
Segment and block visibility rules:
| Entity | Visible in snapshot | Visible in CURRENT |
| -------------------- | ---------------------------- | ------------------------------ |
| Open segment/block | No | Only after seal and log append |
| Sealed segment/block | Yes, if included in snapshot | Yes, replayed from log |
| Tombstone | Yes, if log-recorded | Yes, shadows prior entries |
---
## 9. Garbage Collection
Eligibility for GC:
* Segments: sealed, no references from CURRENT or snapshots.
* Blocks: unpinned, unreferenced by any segment or artifact.
Rules:
* GC is safe **only on sealed segments and blocks**.
* Must respect snapshot pins.
* Tombstones may aid in invalidating unreachable blocks.
* Snapshots retained for provenance or receipt verification MUST remain pinned.
Outcome:
* GC never violates CURRENT reconstruction.
* Blocks can be reclaimed without breaking provenance.
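The block eligibility rule amounts to a set difference between sealed blocks and everything still referenced. A minimal sketch, assuming hypothetical inputs: the set of sealed BlockIDs, the set referenced from CURRENT, and a mapping from each retained snapshot to the blocks it pins.

```python
from typing import Dict, List, Set


def gc_candidates(sealed_blocks: Set[str],
                  current_refs: Set[str],
                  snapshot_pins: Dict[str, Set[str]]) -> List[str]:
    """Return sealed blocks that neither CURRENT nor any snapshot references."""
    live = set(current_refs)
    for pinned in snapshot_pins.values():
        live |= pinned                     # every retained snapshot keeps its pins
    return sorted(b for b in sealed_blocks if b not in live)


sealed = {"b1", "b2", "b3", "b4"}
current_refs = {"b1"}
pins = {"s1": {"b2"}, "s2": {"b1", "b3"}}
```

Dropping snapshot `s2` from `pins` releases `b3` but not `b1`, because CURRENT still references it.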
---
## 10. Tombstone Semantics
* Optional marker to invalidate prior mappings.
* Visibility rules identical to regular index entries.
* Used to maintain deterministic CURRENT in the face of shadowing or deletions.
* `scope` and `reason_code` are policy metadata only; they do not affect
shadowing order or replay determinism.
* Tombstone lifts cancel only the referenced tombstone record for the same
artifact; other tombstones remain effective until lifted.
* Snapshot + log replay applies tombstones and lifts in `logseq` order; a lift
that occurs after a snapshot becomes effective only when replay reaches its
`logseq`.
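The interaction of tombstones and lifts under replay can be sketched as follows. The record shape `(logseq, kind, artifact_key, ref_logseq)` is an assumption for this example; the actual record layout is defined by `ASL/LOG/1`. A lift names the `logseq` of the tombstone it cancels and takes effect only if the artifact matches.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[int, str, str, Optional[int]]  # (logseq, kind, key, ref_logseq)


def effective_tombstones(records: List[Record], log_position: int) -> Dict[int, str]:
    """Replay in logseq order and return {tombstone_logseq: key} still in effect."""
    active: Dict[int, str] = {}
    for logseq, kind, key, ref in sorted(records, key=lambda r: r[0]):
        if logseq > log_position:
            break                           # not yet reached by replay
        if kind == "TOMBSTONE":
            active[logseq] = key
        elif kind == "LIFT" and ref is not None and active.get(ref) == key:
            del active[ref]                 # cancels only the referenced tombstone
    return active


def is_shadowed(records: List[Record], key: str, log_position: int) -> bool:
    return key in effective_tombstones(records, log_position).values()


records = [
    (1, "TOMBSTONE", "a", None),
    (2, "TOMBSTONE", "a", None),
    (3, "LIFT", "a", 1),
    (4, "LIFT", "a", 2),
]
```

At position 3 the artifact is still shadowed, because the lift at `logseq` 3 cancels only the first tombstone; it becomes visible again once replay reaches position 4.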
---
## 11. Small vs Large Block Handling
### 11.1 Definitions
| Term | Meaning |
| ----------------- | --------------------------------------------------------------------- |
| **Small block** | Block containing artifact bytes below a threshold `T_small`. |
| **Large block** | Block containing artifact bytes ≥ `T_small`. |
| **Mixed segment** | Segment containing both small and large blocks (discouraged). |
| **Packing** | Combining multiple small artifacts into a single physical block. |
| **BlockID** | Opaque identifier for a block; addressing is identical for all sizes. |
Small vs large classification is **store-level only** and transparent to ASL-CORE and index layers.
`T_small` is configurable per deployment.
### 11.2 Packing Rules
1. **Small blocks may be packed together** to reduce storage overhead.
2. **Large blocks are never packed with other artifacts**.
3. Mixed segments are **allowed but discouraged**; implementations MAY warn when mixing occurs.
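A possible packing policy consistent with rules 1 and 2 is sketched below. The `capacity` parameter and the first-fit strategy are assumptions introduced for the example; this specification only requires that large artifacts are never packed with others. Block ids are list positions here purely for brevity.

```python
from typing import Dict, List, Optional, Tuple


def pack(artifacts: List[Tuple[str, bytes]], t_small: int,
         capacity: int) -> Tuple[List[bytes], Dict[str, Tuple[int, int, int]]]:
    """Return (blocks, {key: (block_index, offset, length)})."""
    blocks: List[bytearray] = []
    locations: Dict[str, Tuple[int, int, int]] = {}
    open_small: Optional[int] = None       # small block currently being filled
    for key, data in artifacts:
        if len(data) >= t_small:
            blocks.append(bytearray(data))  # large: always a dedicated block
            locations[key] = (len(blocks) - 1, 0, len(data))
            continue
        if open_small is None or len(blocks[open_small]) + len(data) > capacity:
            blocks.append(bytearray())      # start a fresh packed block
            open_small = len(blocks) - 1
        offset = len(blocks[open_small])
        blocks[open_small] += data
        locations[key] = (open_small, offset, len(data))
    return [bytes(b) for b in blocks], locations


artifacts = [("a", b"aa"), ("big", b"X" * 10), ("b", b"bbb"), ("c", b"cccc")]
packed_blocks, packed_locations = pack(artifacts, t_small=8, capacity=6)
```

Here `a` and `b` share one block, `big` gets its own, and `c` starts a new packed block because it would exceed the assumed capacity.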
### 11.3 Segment Allocation Rules
1. Small blocks are allocated into segments optimized for packing efficiency.
2. Large blocks are allocated into segments optimized for sequential I/O.
3. Segment sealing and visibility rules remain unchanged.
### 11.4 Indexing and Addressing
All blocks are addressed uniformly:
```
ArtifactExtent = (BlockID, offset, length)
ArtifactLocation = [ArtifactExtent...]
```
Packing does **not** affect index semantics or determinism. Multi-extent ArtifactLocations are allowed.
### 11.5 GC and Retention
1. Packed small blocks can be reclaimed only when **all contained artifacts** are unreachable.
2. Large blocks are reclaimed per block.
Invariant: GC must never remove bytes still referenced by CURRENT or snapshots.
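The reclamation rule for packed blocks can be sketched as a check that no artifact stored in a block is still reachable. The `block_contents` mapping is a hypothetical bookkeeping structure for this example; a large block is simply the case where the set has one element.

```python
from typing import Dict, List, Set


def reclaimable_blocks(block_contents: Dict[str, Set[str]],
                       reachable_keys: Set[str]) -> List[str]:
    """Return blocks whose contained artifacts are all unreachable."""
    return sorted(
        block_id
        for block_id, keys in block_contents.items()
        if not (keys & reachable_keys)     # any reachable artifact keeps the block
    )


block_contents = {
    "packed-1": {"a", "b"},
    "packed-2": {"c"},
    "large-1": {"d"},
}
```

With `a` and `d` reachable, only `packed-2` can go; `packed-1` must stay even though `b` is dead, which is the storage cost of packing.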
---
## 12. Crash and Recovery Semantics
* Open segments or unsealed blocks may be lost; no invariant is broken.
* Recovery procedure:
  1. Mount last checkpoint snapshot.
  2. Replay append-only log from checkpoint.
  3. Reconstruct CURRENT.
* Recovery is **deterministic and idempotent**.
* Segments and blocks are **never partially visible** after a crash.
---
## 13. Normative Invariants
1. Sealed blocks are immutable.
2. Index entries referencing blocks are immutable once visible.
3. Shadowing follows strict log order.
4. Replay of snapshot + log uniquely reconstructs CURRENT.
5. GC cannot remove blocks or segments needed by snapshot or CURRENT.
6. Tombstones shadow prior entries without deleting underlying blocks prematurely.
7. IndexState `(SnapshotID, LogPosition)` uniquely identifies CURRENT.
---
### 13.1 Conformance Checklist (Informative)
* Reject visibility for any entry not admitted by replay.
* Enforce immutability of sealed blocks and visible segments.
* Ensure replay is deterministic and idempotent for a fixed index state.
* Verify tombstone + lift behavior across snapshots.
* Prevent GC of segments/blocks referenced by CURRENT or snapshots.
---
## 14. Non-Goals
* Disk-level encoding (ENC-ASL-CORE-INDEX).
* Memory layout or caching.
* Sharding or performance heuristics.
* Federation / multi-domain semantics (handled elsewhere).
* Block packing strategies beyond the policy rules here.
---
## 15. Relationship to Other Layers
| Layer | Responsibility |
| ------------------ | ---------------------------------------------------------------------------- |
| ASL-CORE | Artifact semantics, existence of blocks, immutability |
| ASL-CORE-INDEX | Semantic mapping of ArtifactKey → ArtifactLocation |
| ASL-STORE-INDEX | Lifecycle and operational contracts for blocks and segments |
| ENC-ASL-CORE-INDEX | Bytes-on-disk layout for segments, index records, and optional bloom filters |
---
## 16. Summary
The tier1 ASL-STORE-INDEX specification:
* Defines **block lifecycle** and **segment lifecycle**.
* Makes **snapshot identity and log positions** explicit for replay.
* Ensures deterministic visibility, lookup, and crash recovery.
* Formalizes GC safety and tombstone behavior.
* Adds clear **small vs large block** handling without changing core semantics.