amduat/tier1/asl-store-index-1.md
2026-01-17 11:46:57 +01:00

ASL/STORE-INDEX/1 — Store Semantics and Contracts for ASL Core Index

Status: Draft
Owner: Niklas Rydberg
Version: 0.1.0
SoT: No
Last Updated: 2025-11-16
Linked Phase Pack: N/A
Tags: [deterministic, index, log, storage]

Document ID: ASL/STORE-INDEX/1
Layer: L1 — Store lifecycle and replay contracts (no encoding)

Depends on (normative):

  • ASL/1-CORE-INDEX — semantic index model
  • ASL/LOG/1 — append-only log semantics

Informative references:

  • ENC/ASL-CORE-INDEX/1 — index segment encoding
  • ASL/SYSTEM/1 — unified system view (PEL/TGK/federation alignment)
  • TGK/1 — TGK semantics and visibility alignment
  • TGK/1-CORE — EdgeBody and EdgeTypeId definitions

© 2025 Niklas Rydberg.

License

Except where otherwise noted, this document (text and diagrams) is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

The identifier registries and mapping tables (e.g. TypeTag IDs, HashId assignments, EdgeTypeId tables) are additionally made available under CC0 1.0 Universal (CC0) to enable unrestricted reuse in implementations and derivative specifications.

Code examples in this document are provided under the Apache License 2.0 unless explicitly stated otherwise. Test vectors, where present, are dedicated to the public domain under CC0 1.0.


1. Purpose

This document defines the operational and store-level semantics required to implement ASL-CORE-INDEX.

It specifies:

  • Block lifecycle: creation, sealing, retention, GC
  • Index segment lifecycle: creation, append, seal, visibility
  • Snapshot identity and log positions for deterministic replay
  • Append-only log semantics
  • Lookup, visibility, and crash recovery rules
  • Small vs large block handling

It does not define encoding (see ENC/ASL-CORE-INDEX/1) or semantic mapping (see ASL/1-CORE-INDEX).

2. Scope

Covers:

  • Lifecycle of blocks and index entries
  • Snapshot and CURRENT consistency guarantees
  • Deterministic replay and recovery
  • GC and tombstone semantics
  • Packing policy for small vs large artifacts

Excludes:

  • Disk-level encoding
  • Sharding or acceleration strategies (see ASL/INDEX-ACCEL/1)
  • Memory residency or caching
  • Federation, PEL, or TGK semantics (see TGK/1 and TGK/1-CORE)

3. Core Concepts

3.1 Block

  • Definition: Immutable storage unit containing artifact bytes.

  • Identifier: BlockID (opaque, unique).

  • Properties:

    • Once sealed, contents never change.
    • Can be referenced by multiple artifacts.
    • May be pinned by snapshots for retention.
    • Allocation method is implementation-defined (e.g., hash or sequence).

3.2 Index Segment

Segments group index entries and provide persistence and recovery units.

  • Open segment: accepting new index entries, not visible for lookup.
  • Sealed segment: closed for append, log-visible, snapshot-pinnable.
  • Segment components: header, optional bloom filter, index records, footer.
  • Segment visibility: only after seal and log append.

3.3 Append-Only Log

All store-visible mutations are recorded in a strictly ordered, append-only log:

  • Entries include:

    • Index additions
    • Tombstones
    • Segment seals
  • Log is replayable to reconstruct CURRENT.

  • Log semantics are defined in ASL/LOG/1.

3.4 Snapshot Identity and Log Position

To make CURRENT referenceable and replayable, ASL-STORE-INDEX defines:

  • SnapshotID: opaque, immutable identifier for a snapshot.
  • LogPosition: monotonic integer position in the append-only log.
  • IndexState: (SnapshotID, LogPosition).

Deterministic replay is defined as:

Index(SnapshotID, LogPosition) = Snapshot[SnapshotID] + replay(log[0:LogPosition])

Snapshots and log positions are required for checkpointing, federation, and deterministic recovery.

Implementation note (determinism): This repository interprets LogPosition as the inclusive logseq upper bound defined by ASL/LOG/1, not a byte offset into the log file. Snapshot anchors use their record logseq as the snapshot's log position.
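
Informative sketch (non-normative): the replay definition above can be illustrated with a small in-memory model. The names `LogRecord` and `index_at` are hypothetical, the index is reduced to a plain dictionary, and the sketch assumes that records at or below the snapshot's own logseq are already reflected in the snapshot, so only later records up to the inclusive LogPosition are replayed. It also assumes the log list is ordered by logseq.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    logseq: int    # monotonic log position (LogPosition space)
    key: str       # stand-in for an ArtifactKey
    location: str  # stand-in for an ArtifactLocation


def index_at(snapshot: dict, snapshot_logseq: int, log: list, log_position: int) -> dict:
    """Return the index state for (snapshot, log_position)."""
    index = dict(snapshot)  # copy: a snapshot is immutable
    for rec in log:
        # Replay only records after the snapshot anchor, up to the
        # inclusive upper bound given by log_position.
        if snapshot_logseq < rec.logseq <= log_position:
            index[rec.key] = rec.location
    return index


snapshot = {"a": "loc-a"}
log = [
    LogRecord(1, "a", "loc-a"),
    LogRecord(2, "b", "loc-b"),
    LogRecord(3, "c", "loc-c"),
]
state = index_at(snapshot, 1, log, 2)
```

Calling `index_at` twice with the same arguments yields the same dictionary, which is the property the IndexState pair is meant to guarantee.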

3.5 Artifact Location

  • ArtifactExtent: (BlockID, offset, length) identifying a byte slice within a block.
  • ArtifactLocation: ordered list of ArtifactExtent values that, when concatenated, produce the artifact bytes.
  • Multi-extent locations allow a single artifact to be striped across multiple blocks.
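
Informative sketch (non-normative): one way to model extents and to assemble artifact bytes from a multi-extent location. The field names `block_id`, `offset`, and `length`, the helper `read_location`, and the representation of sealed blocks as a dictionary of bytes are assumptions made for illustration only.

```python
from typing import NamedTuple


class ArtifactExtent(NamedTuple):
    block_id: str  # opaque BlockID
    offset: int    # byte offset within the block
    length: int    # number of bytes in the slice


def read_location(blocks: dict, location: list) -> bytes:
    """Concatenate the extents of an ArtifactLocation in order."""
    out = bytearray()
    for ext in location:
        data = blocks[ext.block_id]
        end = ext.offset + ext.length
        # Reject extents that point outside the block instead of
        # silently returning truncated bytes.
        if ext.offset < 0 or ext.length < 0 or end > len(data):
            raise ValueError(f"extent out of range for block {ext.block_id}")
        out += data[ext.offset:end]
    return bytes(out)


blocks = {"b1": b"hello world", "b2": b"XXstripedYY"}
location = [ArtifactExtent("b1", 0, 6), ArtifactExtent("b2", 2, 7)]
artifact = read_location(blocks, location)
```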

4. PUT/GET Contract (Normative)

4.1 PUT Signature

put(artifact) -> (ArtifactKey, IndexState)
  • ArtifactKey is the content identity (ASL/1-CORE-INDEX).
  • IndexState = (SnapshotID, LogPosition) after the PUT is admitted.

4.2 PUT Semantics

  1. Structural registration (if applicable): if a structural index (SID -> DAG) exists, PUT MUST register the artifact in it and reuse existing SID entries.
  2. Materialization (if applicable): if the artifact is lazy, materialize deterministically to derive ArtifactKey.
  3. Deduplication: lookup ArtifactKey at CURRENT. If present, PUT MUST succeed without writing bytes or adding a new index entry.
  4. Storage: if absent, write bytes to one or more blocks, seal those blocks, and produce ArtifactLocation.
  5. Index mutation: append an index entry mapping ArtifactKey -> ArtifactLocation and record visibility via log order.

4.3 PUT Guarantees

  • PUT is idempotent for identical artifacts.
  • No visible index entry points to mutable or missing bytes.
  • Visibility follows log order and seal rules defined in this document.
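
Informative sketch (non-normative): a minimal in-memory PUT that shows deduplication and idempotence. It covers steps 3 to 5 only and omits structural registration and materialization. The class `MemoryStore`, the use of SHA-256 as a stand-in for ArtifactKey derivation, the `blk-N` block naming, and the fixed `snap-0` SnapshotID are all assumptions, not part of this specification.

```python
import hashlib


class MemoryStore:
    def __init__(self) -> None:
        self.snapshot_id = "snap-0"  # hypothetical SnapshotID
        self.blocks = {}             # BlockID -> sealed bytes
        self.index = {}              # ArtifactKey -> list of (BlockID, offset, length)
        self.log = []                # append-only list of (logseq, ArtifactKey)

    def put(self, artifact: bytes):
        # Stand-in for the content identity defined by ASL/1-CORE-INDEX.
        key = hashlib.sha256(artifact).hexdigest()
        if key not in self.index:  # deduplication against CURRENT
            block_id = f"blk-{len(self.blocks)}"
            # Storing an immutable bytes object models "write, then seal".
            self.blocks[block_id] = bytes(artifact)
            self.index[key] = [(block_id, 0, len(artifact))]
            self.log.append((len(self.log) + 1, key))
        # IndexState after the PUT is admitted.
        return key, (self.snapshot_id, len(self.log))


store = MemoryStore()
key_1, state_1 = store.put(b"abc")
key_2, state_2 = store.put(b"abc")  # duplicate: no new block, no new log record
```

The second call returns the same key and the same IndexState as the first, because nothing was appended to the log.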

4.4 GET Signature

get(ArtifactKey, IndexState?) -> bytes | NOT_FOUND
  • IndexState defaults to CURRENT when omitted.

4.5 GET Semantics

  1. Resolve ArtifactKey -> ArtifactLocation using Index(SnapshotID, LogPosition) for the supplied IndexState.
  2. If no entry exists, return NOT_FOUND.
  3. Otherwise, read exactly the referenced (BlockID, offset, length) bytes and return them verbatim.

GET MUST NOT mutate state or trigger materialization.

4.6 Failure Semantics

  • Partial writes MUST NOT become visible.
  • Replay of snapshot + log after crash MUST reconstruct a valid CURRENT.
  • Implementations MAY use caching, but MUST preserve determinism.

5. Block Lifecycle Semantics

| Event | Description | Semantic Guarantees |
|---|---|---|
| Creation | Block allocated; bytes may be written | Not visible to index until sealed |
| Sealing | Block is finalized and immutable | Sealed blocks are stable and safe to reference from index |
| Retention | Block remains accessible | Blocks referenced by snapshots or CURRENT must not be removed |
| Garbage Collection | Block may be deleted | Only unpinned, unreachable blocks may be removed |

Notes:

  • Sealing ensures any index entry referencing the block is immutable.
  • Retention is driven by snapshot and log visibility rules.
  • GC must never violate CURRENT reconstruction guarantees.
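
Informative sketch (non-normative): the creation and sealing events can be modelled as a two-state object that accepts writes only while open and serves reads only once sealed. The class `Block` and its method names are hypothetical; retention and GC are not modelled here.

```python
class Block:
    def __init__(self, block_id: str) -> None:
        self.block_id = block_id
        self._buf = bytearray()
        self.sealed = False

    def write(self, data: bytes) -> int:
        """Append bytes to an open block and return their offset."""
        if self.sealed:
            raise RuntimeError("sealed blocks are immutable")
        offset = len(self._buf)
        self._buf += data
        return offset

    def seal(self) -> None:
        """Finalize the block; no further writes are accepted."""
        self.sealed = True

    def read(self, offset: int, length: int) -> bytes:
        """Read a slice; only sealed blocks may be referenced."""
        if not self.sealed:
            raise RuntimeError("unsealed blocks must not be referenced")
        return bytes(self._buf[offset:offset + length])


block = Block("blk-0")
first = block.write(b"abc")
second = block.write(b"de")
block.seal()
tail = block.read(second, 2)
```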

6. Segment Lifecycle Semantics

6.1 Creation

  • Open segment is allocated.
  • Index entries appended in log order.
  • Entries are invisible until segment seal and log append.

6.2 Seal

  • Segment is closed to append.
  • Seal record is written to append-only log.
  • Segment becomes visible for lookup.
  • Sealed segment may be snapshot-pinned.

6.3 Snapshot Interaction

  • Snapshots capture sealed segments.
  • Open segments need not survive a snapshot.
  • Segments sealed at or below the snapshot's log position serve as replay anchors.

7. Visibility and Lookup Semantics

7.1 Visibility Rules

  • An entry is visible iff all of the following hold:

    • The block is sealed.
    • Log record exists at position ≤ CURRENT.
    • Segment seal recorded in log.
  • Entries above CURRENT or referencing unsealed blocks are invisible.
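
Informative sketch (non-normative): the three visibility conditions expressed as a single predicate. The record type `EntryState` and its fields are hypothetical. The sketch additionally assumes that the segment seal record itself must lie at or below CURRENT, which is one reading of "segment seal recorded in log".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntryState:
    block_sealed: bool                  # referenced block has been sealed
    entry_logseq: int                   # log position of the index-entry record
    segment_seal_logseq: Optional[int]  # None while the segment is still open


def is_visible(entry: EntryState, current: int) -> bool:
    """True only if every visibility condition holds at log position `current`."""
    return (
        entry.block_sealed
        and entry.entry_logseq <= current
        and entry.segment_seal_logseq is not None
        and entry.segment_seal_logseq <= current
    )


visible_now = is_visible(EntryState(True, 3, 5), current=5)
```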

7.2 Lookup Semantics

To resolve an ArtifactKey:

  1. Identify all segments visible at CURRENT (seal log position ≤ CURRENT).
  2. Search segments in reverse seal-log order (highest seal log position first).
  3. Return the first matching entry.
  4. Respect tombstones: an effective tombstone shadows all prior entries for the key.

Determinism:

  • Lookup results are identical across platforms given the same snapshot and log prefix.
  • Accelerations (bloom filters, sharding, SIMD) do not alter correctness.
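
Informative sketch (non-normative): lookup over sealed segments in reverse seal order. Segments are modelled as `(seal_logseq, entries)` pairs, `TOMBSTONE` is a hypothetical sentinel, and the sketch assumes each segment holds at most one record per key, so ordering inside a segment is not modelled. Tombstone lifts are also out of scope here.

```python
TOMBSTONE = object()  # hypothetical marker for a tombstone record


def lookup(segments, key, current):
    """Resolve `key` against segments sealed at or below `current`.

    segments: list of (seal_logseq, entries), where entries maps a key
    to a location or to TOMBSTONE. Returns the location, or None.
    """
    visible = [seg for seg in segments if seg[0] <= current]
    # Highest seal log position first, so newer records shadow older ones.
    for _, entries in sorted(visible, key=lambda seg: seg[0], reverse=True):
        if key in entries:
            hit = entries[key]
            return None if hit is TOMBSTONE else hit
    return None


segments = [
    (1, {"k": "loc-1"}),
    (3, {"k": TOMBSTONE}),
    (5, {"k": "loc-2"}),
]
at_1 = lookup(segments, "k", 1)
at_4 = lookup(segments, "k", 4)
at_5 = lookup(segments, "k", 5)
```

The same key resolves differently at log positions 1, 4, and 5, yet each result depends only on the segment set and the chosen position.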

8. Snapshot Interaction

  • Snapshots capture the set of sealed blocks and sealed index segments at a point in time.
  • Blocks referenced by a snapshot are pinned and cannot be garbage-collected until snapshot expiration.
  • CURRENT is reconstructed as:
CURRENT = snapshot_state + replay(log)

Segment and block visibility rules:

| Entity | Visible in snapshot | Visible in CURRENT |
|---|---|---|
| Open segment/block | No | Only after seal and log append |
| Sealed segment/block | Yes, if included in snapshot | Yes, replayed from log |
| Tombstone | Yes, if log-recorded | Yes, shadows prior entries |

9. Garbage Collection

Eligibility for GC:

  • Segments: sealed, no references from CURRENT or snapshots.
  • Blocks: unpinned, unreferenced by any segment or artifact.

Rules:

  • GC is safe only on sealed segments and blocks.
  • Must respect snapshot pins.
  • Tombstones may aid in invalidating unreachable blocks.
  • Snapshots retained for provenance or receipt verification MUST remain pinned.

Outcome:

  • GC never violates CURRENT reconstruction.
  • Blocks can be reclaimed without breaking provenance.
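
Informative sketch (non-normative): block eligibility computed as a set difference. The function `reclaimable_blocks` and its inputs are hypothetical; a real store would derive the reference sets from CURRENT and from each retained snapshot.

```python
def reclaimable_blocks(all_blocks, sealed, current_refs, snapshot_pins):
    """Return sealed blocks that are neither referenced by CURRENT nor pinned.

    snapshot_pins maps a SnapshotID to the set of BlockIDs it pins.
    """
    pinned = set().union(*snapshot_pins.values())
    return {
        block_id
        for block_id in all_blocks
        if block_id in sealed
        and block_id not in current_refs
        and block_id not in pinned
    }


all_blocks = {"b1", "b2", "b3", "b4"}
sealed = {"b1", "b2", "b3"}      # b4 is still open and never eligible
current_refs = {"b1"}
pins = {"snap-1": {"b2"}}
eligible = reclaimable_blocks(all_blocks, sealed, current_refs, pins)
```

Once `snap-1` expires and its pin set is dropped, `b2` becomes eligible as well.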

10. Tombstone Semantics

  • Optional marker to invalidate prior mappings.
  • Visibility rules identical to regular index entries.
  • Used to maintain deterministic CURRENT in the face of shadowing or deletions.
  • scope and reason_code are policy metadata only; they do not affect shadowing order or replay determinism.
  • Tombstone lifts cancel only the referenced tombstone record for the same artifact; other tombstones remain effective until lifted.
  • Snapshot + log replay applies tombstones and lifts in logseq order; a lift that occurs after a snapshot becomes effective only when replay reaches its logseq.
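
Informative sketch (non-normative): tombstones and lifts applied in logseq order. The record shape `(logseq, kind, key, ref)` is hypothetical, with `ref` holding the logseq of the tombstone that a lift cancels. The `scope` and `reason_code` fields are omitted because they do not affect the outcome.

```python
def effective_tombstones(records, upto):
    """Return the keys still tombstoned after replaying records up to `upto`.

    records: iterable of (logseq, kind, key, ref) sorted by logseq, where
    kind is "tombstone" or "lift".
    """
    active = {}  # tombstone logseq -> key
    for logseq, kind, key, ref in records:
        if logseq > upto:
            break  # a later lift takes effect only when replay reaches it
        if kind == "tombstone":
            active[logseq] = key
        elif kind == "lift" and active.get(ref) == key:
            # Cancel only the referenced tombstone for the same artifact.
            del active[ref]
    return set(active.values())


records = [
    (1, "tombstone", "a", None),
    (2, "tombstone", "a", None),
    (3, "lift", "a", 1),
    (4, "lift", "a", 2),
]
after_3 = effective_tombstones(records, 3)
after_4 = effective_tombstones(records, 4)
```

After position 3 the key is still shadowed by the second tombstone; only the lift at position 4 clears it.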

11. Small vs Large Block Handling

11.1 Definitions

| Term | Meaning |
|---|---|
| Small block | Block containing artifact bytes below a threshold T_small. |
| Large block | Block containing artifact bytes ≥ T_small. |
| Mixed segment | Segment containing both small and large blocks (discouraged). |
| Packing | Combining multiple small artifacts into a single physical block. |
| BlockID | Opaque identifier for a block; addressing is identical for all sizes. |

Small vs large classification is store-level only and transparent to ASL-CORE and index layers. T_small is configurable per deployment.

11.2 Packing Rules

  1. Small blocks may be packed together to reduce storage overhead.
  2. Large blocks are never packed with other artifacts.
  3. Mixed segments are allowed but discouraged; implementations MAY warn when mixing occurs.

11.3 Segment Allocation Rules

  1. Small blocks are allocated into segments optimized for packing efficiency.
  2. Large blocks are allocated into segments optimized for sequential I/O.
  3. Segment sealing and visibility rules remain unchanged.

11.4 Indexing and Addressing

All blocks are addressed uniformly:

ArtifactExtent = (BlockID, offset, length)
ArtifactLocation = [ArtifactExtent...]

Packing does not affect index semantics or determinism. Multi-extent ArtifactLocations are allowed.
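
Informative sketch (non-normative): packing small artifacts into one block while giving each large artifact its own block, with uniform extent addressing in both cases. The function `pack_small`, the block names `packed-0` and `large-<name>`, and the use of a single packed block are assumptions chosen to keep the example short.

```python
def pack_small(artifacts, t_small):
    """Place artifacts into blocks and return (blocks, locations).

    artifacts: dict of name -> bytes. Artifacts shorter than t_small are
    packed into one shared block; others get a dedicated block.
    """
    blocks = {}
    locations = {}
    packed = bytearray()
    for name, data in artifacts.items():
        if len(data) < t_small:
            # Extent points into the shared packed block.
            locations[name] = [("packed-0", len(packed), len(data))]
            packed += data
        else:
            block_id = f"large-{name}"
            blocks[block_id] = bytes(data)
            locations[name] = [(block_id, 0, len(data))]
    if packed:
        blocks["packed-0"] = bytes(packed)
    return blocks, locations


artifacts = {"a": b"hi", "b": b"x" * 10, "c": b"yo"}
blocks, locations = pack_small(artifacts, t_small=4)
```

A reader resolves `a`, `b`, and `c` the same way, by following `(BlockID, offset, length)`, without knowing which block is packed.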

11.5 GC and Retention

  1. Packed small blocks can be reclaimed only when all contained artifacts are unreachable.
  2. Large blocks are reclaimed per block.

Invariant: GC must never remove bytes still referenced by CURRENT or snapshots.


12. Crash and Recovery Semantics

  • Open segments or unsealed blocks may be lost; no invariant is broken.

  • Recovery procedure:

    1. Mount last checkpoint snapshot.
    2. Replay append-only log from checkpoint.
    3. Reconstruct CURRENT.
  • Recovery is deterministic and idempotent.

  • Segments and blocks are never partially visible after a crash.
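
Informative sketch (non-normative): one way recovery can decide which on-disk segments to keep. Only segments whose seal record appears in the durable log are retained; anything else is treated as an open segment lost in the crash. The function `recover_segments` and the segment names are hypothetical.

```python
def recover_segments(on_disk_segments, log_seal_records):
    """Split on-disk segments into (kept, discarded) after a crash.

    on_disk_segments: iterable of segment IDs found on disk.
    log_seal_records: segment IDs whose seal is recorded in the log.
    """
    sealed_ids = set(log_seal_records)
    kept = sorted(s for s in on_disk_segments if s in sealed_ids)
    discarded = sorted(s for s in on_disk_segments if s not in sealed_ids)
    return kept, discarded


kept, discarded = recover_segments(
    {"seg-1", "seg-2", "seg-3"},  # seg-3 was open when the crash happened
    ["seg-1", "seg-2"],
)
```

Running the same procedure again on the kept set changes nothing, which matches the idempotence requirement.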


13. Normative Invariants

  1. Sealed blocks are immutable.
  2. Index entries referencing blocks are immutable once visible.
  3. Shadowing follows strict log order.
  4. Replay of snapshot + log uniquely reconstructs CURRENT.
  5. GC cannot remove blocks or segments needed by snapshot or CURRENT.
  6. Tombstones shadow prior entries without deleting underlying blocks prematurely.
  7. IndexState (SnapshotID, LogPosition) uniquely identifies CURRENT.

14. Non-Goals

  • Disk-level encoding (ENC-ASL-CORE-INDEX).
  • Memory layout or caching.
  • Sharding or performance heuristics.
  • Federation / multi-domain semantics (handled elsewhere).
  • Block packing strategies beyond the policy rules here.

15. Relationship to Other Layers

| Layer | Responsibility |
|---|---|
| ASL-CORE | Artifact semantics, existence of blocks, immutability |
| ASL-CORE-INDEX | Semantic mapping of ArtifactKey → ArtifactLocation |
| ASL-STORE-INDEX | Lifecycle and operational contracts for blocks and segments |
| ENC-ASL-CORE-INDEX | Bytes-on-disk layout for segments, index records, and optional bloom filters |

16. Summary

The tier1 ASL-STORE-INDEX specification:

  • Defines block lifecycle and segment lifecycle.
  • Makes snapshot identity and log positions explicit for replay.
  • Ensures deterministic visibility, lookup, and crash recovery.
  • Formalizes GC safety and tombstone behavior.
  • Adds clear small vs large block handling without changing core semantics.