Carl Niklas Rydberg c595e2370a Clarify ASL index/log semantics

2026-01-17 11:46:57 +01:00

ASL/STORE-INDEX/1 — Store Semantics and Contracts for ASL Core Index

Status: Draft Owner: Niklas Rydberg Version: 0.1.0 SoT: No Last Updated: 2025-11-16 Linked Phase Pack: N/A Tags: [deterministic, index, log, storage]

Document ID: ASL/STORE-INDEX/1 Layer: L1 — Store lifecycle and replay contracts (no encoding)

Depends on (normative):

ASL/1-CORE-INDEX — semantic index model
ASL/LOG/1 — append-only log semantics

Informative references:

ENC/ASL-CORE-INDEX/1 — index segment encoding
ASL/SYSTEM/1 — unified system view (PEL/TGK/federation alignment)
TGK/1 — TGK semantics and visibility alignment
TGK/1-CORE — EdgeBody and EdgeTypeId definitions

License

Except where otherwise noted, this document (text and diagrams) is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

The identifier registries and mapping tables (e.g. TypeTag IDs, HashId assignments, EdgeTypeId tables) are additionally made available under CC0 1.0 Universal (CC0) to enable unrestricted reuse in implementations and derivative specifications.

Code examples in this document are provided under the Apache License 2.0 unless explicitly stated otherwise. Test vectors, where present, are dedicated to the public domain under CC0 1.0.

1. Purpose

This document defines the operational and store-level semantics required to implement ASL-CORE-INDEX.

It specifies:

Block lifecycle: creation, sealing, retention, GC
Index segment lifecycle: creation, append, seal, visibility
Snapshot identity and log positions for deterministic replay
Append-only log semantics
Lookup, visibility, and crash recovery rules
Small vs large block handling

It does not define encoding (see ENC/ASL-CORE-INDEX/1) or semantic mapping (see ASL/1-CORE-INDEX).

2. Scope

Covers:

Lifecycle of blocks and index entries
Snapshot and CURRENT consistency guarantees
Deterministic replay and recovery
GC and tombstone semantics
Packing policy for small vs large artifacts

Excludes:

Disk-level encoding
Sharding or acceleration strategies (see ASL/INDEX-ACCEL/1)
Memory residency or caching
Federation, PEL, or TGK semantics (see TGK/1 and TGK/1-CORE)

3. Core Concepts

3.1 Block

Definition: Immutable storage unit containing artifact bytes.
Identifier: BlockID (opaque, unique).
Properties:
- Once sealed, contents never change.
- Can be referenced by multiple artifacts.
- May be pinned by snapshots for retention.
- Allocation method is implementation-defined (e.g., hash or sequence).

3.2 Index Segment

Segments group index entries and provide persistence and recovery units.

Open segment: accepting new index entries, not visible for lookup.
Sealed segment: closed for append, log-visible, snapshot-pinnable.
Segment components: header, optional bloom filter, index records, footer.
Segment visibility: only after seal and log append.

3.3 Append-Only Log

All store-visible mutations are recorded in a strictly ordered, append-only log:

Entries include:
- Index additions
- Tombstones
- Segment seals
Log is replayable to reconstruct CURRENT.
Log semantics are defined in ASL/LOG/1.

3.4 Snapshot Identity and Log Position

To make CURRENT referenceable and replayable, ASL-STORE-INDEX defines:

SnapshotID: opaque, immutable identifier for a snapshot.
LogPosition: monotonic integer position in the append-only log.
IndexState: (SnapshotID, LogPosition).

Deterministic replay is defined as:

Index(SnapshotID, LogPosition) = Snapshot[SnapshotID] + replay(log[0:LogPosition])

Snapshots and log positions are required for checkpointing, federation, and deterministic recovery.

Implementation note (determinism): This repository interprets LogPosition as the inclusive logseq upper bound defined by ASL/LOG/1, not a byte offset into the log file. Snapshot anchors use their record logseq as the snapshot's log position.
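
The replay definition above can be sketched as follows. This is a minimal illustration, not the reference implementation: the `LogRecord` and `Snapshot` types and the `index_at` function are hypothetical names, keys and locations are plain strings, only index additions are modelled, and the bound is treated as an inclusive logseq as described in the implementation note. Skipping records at or below the snapshot's own log position is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LogRecord:
    logseq: int    # position in the append-only log
    key: str       # stand-in for an ArtifactKey
    location: str  # stand-in for an ArtifactLocation


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    logseq: int              # log position the snapshot is anchored at
    entries: Dict[str, str]  # index state captured by the snapshot


def index_at(snapshot: Snapshot, log: List[LogRecord], log_position: int) -> Dict[str, str]:
    """Rebuild the index for (SnapshotID, LogPosition) from snapshot + log prefix."""
    if log_position < snapshot.logseq:
        raise ValueError("log position precedes the snapshot anchor")
    state = dict(snapshot.entries)  # never mutate the snapshot itself
    # Apply records in logseq order; the upper bound is inclusive.
    for rec in sorted(log, key=lambda r: r.logseq):
        if snapshot.logseq < rec.logseq <= log_position:
            state[rec.key] = rec.location
    return state
```

Because the result depends only on the snapshot and the log prefix, two replicas holding the same pair would derive the same index regardless of the order in which they read the records from disk.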

3.5 Artifact Location

ArtifactExtent: (BlockID, offset, length) identifying a byte slice within a block.
ArtifactLocation: ordered list of ArtifactExtent values that, when concatenated, produce the artifact bytes.
Multi-extent locations allow a single artifact to be striped across multiple blocks.
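
As an illustration of how a multi-extent location resolves to bytes, the sketch below concatenates extents in order. The `ArtifactExtent` tuple mirrors the definition above; the `read_location` helper and the dictionary of sealed blocks are assumptions made for the example.

```python
from typing import Dict, List, NamedTuple


class ArtifactExtent(NamedTuple):
    block_id: str  # stand-in for an opaque BlockID
    offset: int
    length: int


def read_location(blocks: Dict[str, bytes], location: List[ArtifactExtent]) -> bytes:
    """Concatenate the byte slices named by each extent, in list order."""
    parts = []
    for ext in location:
        data = blocks[ext.block_id]
        if ext.offset < 0 or ext.length < 0 or ext.offset + ext.length > len(data):
            raise ValueError(f"extent out of range for block {ext.block_id}")
        parts.append(data[ext.offset:ext.offset + ext.length])
    return b"".join(parts)
```

For example, with blocks `{"b1": b"hello world", "b2": b"!!"}`, the location `[("b1", 0, 5), ("b2", 0, 1), ("b1", 5, 6)]` reads back as `b"hello! world"`.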

4. PUT/GET Contract (Normative)

4.1 PUT Signature

put(artifact) -> (ArtifactKey, IndexState)

ArtifactKey is the content identity (ASL/1-CORE-INDEX).
IndexState = (SnapshotID, LogPosition) after the PUT is admitted.

4.2 PUT Semantics

Structural registration (if applicable): if a structural index (SID -> DAG) exists, it MUST register the artifact and reuse existing SID entries.
Materialization (if applicable): if the artifact is lazy, materialize deterministically to derive ArtifactKey.
Deduplication: lookup ArtifactKey at CURRENT. If present, PUT MUST succeed without writing bytes or adding a new index entry.
Storage: if absent, write bytes to one or more blocks, seal those blocks, and produce ArtifactLocation.
Index mutation: append an index entry mapping ArtifactKey -> ArtifactLocation and record visibility via log order.

4.3 PUT Guarantees

PUT is idempotent for identical artifacts.
No visible index entry points to mutable or missing bytes.
Visibility follows log order and seal rules defined in this document.
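
The deduplication and idempotence rules can be illustrated with a small in-memory store. This is a sketch under stated assumptions: `TinyStore` is a hypothetical class, SHA-256 stands in for the ArtifactKey derivation that ASL/1-CORE-INDEX actually defines, each artifact gets its own block, and structural registration and lazy materialization are omitted.

```python
import hashlib


class TinyStore:
    def __init__(self, snapshot_id: str = "snap-0"):
        self.snapshot_id = snapshot_id
        self.blocks = {}  # BlockID -> sealed bytes
        self.index = {}   # ArtifactKey -> list of (BlockID, offset, length)
        self.log = []     # append-only list of (logseq, ArtifactKey)

    def put(self, artifact: bytes):
        # Assumption: a SHA-256 hex digest stands in for the real ArtifactKey.
        key = hashlib.sha256(artifact).hexdigest()
        if key not in self.index:
            block_id = f"block-{len(self.blocks)}"
            # Bytes are stored as an immutable value before any index entry exists.
            self.blocks[block_id] = bytes(artifact)
            self.log.append((len(self.log) + 1, key))
            self.index[key] = [(block_id, 0, len(artifact))]
        # A duplicate PUT falls through: no bytes written, no new index entry.
        return key, (self.snapshot_id, len(self.log))
```

Calling `put` twice with the same bytes returns the same key and the same IndexState, and leaves exactly one block and one log record behind.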

4.4 GET Signature

get(ArtifactKey, IndexState?) -> bytes | NOT_FOUND

IndexState defaults to CURRENT when omitted.

4.5 GET Semantics

Resolve ArtifactKey -> ArtifactLocation using Index(snapshot, log_prefix).
If no entry exists, return NOT_FOUND.
Otherwise, read exactly the referenced (BlockID, offset, length) bytes and return them verbatim.

GET MUST NOT mutate state or trigger materialization.
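
A read-only GET against a log prefix might look like the sketch below. The flat list of `(logseq, key, location)` entries, the `NOT_FOUND` sentinel, and the `get` signature are assumptions for illustration; a real store would resolve through segments rather than a flat list.

```python
NOT_FOUND = object()  # sentinel, distinct from an artifact of zero bytes


def get(entries, blocks, key, log_position=None):
    """entries: list of (logseq, key, [(block_id, offset, length), ...])."""
    if log_position is None:
        # Default to CURRENT: the highest log position recorded so far.
        log_position = max((logseq for logseq, _, _ in entries), default=0)
    match = None
    for logseq, entry_key, location in entries:
        if entry_key == key and logseq <= log_position:
            if match is None or logseq > match[0]:
                match = (logseq, location)
    if match is None:
        return NOT_FOUND
    # Read exactly the referenced slices; nothing is written or materialized.
    return b"".join(blocks[b][off:off + n] for b, off, n in match[1])
```

Passing an older log position returns what was visible at that point, which is what makes reads reproducible across replicas.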

4.6 Failure Semantics

Partial writes MUST NOT become visible.
Replay of snapshot + log after crash MUST reconstruct a valid CURRENT.
Implementations MAY use caching, but MUST preserve determinism.

5. Block Lifecycle Semantics

Event	Description	Semantic Guarantees
Creation	Block allocated; bytes may be written	Not visible to index until sealed
Sealing	Block is finalized and immutable	Sealed blocks are stable and safe to reference from index
Retention	Block remains accessible	Blocks referenced by snapshots or CURRENT must not be removed
Garbage Collection	Block may be deleted	Only unpinned, unreachable blocks may be removed

Notes:

Sealing ensures any index entry referencing the block is immutable.
Retention is driven by snapshot and log visibility rules.
GC must never violate CURRENT reconstruction guarantees.
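
The creation and sealing rows of the table can be read as a two-state machine. The `Block` class below is a hypothetical sketch of that reading: writes are accepted only while the block is open, and reads through the index are accepted only once it is sealed.

```python
class Block:
    def __init__(self, block_id: str):
        self.block_id = block_id
        self._buf = bytearray()
        self._data = None  # set on seal; immutable bytes from then on

    @property
    def sealed(self) -> bool:
        return self._data is not None

    def write(self, data: bytes) -> int:
        """Append bytes to an open block and return their offset."""
        if self.sealed:
            raise RuntimeError("sealed blocks are immutable")
        offset = len(self._buf)
        self._buf += data
        return offset

    def seal(self) -> None:
        """Finalize the block; calling seal again has no effect."""
        if not self.sealed:
            self._data = bytes(self._buf)

    def read(self, offset: int, length: int) -> bytes:
        if not self.sealed:
            raise RuntimeError("unsealed blocks are not readable via the index")
        return self._data[offset:offset + length]
```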

6. Segment Lifecycle Semantics

6.1 Creation

Open segment is allocated.
Index entries appended in log order.
Entries are invisible until segment seal and log append.

6.2 Seal

Segment is closed to append.
Seal record is written to append-only log.
Segment becomes visible for lookup.
Sealed segment may be snapshot-pinned.

6.3 Snapshot Interaction

Snapshots capture sealed segments.
Open segments need not survive snapshot.
Segments below snapshot are replay anchors.

7. Visibility and Lookup Semantics

7.1 Visibility Rules

Entry visible iff:
- The block is sealed.
- Log record exists at position ≤ CURRENT.
- Segment seal recorded in log.
Entries above CURRENT or referencing unsealed blocks are invisible.
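
The three conditions can be written as a single predicate. The function name and parameters below are hypothetical, and requiring the seal record itself to be at or below CURRENT is an assumption of this sketch, inferred from the lookup rule that only segments at or below CURRENT are searched.

```python
from typing import Optional


def entry_visible(entry_logseq: int,
                  block_sealed: bool,
                  segment_seal_logseq: Optional[int],
                  current: int) -> bool:
    """Return True only when every visibility condition holds."""
    return (
        block_sealed                          # the referenced block is sealed
        and entry_logseq <= current           # the entry's log record is within CURRENT
        and segment_seal_logseq is not None   # a seal record exists in the log
        and segment_seal_logseq <= current    # assumption: the seal is within CURRENT too
    )
```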

7.2 Lookup Semantics

To resolve an ArtifactKey:

Identify all visible segments ≤ CURRENT.
Search segments in reverse seal-log order (highest seal log position first).
Return first matching entry.
Respect tombstones to shadow prior entries.

Determinism:

Lookup results are identical across platforms given the same snapshot and log prefix.
Accelerations (bloom filters, sharding, SIMD) do not alter correctness.
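
The lookup steps above can be sketched as a newest-first scan over sealed segments. The `Segment` tuple, the `TOMBSTONE` sentinel, and the `lookup` function are illustrative names; the sketch assumes at most one record per key within a segment and ignores accelerators such as bloom filters, which by the rule above must not change the result.

```python
from typing import Dict, List, NamedTuple

TOMBSTONE = object()  # marker record that shadows older entries


class Segment(NamedTuple):
    seal_logseq: int            # log position of the segment's seal record
    records: Dict[str, object]  # ArtifactKey -> location, or TOMBSTONE


def lookup(segments: List[Segment], key: str, current: int):
    """Return the location for key at CURRENT, or None if absent or shadowed."""
    visible = [s for s in segments if s.seal_logseq <= current]
    # Highest seal log position first, so newer records shadow older ones.
    for seg in sorted(visible, key=lambda s: s.seal_logseq, reverse=True):
        if key in seg.records:
            hit = seg.records[key]
            return None if hit is TOMBSTONE else hit
    return None
```

Sorting by seal position inside the function means the answer does not depend on the order in which segments happen to be listed or loaded.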

8. Snapshot Interaction

Snapshots capture the set of sealed blocks and sealed index segments at a point in time.
Blocks referenced by a snapshot are pinned and cannot be garbage-collected until snapshot expiration.
CURRENT is reconstructed as:

CURRENT = snapshot_state + replay(log)

Segment and block visibility rules:

Entity	Visible in snapshot	Visible in CURRENT
Open segment/block	No	Only after seal and log append
Sealed segment/block	Yes, if included in snapshot	Yes, replayed from log
Tombstone	Yes, if log-recorded	Yes, shadows prior entries

9. Garbage Collection

Eligibility for GC:

Segments: sealed, no references from CURRENT or snapshots.
Blocks: unpinned, unreferenced by any segment or artifact.

Rules:

GC is safe only on sealed segments and blocks.
Must respect snapshot pins.
Tombstones may aid in invalidating unreachable blocks.
Snapshots retained for provenance or receipt verification MUST remain pinned.

Outcome:

GC never violates CURRENT reconstruction.
Blocks can be reclaimed without breaking provenance.
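
Block eligibility can be illustrated as a set difference between sealed blocks and everything still referenced. The `collectable_blocks` function and its argument shapes are assumptions for the example; pins are modelled simply as the index contents of each retained snapshot.

```python
def collectable_blocks(sealed_blocks, current_index, snapshot_indexes):
    """Return the sealed BlockIDs referenced by neither CURRENT nor any snapshot.

    current_index and each snapshot index map
    ArtifactKey -> [(block_id, offset, length), ...].
    """
    live = set()
    for index in [current_index, *snapshot_indexes]:
        for location in index.values():
            live.update(block_id for block_id, _, _ in location)
    return set(sealed_blocks) - live
```

A packed block stays in the live set as long as any one artifact in it is still referenced, so it is only returned once every artifact it contains is unreachable.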

10. Tombstone Semantics

Optional marker to invalidate prior mappings.
Visibility rules identical to regular index entries.
Used to maintain deterministic CURRENT in the face of shadowing or deletions.
scope and reason_code are policy metadata only; they do not affect shadowing order or replay determinism.
Tombstone lifts cancel only the referenced tombstone record for the same artifact; other tombstones remain effective until lifted.
Snapshot + log replay applies tombstones and lifts in logseq order; a lift that occurs after a snapshot becomes effective only when replay reaches its logseq.
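
One possible reading of the tombstone and lift rules is sketched below. The record layout (dictionaries with `kind`, `logseq`, `key`, and a `target` logseq on lifts) and the `resolve` function are hypothetical; ASL/LOG/1 remains the authority on the actual record semantics.

```python
def resolve(records, key, log_position):
    """Return the visible location for key at log_position, or None."""
    adds = {}        # logseq -> location for index additions
    active = set()   # logseq of tombstones not yet lifted
    for rec in sorted(records, key=lambda r: r["logseq"]):
        if rec["key"] != key or rec["logseq"] > log_position:
            continue
        if rec["kind"] == "add":
            adds[rec["logseq"]] = rec["location"]
        elif rec["kind"] == "tombstone":
            active.add(rec["logseq"])
        elif rec["kind"] == "lift":
            # A lift cancels only the tombstone it names.
            active.discard(rec["target"])
    if not adds:
        return None
    newest = max(adds)
    # Any remaining tombstone later than the newest addition still shadows it.
    if any(t > newest for t in active):
        return None
    return adds[newest]
```

With two tombstones on the same artifact, lifting one leaves the artifact shadowed until the other is lifted as well, and a lift only takes effect once replay reaches its logseq.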

11. Small vs Large Block Handling

11.1 Definitions

Term	Meaning
Small block	Block containing artifact bytes below a threshold `T_small`.
Large block	Block containing artifact bytes ≥ `T_small`.
Mixed segment	Segment containing both small and large blocks (discouraged).
Packing	Combining multiple small artifacts into a single physical block.
BlockID	Opaque identifier for a block; addressing is identical for all sizes.

Small vs large classification is store-level only and transparent to ASL-CORE and index layers. T_small is configurable per deployment.

11.2 Packing Rules

Small blocks may be packed together to reduce storage overhead.
Large blocks are never packed with other artifacts.
Mixed segments are allowed but discouraged; implementations MAY warn when mixing occurs.

11.3 Segment Allocation Rules

Small blocks are allocated into segments optimized for packing efficiency.
Large blocks are allocated into segments optimized for sequential I/O.
Segment sealing and visibility rules remain unchanged.

11.4 Indexing and Addressing

All blocks are addressed uniformly:

ArtifactExtent = (BlockID, offset, length)
ArtifactLocation = [ArtifactExtent...]

Packing does not affect index semantics or determinism. Multi-extent ArtifactLocations are allowed.
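
To show that packing leaves addressing uniform, the sketch below places artifacts smaller than the threshold into one shared block and gives every other artifact its own block. The `pack` function, the block naming, and the single shared packed block are simplifying assumptions; a real store would bound packed block size and seal blocks explicitly.

```python
def pack(artifacts, t_small):
    """artifacts: list of (key, bytes). Returns (blocks, locations)."""
    blocks = {}     # BlockID -> bytes
    locations = {}  # key -> [(block_id, offset, length)]
    packed = bytearray()
    packed_id = "packed-0"
    for key, data in artifacts:
        if len(data) < t_small:
            # Small artifact: append to the shared block at the next free offset.
            locations[key] = [(packed_id, len(packed), len(data))]
            packed += data
        else:
            # Large artifact: a dedicated block, never shared.
            block_id = f"large-{len(blocks)}"
            blocks[block_id] = bytes(data)
            locations[key] = [(block_id, 0, len(data))]
    if packed:
        blocks[packed_id] = bytes(packed)
    return blocks, locations
```

Both kinds of artifact end up with the same `(BlockID, offset, length)` shape, so the index layer never needs to know which policy produced a given location.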

11.5 GC and Retention

Packed small blocks can be reclaimed only when all contained artifacts are unreachable.
Large blocks are reclaimed per block.

Invariant: GC must never remove bytes still referenced by CURRENT or snapshots.

12. Crash and Recovery Semantics

Open segments or unsealed blocks may be lost; no invariant is broken.
Recovery procedure:
1. Mount last checkpoint snapshot.
2. Replay append-only log from checkpoint.
3. Reconstruct CURRENT.
Recovery is deterministic and idempotent.
Segments and blocks are never partially visible after a crash.
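
The recovery procedure can be sketched as a replay that only publishes entries once their segment's seal record is seen. The record layout and the `recover` function are hypothetical, and the sketch assumes the checkpoint was taken at a point where no segment was left open across it.

```python
def recover(checkpoint, log):
    """checkpoint: {"logseq": int, "index": dict}; log: list of record dicts."""
    index = dict(checkpoint["index"])  # step 1: mount the checkpoint snapshot
    pending = {}                       # segment id -> entries awaiting a seal
    # Step 2: replay records after the checkpoint, in logseq order.
    for rec in sorted(log, key=lambda r: r["logseq"]):
        if rec["logseq"] <= checkpoint["logseq"]:
            continue
        if rec["kind"] == "entry":
            pending.setdefault(rec["segment"], {})[rec["key"]] = rec["location"]
        elif rec["kind"] == "seal":
            # Entries become visible only together with their segment seal.
            index.update(pending.pop(rec["segment"], {}))
    # Step 3: whatever is left in pending belonged to open segments and is dropped.
    return index
```

Running `recover` again on the same checkpoint and log yields the same index, and entries from a segment that never reached its seal record do not appear in it.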

13. Normative Invariants

Sealed blocks are immutable.
Index entries referencing blocks are immutable once visible.
Shadowing follows strict log order.
Replay of snapshot + log uniquely reconstructs CURRENT.
GC cannot remove blocks or segments needed by snapshot or CURRENT.
Tombstones shadow prior entries without deleting underlying blocks prematurely.
IndexState (SnapshotID, LogPosition) uniquely identifies CURRENT.

14. Non-Goals

Disk-level encoding (ENC-ASL-CORE-INDEX).
Memory layout or caching.
Sharding or performance heuristics.
Federation / multi-domain semantics (handled elsewhere).
Block packing strategies beyond the policy rules here.

15. Relationship to Other Layers

Layer	Responsibility
ASL-CORE	Artifact semantics, existence of blocks, immutability
ASL-CORE-INDEX	Semantic mapping of ArtifactKey → ArtifactLocation
ASL-STORE-INDEX	Lifecycle and operational contracts for blocks and segments
ENC-ASL-CORE-INDEX	Bytes-on-disk layout for segments, index records, and optional bloom filters

16. Summary

The Tier 1 ASL-STORE-INDEX specification:

Defines block lifecycle and segment lifecycle.
Makes snapshot identity and log positions explicit for replay.
Ensures deterministic visibility, lookup, and crash recovery.
Formalizes GC safety and tombstone behavior.
Adds clear small vs large block handling without changing core semantics.
