amduat-api/notes/multiple indexes.md
2026-01-17 00:19:49 +01:00


Yes — there are **multiple indexes**, and the key to keeping this clean is to **separate them by semantic responsibility**, exactly like your ASL-CORE-INDEX spec already does.
Below is a **concrete, minimal implementation plan** that fits *perfectly* with your current `artifact.py` and keeps q-bits cheap.
I'll give you:
1. The **index taxonomy** (what indexes exist and why)
2. A **clean architecture** (who depends on whom)
3. **Concrete Python implementations** for each index
4. How they interact during `put()` and `get()`
No hand-waving.

---
# 1. The indexes you actually need
You need **three distinct indexes**, not one.
## A. Content Index (CID → ArtifactLocation)
**This *is* ASL-CORE-INDEX**
Purpose:
> Given a semantic artifact identity, where are the bytes?
```text
CID → (BlockID, offset, length)
```
Properties:
* Snapshot-relative
* Append-only
* Immutable
* Shadowable (optional)
This is the *storage-facing* index.

---
## B. Structural Index (SID → DAG node)
**This is your PEL / derivation index**
Purpose:
> Given a structure identity, what DAG node exists?
```text
SID → Artifact (structure + children + params)
```
Properties:
* In-memory or persisted
* Deterministic
* Rebuildable
* Does NOT imply materialization
This index is what lets you:
* compare derivations
* reuse structure
* deduplicate DAG nodes
* avoid rebuilding identical DAGs
---
## C. Materialization Cache (SID → CID)
**This is the execution shortcut**
Purpose:
> If I already materialized this structure, what content did it produce?
```text
SID → CID
```
Properties:
* Fully redundant
* Recomputable from DAG + content index
* Safe to drop entirely
* Huge performance win
This is the cache you already implemented; now we formalize it.

---
# 2. Dependency graph (this matters)
```text
┌─────────────┐
│ Structural  │
│ Index       │  SID → DAG
└──────┬──────┘
       │ materialize()
┌──────▼──────┐
│ Materialize │
│ Cache       │  SID → CID
└──────┬──────┘
┌──────▼──────┐
│ Content     │
│ Index       │  CID → bytes
└─────────────┘
```
**Important invariants**:
* Content index never depends on DAG
* Structural index never depends on bytes
* Cache depends on both, but is optional
This keeps q-bits cheap.

---
# 3. Concrete Python implementations
## A. Content Index (ASL-CORE-INDEX)
```python
# content_index.py
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ArtifactLocation:
    block_id: str
    offset: int
    length: int


class ContentIndex:
    """
    Semantic index: CID -> ArtifactLocation
    """

    def __init__(self) -> None:
        self._index: Dict[str, ArtifactLocation] = {}

    def get(self, cid: str) -> Optional[ArtifactLocation]:
        return self._index.get(cid)

    def put(self, cid: str, loc: ArtifactLocation) -> None:
        # Immutable once visible: the first location recorded for a CID wins.
        if cid in self._index:
            return
        self._index[cid] = loc
```
This is your **ASL-CORE-INDEX** in executable form.

---
## B. Structural Index (PEL / DAG index)
```python
# structural_index.py
from typing import Dict, Optional

from artifact import Artifact


class StructuralIndex:
    """
    SID -> Artifact (DAG node)
    """

    def __init__(self) -> None:
        self._nodes: Dict[str, Artifact] = {}

    def get(self, sid: str) -> Optional[Artifact]:
        return self._nodes.get(sid)

    def put(self, artifact: Artifact) -> Artifact:
        """
        Deduplicate DAG nodes by SID.
        """
        existing = self._nodes.get(artifact.sid)
        if existing is not None:
            # Return the canonical node so callers share one object per SID.
            return existing
        self._nodes[artifact.sid] = artifact
        return artifact
```
This ensures:
* One DAG node per SID
* Structural deduplication
* Cheap comparisons
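
To make the deduplication concrete, here is a small usage sketch. The `Artifact` class below is a hypothetical stand-in for the one in `artifact.py`, and its SID scheme (a SHA-256 over the operation name, the child SIDs, and the sorted params) is an assumption for illustration only, not the real definition. `StructuralIndex` is repeated in compact form so the example runs on its own.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Artifact:
    """Hypothetical stand-in for the real Artifact in artifact.py."""
    op: str
    children: Tuple["Artifact", ...] = ()
    params: Tuple[Tuple[str, str], ...] = ()

    @property
    def sid(self) -> str:
        # Assumed scheme: hash the structure only, never any bytes.
        payload = repr((
            self.op,
            tuple(child.sid for child in self.children),
            tuple(sorted(self.params)),
        ))
        return hashlib.sha256(payload.encode()).hexdigest()


class StructuralIndex:
    """SID -> Artifact, same behavior as the class above."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Artifact] = {}

    def get(self, sid: str) -> Optional[Artifact]:
        return self._nodes.get(sid)

    def put(self, artifact: Artifact) -> Artifact:
        existing = self._nodes.get(artifact.sid)
        if existing is not None:
            return existing
        self._nodes[artifact.sid] = artifact
        return artifact


index = StructuralIndex()
leaf = index.put(Artifact("load", params=(("path", "a.bin"),)))
first = index.put(Artifact("normalize", children=(leaf,)))

# Built independently, but structurally identical, so the SID matches
# and the index hands back the node that is already registered.
rebuilt_leaf = Artifact("load", params=(("path", "a.bin"),))
second = index.put(Artifact("normalize", children=(rebuilt_leaf,)))

same_node = second is first
node_count = len(index._nodes)
```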
---
## C. Materialization Cache (formalized)
You already have this; here's the final clean version:
```python
# materialization_cache.py
from typing import Dict, Optional


class MaterializationCache:
    """
    Fully redundant cache: SID -> CID
    """

    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}

    def get(self, sid: str) -> Optional[str]:
        return self._cache.get(sid)

    def put(self, sid: str, cid: str) -> None:
        # Safe to overwrite: the same SID always materializes to the same CID.
        self._cache[sid] = cid
```
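Because the cache is fully redundant, dropping it should cost only recomputation, never correctness. The sketch below illustrates that under a few assumptions: `clear()`, `cid_for()`, and `fake_materialize()` are hypothetical helpers added for the example, and the fake materializer stands in for real DAG evaluation.

```python
from typing import Callable, Dict, List, Optional


class MaterializationCache:
    """SID -> CID, as above, plus a hypothetical clear() for the demo."""

    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}

    def get(self, sid: str) -> Optional[str]:
        return self._cache.get(sid)

    def put(self, sid: str, cid: str) -> None:
        self._cache[sid] = cid

    def clear(self) -> None:
        self._cache.clear()


def cid_for(sid: str, cache: MaterializationCache,
            materialize: Callable[[str], str]) -> str:
    """Return the CID for a SID, materializing only on a cache miss."""
    cid = cache.get(sid)
    if cid is None:
        cid = materialize(sid)
        cache.put(sid, cid)
    return cid


calls: List[str] = []


def fake_materialize(sid: str) -> str:
    # Stand-in for evaluating the DAG and hashing the resulting bytes.
    calls.append(sid)
    return "cid-of-" + sid


cache = MaterializationCache()
a = cid_for("sid-1", cache, fake_materialize)  # miss: materializes
b = cid_for("sid-1", cache, fake_materialize)  # hit: no work
cache.clear()                                  # drop the whole cache
c = cid_for("sid-1", cache, fake_materialize)  # miss again, same answer
```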
---
# 4. How `put()` actually works (step-by-step)
This is the **critical flow**.
### `put(artifact)`
```text
1. StructuralIndex.put(artifact)
   └─ ensures DAG node exists
2. Try MaterializationCache.get(artifact.sid)
   └─ if hit → CID known, skip computation
3. If miss:
   a) Materialize DAG
   b) Compute CID
   c) Cache SID → CID
4. ContentIndex.get(CID)
   └─ if exists → done (dedup)
5. Else:
   a) Allocate block space
   b) Write bytes
   c) ContentIndex.put(CID → location)
```
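The steps above can be sketched as a single store. This is a minimal illustration, not the final `ASLStore`: the `Node` type, the single in-memory block, the `materializations` counter, and the choice of SHA-256 over the bytes as the CID are all assumptions made for the example. Plain dicts stand in for the three index classes to keep it short, and a matching `get()` is included.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class ArtifactLocation:
    block_id: str
    offset: int
    length: int


@dataclass(frozen=True)
class Node:
    """Hypothetical DAG node: a SID plus a function producing its bytes."""
    sid: str
    materialize: Callable[[], bytes]


class ASLStore:
    def __init__(self) -> None:
        self.structural: Dict[str, Node] = {}            # SID -> DAG node
        self.cache: Dict[str, str] = {}                  # SID -> CID
        self.content: Dict[str, ArtifactLocation] = {}   # CID -> location
        self.block = bytearray()                         # one in-memory block
        self.materializations = 0                        # demo counter

    def _materialize(self, node: Node) -> bytes:
        self.materializations += 1
        return node.materialize()

    def put(self, node: Node) -> str:
        # 1. Ensure the DAG node exists (dedup by SID).
        node = self.structural.setdefault(node.sid, node)
        # 2. Cache hit: the CID is known, so computation is skipped.
        cid = self.cache.get(node.sid)
        data: Optional[bytes] = None
        if cid is None:
            # 3. Miss: materialize, compute the CID, cache SID -> CID.
            data = self._materialize(node)
            cid = hashlib.sha256(data).hexdigest()
            self.cache[node.sid] = cid
        # 4. Content already stored: done (dedup).
        if cid in self.content:
            return cid
        # 5. Otherwise allocate space, write bytes, publish the location.
        if data is None:
            # Cache knew the CID but the bytes were never stored.
            data = self._materialize(node)
        offset = len(self.block)
        self.block += data
        self.content[cid] = ArtifactLocation("block-0", offset, len(data))
        return cid

    def get(self, cid: str) -> Optional[bytes]:
        loc = self.content.get(cid)
        if loc is None:
            return None
        return bytes(self.block[loc.offset:loc.offset + loc.length])


store = ASLStore()
node_a = Node("sid-a", lambda: b"hello")
node_b = Node("sid-b", lambda: b"hello")  # different structure, same bytes

cid_a = store.put(node_a)        # full path: steps 1 through 5
cid_a_again = store.put(node_a)  # steps 2 and 4 both hit, no work
cid_b = store.put(node_b)        # materialized once, bytes not rewritten
```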
**Key property**:
> If *any* index already knows the answer, work stops early.
That's how you recover classical performance.
---
# 5. Why this works for q-bits without cost
Because:
* q-bits only affect **step 3**
* steps 1, 4, 5 are identical to classical bytes
Once materialized:
* q-bit artifacts are indistinguishable from classical artifacts
* index sees only CIDs and byte ranges
No tax after first computation.

---
# 6. What you now have (this is important)
You now have:
| Layer | Capability |
| ---------------- | ------------------------ |
| DAG | Explanation / provenance |
| SID | Structural identity |
| CID | Semantic identity |
| Content index | Storage truth |
| Cache | Performance |
| Snapshots + logs | Time |

This is **not common infrastructure**.
That's why you couldn't find it pre-built.

---
## If you want next
I can:
* add **snapshot + append-only log semantics** to `ContentIndex`
* wire these indexes into a single `ASLStore`
* show **federation / replication** mechanics
* show **LLM → DAG → CID** ingestion
* formalize **PEL traversal queries**
Just point the direction.