amduat-api/notes/multiple indexes.md


Yes — there are **multiple indexes**, and the key to keeping this clean is to **separate them by semantic responsibility**, exactly like your ASL-CORE-INDEX spec already does.
Below is a **concrete, minimal implementation plan** that fits *perfectly* with your current `artifact.py` and keeps q-bits cheap.
I'll give you:
1. The **index taxonomy** (what indexes exist and why)
2. A **clean architecture** (who depends on whom)
3. **Concrete Python implementations** for each index
4. How they interact during `put()` and `get()`
No hand-waving.
---
# 1⃣ The indexes you actually need
You need **three distinct indexes**, not one.
## A. Content Index (CID → ArtifactLocation)
**This *is* ASL-CORE-INDEX**
Purpose:
> Given a semantic artifact identity, where are the bytes?
```text
CID → (BlockID, offset, length)
```
Properties:
* Snapshot-relative
* Append-only
* Immutable
* Shadowable (optional)
This is the *storage-facing* index.
---
## B. Structural Index (SID → DAG node)
**This is your PEL / derivation index**
Purpose:
> Given a structure identity, what DAG node exists?
```text
SID → Artifact (structure + children + params)
```
Properties:
* In-memory or persisted
* Deterministic
* Rebuildable
* Does NOT imply materialization
This index is what lets you:
* compare derivations
* reuse structure
* deduplicate DAG nodes
* avoid rebuilding identical DAGs
---
## C. Materialization Cache (SID → CID)
**This is the execution shortcut**
Purpose:
> If I already materialized this structure, what content did it produce?
```text
SID → CID
```
Properties:
* Fully redundant
* Recomputable from DAG + content index
* Safe to drop entirely
* Huge performance win
This is the cache you already implemented — now we formalize it.
---
# 2⃣ Dependency graph (this matters)
```text
┌──────────────┐
│  Structural  │  SID → DAG
│    Index     │
└──────┬───────┘
       │ materialize()
┌──────▼───────┐
│ Materialize  │  SID → CID
│    Cache     │
└──────┬───────┘
       │
┌──────▼───────┐
│   Content    │  CID → bytes
│    Index     │
└──────────────┘
```
**Important invariant**:
* Content index never depends on DAG
* Structural index never depends on bytes
* Cache depends on both, but is optional
This keeps q-bits cheap.
---
# 3⃣ Concrete Python implementations
## A. Content Index (ASL-CORE-INDEX)
```python
# content_index.py
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ArtifactLocation:
    block_id: str
    offset: int
    length: int


class ContentIndex:
    """
    Semantic index: CID -> ArtifactLocation
    """

    def __init__(self):
        self._index: Dict[str, ArtifactLocation] = {}

    def get(self, cid: str) -> Optional[ArtifactLocation]:
        return self._index.get(cid)

    def put(self, cid: str, loc: ArtifactLocation):
        # Immutable once visible: first write wins
        if cid in self._index:
            return
        self._index[cid] = loc
```
This is your **ASL-CORE-INDEX** in executable form.
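A quick usage sketch of the first-write-wins rule. The CIDs and block coordinates are made up for illustration, and the class is repeated inline so the snippet runs on its own:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ArtifactLocation:
    block_id: str
    offset: int
    length: int


class ContentIndex:
    # As defined above; repeated so this sketch is self-contained.
    def __init__(self):
        self._index: Dict[str, ArtifactLocation] = {}

    def get(self, cid: str) -> Optional[ArtifactLocation]:
        return self._index.get(cid)

    def put(self, cid: str, loc: ArtifactLocation):
        if cid in self._index:
            return
        self._index[cid] = loc


idx = ContentIndex()
idx.put("cid-1", ArtifactLocation("block-0", 0, 128))

# Re-putting the same CID is a no-op: the first location stays visible.
idx.put("cid-1", ArtifactLocation("block-9", 512, 64))
assert idx.get("cid-1").block_id == "block-0"
```

The no-op on conflict is what makes the index safe to replay from an append-only log: replays can never rewrite history.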
---
## B. Structural Index (PEL / DAG index)
```python
# structural_index.py
from typing import Dict, Optional

from artifact import Artifact


class StructuralIndex:
    """
    SID -> Artifact (DAG node)
    """

    def __init__(self):
        self._nodes: Dict[str, Artifact] = {}

    def get(self, sid: str) -> Optional[Artifact]:
        return self._nodes.get(sid)

    def put(self, artifact: Artifact) -> Artifact:
        """
        Deduplicate DAG nodes by SID.
        """
        existing = self._nodes.get(artifact.sid)
        if existing is not None:
            return existing
        self._nodes[artifact.sid] = artifact
        return artifact
```
This ensures:
* One DAG node per SID
* Structural deduplication
* Cheap comparisons
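To make the deduplication concrete, here is a sketch using a minimal stand-in for `Artifact` (your real class in `artifact.py` will differ); the index is repeated inline so the snippet runs on its own:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Artifact:
    # Hypothetical stand-in for artifact.Artifact: just an SID plus
    # child SIDs. Purely illustrative.
    sid: str
    children: Tuple[str, ...] = ()


class StructuralIndex:
    # As defined above; repeated so this sketch is self-contained.
    def __init__(self):
        self._nodes: Dict[str, Artifact] = {}

    def get(self, sid: str) -> Optional[Artifact]:
        return self._nodes.get(sid)

    def put(self, artifact: Artifact) -> Artifact:
        existing = self._nodes.get(artifact.sid)
        if existing is not None:
            return existing
        self._nodes[artifact.sid] = artifact
        return artifact


sidx = StructuralIndex()
a = sidx.put(Artifact("sid-leaf"))
b = sidx.put(Artifact("sid-leaf"))  # structurally identical node
assert a is b  # one DAG node per SID
```

Because `put()` returns the canonical node, callers can compare DAG subtrees with plain `is`, which is what makes structural comparisons cheap.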
---
## C. Materialization Cache (formalized)
You already have this — here's the final clean version:
```python
# materialization_cache.py
from typing import Dict, Optional


class MaterializationCache:
    """
    Fully redundant cache: SID -> CID
    """

    def __init__(self):
        self._cache: Dict[str, str] = {}

    def get(self, sid: str) -> Optional[str]:
        return self._cache.get(sid)

    def put(self, sid: str, cid: str):
        self._cache[sid] = cid
```
---
# 4⃣ How `put()` actually works (step-by-step)
This is the **critical flow**.
### `put(artifact)`
```text
1. StructuralIndex.put(artifact)
└─ ensures DAG node exists
2. Try MaterializationCache.get(artifact.sid)
└─ if hit → CID known, skip computation
3. If miss:
a) Materialize DAG
b) Compute CID
c) Cache SID → CID
4. ContentIndex.get(CID)
└─ if exists → done (dedup)
5. Else:
a) Allocate block space
b) Write bytes
c) ContentIndex.put(CID → location)
```
**Key property**:
> If *any* index already knows the answer, work stops early.
That's how you recover classical performance.
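The five steps above can be wired together in one sketch. Everything here is illustrative: `materialize` is a hypothetical hook you would supply, the CID is taken to be a SHA-256 of the materialized bytes (an assumption, not your spec), and the three indexes are condensed to plain dicts so the snippet runs standalone:

```python
import hashlib
from typing import Callable, Dict, Tuple

# Condensed stand-ins for the three indexes plus a toy block store.
structural: Dict[str, object] = {}             # SID -> DAG node
mat_cache: Dict[str, str] = {}                 # SID -> CID
content: Dict[str, Tuple[str, int, int]] = {}  # CID -> (block, offset, len)
block = bytearray()                            # append-only byte store


def put(sid: str, node: object, materialize: Callable[[object], bytes]) -> str:
    # 1. StructuralIndex.put: ensure the DAG node exists.
    structural.setdefault(sid, node)
    # 2. MaterializationCache.get: a hit means the CID is already known
    #    and (in this sketch) the bytes were already written.
    cid = mat_cache.get(sid)
    if cid is None:
        # 3. Miss: materialize the DAG, compute the CID, cache SID -> CID.
        data = materialize(node)
        cid = hashlib.sha256(data).hexdigest()
        mat_cache[sid] = cid
        # 4. ContentIndex lookup: identical content is stored only once.
        if cid not in content:
            # 5. Allocate block space, write bytes, index the location.
            offset = len(block)
            block.extend(data)
            content[cid] = ("block-0", offset, len(data))
    return cid


calls = []

def mat(node):
    calls.append(node)
    return b"hello"

c1 = put("sid-1", {"op": "emit"}, mat)
c2 = put("sid-1", {"op": "emit"}, mat)  # cache hit: no rematerialization
assert c1 == c2 and len(calls) == 1
```

The second call never touches `mat`: the cache answers at step 2 and work stops early, which is the early-exit property the flow is built around.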
---
# 5⃣ Why this works for q-bits without cost
Because:
* q-bits only affect **step 3**
* steps 1, 4, 5 are identical to classical bytes
Once materialized:
* q-bit artifacts are indistinguishable from classical artifacts
* index sees only CIDs and byte ranges
No tax after first computation.
---
# 6⃣ What you now have (this is important)
You now have:
| Layer | Capability |
| ---------------- | ------------------------ |
| DAG | Explanation / provenance |
| SID | Structural identity |
| CID | Semantic identity |
| Content index | Storage truth |
| Cache | Performance |
| Snapshots + logs | Time |
This is **not common infrastructure**.
That's why you couldn't find it pre-built.
---
## If you want next
I can:
* add **snapshot + append-only log semantics** to `ContentIndex`
* wire these indexes into a single `ASLStore`
* show **federation / replication** mechanics
* show **LLM → DAG → CID** ingestion
* formalize **PEL traversal queries**
Just point the direction.