amduat-api/notes/Runtime Execution Semantics & Memory Layout for SIMD-Shard Acceleration.md

This is a formal draft of the **runtime execution semantics and memory layout** for SIMD/shard acceleration of the unified execution plan over ENC-ASL-TGK-INDEX. It focuses on **parallel, deterministic, and snapshot-safe execution**.

---

# Runtime Execution Semantics & Memory Layout for SIMD/Shard Acceleration

---

## 1. Purpose

This specification defines:

* How operators in an execution plan are executed in memory
* How shards, SIMD, and filters are applied efficiently
* Determinism guarantees per snapshot
* Memory layout for index scans, filter evaluation, and traversal expansion

It is fully compatible with:

* **ENC-ASL-TGK-INDEX**
* **Merged ASL + TGK query execution plan**
* **C-struct operator definitions**

---

## 2. Memory Layout Principles

1. **Immutable segments**: Index segments are **read-only** during execution
2. **Shard-local buffers**: Each shard stores a segment of records in contiguous memory
3. **SIMD key arrays**: Routing keys, type tags, and edge type keys are stored in contiguous SIMD-aligned arrays for fast vectorized evaluation
4. **Canonical references**: artifact IDs and TGK edge IDs are stored in 64-bit aligned arrays for deterministic access
5. **Traversal buffers**: TGK traversal outputs are stored in logseq-sorted buffers to preserve determinism
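
To make principle 3 concrete, here is a minimal sketch of allocating a SIMD-aligned key array with C11 `aligned_alloc`. The 32-byte alignment (matching AVX2 register width) and the helper name `alloc_simd_u32` are assumptions for illustration, not part of the specification.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIMD_ALIGN 32u  /* assumption: 32-byte alignment, as used by AVX2 loads */

/* Hypothetical helper: allocate a zeroed uint32_t array whose start address
 * is SIMD_ALIGN-aligned and whose byte size is padded up to a multiple of
 * SIMD_ALIGN, so a full-width vector load never reads past the allocation. */
static uint32_t *alloc_simd_u32(size_t count)
{
    size_t bytes  = count * sizeof(uint32_t);
    size_t padded = (bytes + SIMD_ALIGN - 1) / SIMD_ALIGN * SIMD_ALIGN;
    if (padded == 0)
        padded = SIMD_ALIGN;
    /* C11 aligned_alloc requires the size to be a multiple of the alignment. */
    uint32_t *p = aligned_alloc(SIMD_ALIGN, padded);
    if (p != NULL)
        memset(p, 0, padded);
    return p;
}
```

Padding the allocation lets a vectorized loop process the final partial group of records without a separate bounds check; the caller releases the array with `free`.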

---

## 3. Segment Loading and Sharding

* Each index segment is **assigned to a shard** based on routing key hash
* Segment header is mapped into memory; record arrays are memory-mapped if needed
* For ASL artifacts:

```c
struct shard_asl_segment {
    uint64_t *artifact_ids;       // 64-bit canonical IDs
    uint32_t *type_tags;          // optional type tags
    uint8_t  *has_type_tag;       // flags
    uint64_t record_count;
};
```

* For TGK edges:

```c
struct shard_tgk_segment {
    uint64_t *tgk_edge_ids;       // canonical TGK-CORE references
    uint32_t *edge_type_keys;
    uint8_t  *has_edge_type;
    uint8_t  *roles;              // from/to/both
    uint64_t record_count;
};
```

* **Shard-local buffers** allow **parallel SIMD evaluation** without inter-shard contention
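
The shard assignment described above could look roughly like the following. The specification does not name the routing-key hash, so the splitmix64 finalizer used here is a stand-in, and `shard_for_key` is a hypothetical helper name.

```c
#include <stdint.h>

/* Assumption: the splitmix64 finalizer stands in for whatever routing-key
 * hash the index actually uses. It is a pure function of its input. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Hypothetical helper: map a routing key to one of shard_count shards.
 * shard_count must be non-zero. The same key always maps to the same shard,
 * which is what makes per-shard processing reproducible across runs. */
static uint32_t shard_for_key(uint64_t routing_key, uint32_t shard_count)
{
    return (uint32_t)(mix64(routing_key) % shard_count);
}
```

Because the mapping depends only on the key and the shard count, every node that uses the same shard count places a given segment in the same shard.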

---

## 4. SIMD-Accelerated Filter Evaluation

* SIMD applies vectorized comparison of:

  * Artifact type tags
  * Edge type keys
  * Routing keys (pre-hashed)
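* Example sketch (AVX2 intrinsics, eight 32-bit type tags per step):

```c
// Sketch with AVX2 intrinsics (<immintrin.h>); 8 x 32-bit type tags per step.
__m256i want = _mm256_set1_epi32((int)type_tag_filter);
uint64_t i = 0;
for (; i + 8 <= record_count; i += 8) {
    __m256i tags = _mm256_loadu_si256((const __m256i *)&type_tags[i]);
    __m256i eq   = _mm256_cmpeq_epi32(tags, want);
    // Bit k of the mask is set when record i + k passes the filter.
    pass_mask[i / 8] = (uint8_t)_mm256_movemask_ps(_mm256_castsi256_ps(eq));
}
// Records i..record_count-1 (fewer than 8) are handled by a scalar tail loop.
```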

* Determinism is guaranteed because filtering **preserves the original record order** (logseq ascending, with canonical ID as tie-breaker); the filter only drops records and never reorders them
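
As a portable reference for the filter semantics, the following scalar sketch selects matching records while keeping their original order. It is not the vectorized path itself; the helper name `filter_type_tag` and the index-list output format are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: write the indices of records whose type tag equals
 * `wanted` into out_idx, in ascending index order, and return how many
 * matched. Records without a type tag (has_type_tag[i] == 0) never match.
 * out_idx must have room for record_count entries. */
static uint64_t filter_type_tag(const uint32_t *type_tags,
                                const uint8_t *has_type_tag,
                                uint64_t record_count,
                                uint32_t wanted,
                                uint64_t *out_idx)
{
    uint64_t n = 0;
    for (uint64_t i = 0; i < record_count; i++) {
        /* Branch-free form: always write the candidate index, but advance
         * the output cursor only when the record matches. */
        out_idx[n] = i;
        n += (uint64_t)(has_type_tag[i] != 0 && type_tags[i] == wanted);
    }
    return n;
}
```

Since indices are emitted in scan order, the output inherits whatever ordering the segment already has, which is the property the determinism argument relies on.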

---

## 5. Traversal Buffer Semantics (TGK)

* TGKTraversal operator maintains:

```c
struct tgk_traversal_buffer {
    uint64_t *edge_ids;        // expanded edges
    uint64_t *node_ids;        // corresponding nodes
    uint32_t  depth;           // current traversal depth
    uint64_t  count;           // number of records in buffer
};
```

* Buffers are **logseq-sorted per depth** to preserve deterministic traversal
* Optional **per-shard buffers** for parallel traversal

---

## 6. Merge Operator Semantics

* Merges **multiple shard-local streams**:

```c
struct merge_buffer {
    uint64_t *artifact_ids;
    uint64_t *tgk_edge_ids;
    uint32_t *type_tags;
    uint8_t  *roles;
    uint64_t  count;
};
```

* Merge algorithm: **deterministic heap merge**

  1. Compare `logseq` ascending
  2. Tie-break with canonical ID

* Ensures same output regardless of shard execution order
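
The ordering rule can be sketched as a small k-way merge. For brevity this version picks the minimum head with a linear scan over the streams rather than a heap; the comparison (logseq, then canonical ID) is the same one a heap would use. The `rec` and `stream` types and the function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical record and stream types; only the ordering keys are shown. */
struct rec    { uint64_t logseq; uint64_t canonical_id; };
struct stream { const struct rec *recs; uint64_t count; uint64_t pos; };

/* Total order used by the merge: logseq ascending, then canonical ID. */
static int rec_less(const struct rec *a, const struct rec *b)
{
    if (a->logseq != b->logseq)
        return a->logseq < b->logseq;
    return a->canonical_id < b->canonical_id;
}

/* Merge already-sorted shard streams into `out` and return the number of
 * records written. `out` must have room for the sum of all stream counts. */
static uint64_t merge_streams(struct stream *streams, uint32_t stream_count,
                              struct rec *out)
{
    uint64_t n = 0;
    for (;;) {
        int best = -1;
        for (uint32_t s = 0; s < stream_count; s++) {
            if (streams[s].pos == streams[s].count)
                continue;  /* this stream is exhausted */
            if (best < 0 ||
                rec_less(&streams[s].recs[streams[s].pos],
                         &streams[best].recs[streams[best].pos]))
                best = (int)s;
        }
        if (best < 0)
            break;  /* all streams exhausted */
        out[n++] = streams[best].recs[streams[best].pos++];
    }
    return n;
}
```

Because the output position of each record depends only on its (logseq, canonical ID) key, passing the shard streams in a different order produces the same merged sequence.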

---

## 7. Tombstone Shadowing

* Shadowing is **applied post-merge**:

```c
struct tombstone_state {
    uint64_t canonical_id;
    uint64_t max_logseq_seen;
    uint8_t  is_tombstoned;
};
```

* Algorithm:

1. Iterate merged buffer
2. For each canonical ID, keep only **latest logseq ≤ snapshot**
3. Drop tombstoned or overridden entries

* Deterministic and **snapshot-safe**
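
A minimal sketch of the shadowing pass, assuming the merged buffer is sorted by (logseq, canonical ID) and that each record carries a tombstone flag. The `merged_rec` layout and the function name are hypothetical. The quadratic look-ahead keeps the example short; a per-ID table such as `tombstone_state` would replace it in practice.

```c
#include <stdint.h>

/* Hypothetical merged record; the spec tracks these fields per canonical ID. */
struct merged_rec {
    uint64_t canonical_id;
    uint64_t logseq;
    uint8_t  is_tombstone;
};

/* Compact `recs` in place so that, for each canonical ID, only the latest
 * record with logseq <= snapshot_logseq survives, and only if that record is
 * not a tombstone. Input must be sorted by logseq ascending. Survivors keep
 * their relative order. Returns the new record count. */
static uint64_t apply_shadowing(struct merged_rec *recs, uint64_t count,
                                uint64_t snapshot_logseq)
{
    uint64_t n = 0;
    for (uint64_t i = 0; i < count; i++) {
        if (recs[i].logseq > snapshot_logseq)
            break;  /* sorted input: everything after this is invisible */
        int overridden = 0;
        /* A later visible record with the same ID shadows this one. */
        for (uint64_t j = i + 1;
             j < count && recs[j].logseq <= snapshot_logseq; j++) {
            if (recs[j].canonical_id == recs[i].canonical_id) {
                overridden = 1;
                break;
            }
        }
        if (!overridden && !recs[i].is_tombstone)
            recs[n++] = recs[i];  /* n <= i, so unread input is never clobbered */
    }
    return n;
}
```

Records written after the snapshot never influence the result, so a tombstone with a logseq beyond the snapshot leaves the earlier visible record intact.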

---

## 8. Traversal Expansion with SIMD & Shards

* Input: TGK edge buffer, shard-local nodes
* Steps:

1. **Filter edges** using SIMD (type, role)
2. **Expand edges** to downstream nodes
3. **Append results** to depth-sorted buffer
4. Repeat for each further level, up to the requested traversal depth `d`, using the nodes from the previous level as input
5. Maintain deterministic order:

   * logseq ascending
   * canonical edge ID tie-breaker
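
The steps above can be sketched as a single-level expansion that the caller repeats once per depth. The `edge` record layout (in particular the `from_node`, `to_node`, and `logseq` fields) and the function names are assumptions, and the frontier lookup is a plain linear search to keep the example short.

```c
#include <stdint.h>

/* Hypothetical edge record; the spec only fixes the ID and type-key arrays. */
struct edge {
    uint64_t edge_id;
    uint64_t logseq;
    uint64_t from_node;
    uint64_t to_node;
    uint32_t edge_type_key;
};

/* Linear membership test over the current frontier. */
static int in_set(const uint64_t *set, uint64_t n, uint64_t v)
{
    for (uint64_t i = 0; i < n; i++)
        if (set[i] == v)
            return 1;
    return 0;
}

/* Expand one depth level: for every edge of the wanted type whose source is
 * in `frontier`, append the edge ID and its target node to the outputs.
 * `edges` must already be sorted by (logseq, edge_id); appending in scan
 * order keeps the outputs in that same order. Returns the number appended. */
static uint64_t expand_level(const struct edge *edges, uint64_t edge_count,
                             const uint64_t *frontier, uint64_t frontier_count,
                             uint32_t wanted_type,
                             uint64_t *out_edge_ids, uint64_t *out_node_ids)
{
    uint64_t n = 0;
    for (uint64_t i = 0; i < edge_count; i++) {
        if (edges[i].edge_type_key != wanted_type)
            continue;  /* step 1: type filter */
        if (!in_set(frontier, frontier_count, edges[i].from_node))
            continue;
        out_edge_ids[n] = edges[i].edge_id;   /* step 2: expand */
        out_node_ids[n] = edges[i].to_node;   /* step 3: append */
        n++;
    }
    return n;
}
```

To traverse to depth `d`, the caller feeds the `out_node_ids` of one call in as the `frontier` of the next, `d` times in total.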

---

## 9. Projection & Aggregation Buffers

* Output buffer for projection:

```c
struct projection_buffer {
    uint64_t *artifact_ids;
    uint64_t *tgk_edge_ids;
    uint64_t *node_ids;
    uint32_t *type_tags;
    uint64_t  count;
};
```

* Aggregation performed **in-place** or into **small accumulator structures**:

```c
struct aggregation_accumulator {
    uint64_t count;
    uint64_t sum_type_tag;
    // additional aggregates as needed
};
```

* Deterministic due to **logseq + canonical ID ordering**
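
A short sketch of how the accumulator might be filled per shard and then combined. The function names `accumulate` and `combine` are hypothetical. For the two aggregates shown, unsigned integer addition is associative and commutative, so the combined result does not depend on the order in which shards are folded in.

```c
#include <stdint.h>

struct aggregation_accumulator {
    uint64_t count;
    uint64_t sum_type_tag;
};

/* Hypothetical helper: fold `count` type tags into one accumulator. */
static void accumulate(struct aggregation_accumulator *acc,
                       const uint32_t *type_tags, uint64_t count)
{
    for (uint64_t i = 0; i < count; i++) {
        acc->count += 1;
        acc->sum_type_tag += type_tags[i];
    }
}

/* Hypothetical helper: merge a per-shard accumulator into a global one. */
static void combine(struct aggregation_accumulator *dst,
                    const struct aggregation_accumulator *src)
{
    dst->count += src->count;
    dst->sum_type_tag += src->sum_type_tag;
}
```

Aggregates that are sensitive to evaluation order (floating-point sums, for example) would still need the logseq + canonical ID ordering to stay reproducible.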

---

## 10. Parallelism and SIMD Determinism

* **Shard-local parallelism** allowed

* **SIMD vectorization** allowed

* Global determinism ensured by:

  1. Per-shard deterministic processing
  2. Deterministic merge of shards
  3. Shadowing/tombstone application post-merge
  4. Logseq + canonical ID ordering preserved

* This guarantees **identical results across runs and nodes**

---

## 11. Summary of Runtime Buffers

| Buffer                    | Contents                  | Purpose                           |
| ------------------------- | ------------------------- | --------------------------------- |
| `shard_asl_segment`       | artifact IDs, type tags   | parallel scan/filter              |
| `shard_tgk_segment`       | edge IDs, edge type, role | parallel scan/filter              |
| `tgk_traversal_buffer`    | expanded edges/nodes      | DAG traversal                     |
| `merge_buffer`            | merged records            | deterministic merge across shards |
| `projection_buffer`       | projected fields          | final output                      |
| `aggregation_accumulator` | counts/sums               | deterministic aggregation         |
| `tombstone_state`         | canonical ID + logseq     | shadowing                         |

* All buffers are **snapshot-bound and immutable where possible**
* SIMD/shard acceleration does not affect determinism

---

This completes a **full runtime execution model**, including **memory layout, SIMD/shard acceleration, traversal buffers, merge logic, tombstone handling, and projection/aggregation**.

---

Next steps could be:

1. **Formal operator pseudocode in C**, implementing SIMD and shard logic
2. **Memory-mapped index access routines** for ENC-ASL-TGK-INDEX
3. **Integration with PEL deterministic DAG execution receipts**
