Carl Niklas Rydberg a4932b1217 PEL/TRACE-DAG: wire exec_result_ref + node failure diagnostics

Persist pre-trace ExecutionResult to embed exec_result_ref in traces
Capture node-level runtime diagnostics and clone into trace artifacts
Clarify trace spec for pre-trace result linkage
Add tests for exec_result_ref and node-failure diagnostics

2025-12-22 11:16:23 +01:00


PEL/TRACE-DAG/1 — DAG Execution Trace Profile

Status: Approved
Owner: Niklas Rydberg
Version: 0.2.1
SoT: Yes
Last Updated: 2025-11-16
Linked Phase Pack: N/A
Tags: [execution, traceability]

Document ID: PEL/TRACE-DAG/1
Layer: L1 Scheme Trace Profile (on top of PEL/1-CORE + PEL/PROGRAM-DAG/1)

Depends on (normative):

ASL/1-CORE v0.4.x — value model (Artifact, Reference, TypeTag, integers, OctetString)
PEL/1-CORE v0.3.x — primitive execution layer (ExecutionStatus, ExecutionErrorSummary, diagnostics)
PEL/PROGRAM-DAG/1 v0.3.1 — DAG Program scheme (Program, Node, NodeId, canonical topological order)

Integrates with (informative):

PEL/1-SURF v0.2.x — store-backed execution surface
ENC/PEL-TRACE-DAG/1 (planned) — canonical encoding for DAG traces
TGK/1-CORE — trace graph kernel (execution edges)
ENC/ASL1-CORE v1.0.x — canonical Artifact encoding
HASH/ASL1 v0.2.4 — ASL1 hash family

License

Except where otherwise noted, this document (text and diagrams) is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

The identifier registries and mapping tables (e.g. TypeTag IDs, HashId assignments, EdgeTypeId tables) are additionally made available under CC0 1.0 Universal (CC0) to enable unrestricted reuse in implementations and derivative specifications.

Code examples in this document are provided under the Apache License 2.0 unless explicitly stated otherwise. Test vectors, where present, are dedicated to the public domain under CC0 1.0.

0. Overview

PEL/TRACE-DAG/1 defines a standard trace value for executions of the PEL/PROGRAM-DAG/1 scheme:

It records per-node status and outputs in canonical node order.
It links the trace to:
- the Program Artifact (program_ref),
- the inputs and optional params used by the run,
- the overall scheme and (optionally) the ExecutionResult Artifact.

The trace is represented as a single ASL/1 Artifact (a TraceArtifact) whose payload is a TraceDAGValue (defined below) and whose TypeTag is dedicated to this profile.

The trace is:

Optional — some deployments may choose not to record it.
Deterministic — for fixed inputs and Program, all conformant implementations produce identical traces.
Graph-friendly — TGK/1 can interpret it into node-level execution edges for provenance.

Binary layout and TypeTag assignment are defined in a companion encoding profile (ENC/PEL-TRACE-DAG/1); this document specifies only the logical model and semantics.

1. Purpose & Non-Goals

1.1 Purpose

The goals of PEL/TRACE-DAG/1 are to:

Provide a canonical node-level trace for PEL/PROGRAM-DAG/1 runs.
Capture for each Node:
- whether it was executed,
- whether it succeeded or failed,
- the References of any Artifacts it produced,
- deterministic diagnostic entries.
Link node-level information to:
- the Program that defined the DAG,
- the input and params Artifacts,
- the run-level ExecutionStatus / ExecutionErrorSummary.

This trace enables:

reconstruction of per-node execution edges in TGK/1,
human and machine debugging of runs,
post-hoc provenance analysis and selective re-execution.

1.2 Non-Goals

This profile does not define:

Graph/provenance edges themselves — TGK/1 defines how traces become edges.
Concrete binary encodings — these belong to ENC/PEL-TRACE-DAG/1.
How and when traces are enabled — that is a policy/configuration decision.
Store interactions — those belong to ASL/1-STORE and PEL/1-SURF.
Semantics of operations (OperationId), parameter schemas, or error codes — those belong to operation registries (e.g. OPREG/PEL1-KERNEL and extensions).

2. Context and Layering

2.1 Relationship to PEL/PROGRAM-DAG/1

PEL/PROGRAM-DAG/1 defines:

The Program data model (Program, Node, NodeId, RootRef).
Structural validity rules (unique NodeIds, DAG constraint, etc.).
A canonical topological order of Nodes.

PEL/TRACE-DAG/1 assumes:

The run was performed against a particular Program Artifact (program_ref).
The Program is structurally valid under PEL/PROGRAM-DAG/1.
Node evaluation order is the scheme’s canonical topological order.

This profile adds:

A TraceDAGValue expressed in terms of that Program’s NodeIds and operations.

2.2 Relationship to PEL/1-CORE and PEL/1-SURF

At the PEL/1 level, a scheme s is bound to a pure function:

Exec_s(
  program: Artifact,
  inputs:  list<Artifact>,
  params:  optional Artifact
) -> (outputs: list<Artifact>, result: ExecutionResultValue)

For the DAG scheme (PEL/PROGRAM-DAG/1):

Exec_DAG plays that role.

PEL/TRACE-DAG/1 sits just above:

It defines the shape of a TraceDAGValue that can be produced in addition to the outputs and ExecutionResultValue.
A PEL/1 surface (e.g. PEL/1-SURF) that implements this profile:
- persists this trace as an Artifact with a dedicated TypeTag,
- obtains its Reference via ASL/1-STORE.put,
- exposes that Reference as trace_ref in its surface-level ExecutionResult Artifact.

If Exec_DAG is not invoked at all (e.g. due to store-level failures handled in PEL/1-SURF), then PEL/TRACE-DAG/1 is simply not applied: no TraceDAGValue and no trace Artifact are produced for that run.

2.3 Relationship to TGK/1

TGK/1 can treat a TraceDAGValue as:

A per-node execution log, keyed by program_ref and NodeId.
A list of produced Artifacts (output_refs per node).
A basis for edges like:
- “node N (operation O) produced artifact A by reading B, C, …”,
- “this run (ExecutionResult) used Program P and generated outputs X, Y, …”.

This document does not prescribe exact edge types; it only ensures traces are structured enough for TGK/1 to derive such edges deterministically.
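
As an illustration only, the following Python sketch shows how a consumer might flatten a trace into "node produced artifact" tuples. The tuple shape and the names `NodeTrace` and `produced_edges` are hypothetical and are not TGK/1 edge types; References are reduced to `(hash_id, digest)` pairs. Recovering the "by reading B, C, …" part would additionally require the Program's Node inputs, which the trace does not duplicate.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Simplified stand-in for an ASL/1 Reference: (hash_id, digest).
Reference = Tuple[int, bytes]


@dataclass
class NodeTrace:
    """Subset of NodeTraceDAG needed for this sketch."""
    node_id: int
    op_name: str
    op_version: int
    output_refs: List[Reference] = field(default_factory=list)


def produced_edges(program_ref: Reference, node_traces: List[NodeTrace]):
    """Return one tuple per produced output, in trace order.

    Tuple layout (hypothetical):
    (program_ref, node_id, op_name, op_version, output_index, output_ref)
    """
    edges = []
    for trace in node_traces:
        for index, ref in enumerate(trace.output_refs):
            edges.append(
                (program_ref, trace.node_id, trace.op_name,
                 trace.op_version, index, ref)
            )
    return edges
```

Because node_traces and output_refs are both ordered, the resulting list is itself deterministic for a given trace.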

3. Data Model

3.1 Reused Types

From ASL/1-CORE:

Reference {
  hash_id: HashId
  digest:  OctetString
}

HashId = uint16

From PEL/1-CORE:

ExecutionStatus = uint8
ExecutionErrorKind = uint8

ExecutionErrorSummary {
  kind:        ExecutionErrorKind
  status_code: uint32
}

PEL/1-CORE defines the shared meanings of:

ExecutionStatus {
  OK                 = 0
  SCHEME_UNSUPPORTED = 1
  INVALID_PROGRAM    = 2
  INVALID_INPUTS     = 3
  RUNTIME_FAILED     = 4
}

ExecutionErrorKind {
  NONE    = 0
  SCHEME  = 1
  PROGRAM = 2
  INPUTS  = 3
  RUNTIME = 4
}

From PEL/PROGRAM-DAG/1:

NodeId = uint32

Program { nodes: list<Node>; roots: list<RootRef>; }

Node {
  id:     NodeId
  op:     OperationId
  inputs: list<DagInput>
  params: ParamsBytes
}

OperationId {
  name:    string   // logical UTF-8 name
  version: uint32
}

The encoding of string and ParamsBytes is defined by ENC/PEL-PROGRAM-DAG/1.

3.2 DiagnosticEntry

For trace-level diagnostics, this profile reuses the generic diagnostic shape used in PEL/1:

DiagnosticEntry {
  code:    uint32       // scheme- or op-specific diagnostic code
  message: OctetString // often UTF-8 text; interpretation is profile-specific
}

Requirements:

code is intended for machine use; message is for human-readable information.
Both MUST be deterministic for a given run.

3.3 NodeTraceStatus

Node-level status distinguishes whether the node:

ran and succeeded,
ran and failed at runtime,
did not run due to an earlier failure.

NodeTraceStatus = uint8

NodeTraceStatus {
  NODE_OK       = 0   // Node executed successfully
  NODE_FAILED   = 1   // Node executed and failed (runtime failure)
  NODE_SKIPPED  = 2   // Node was not executed due to earlier run-level failure
}

Invariants per status:

NODE_OK:
- The node’s operation executed successfully.
- status_code MUST be 0.
- output_refs MAY be empty or non-empty, depending on the operation.
NODE_FAILED:
- The node’s operation was invoked and returned a runtime error.
- status_code MUST be non-zero.
- output_refs MUST be empty (this profile treats failing nodes as producing no durable outputs).
NODE_SKIPPED:
- The node was not executed because the run terminated early (RUNTIME_FAILED, INVALID_INPUTS, or INVALID_PROGRAM).
- status_code MUST be 0.
- output_refs MUST be empty.
- diagnostics MAY contain a short deterministic explanation, but SHOULD be empty in minimal profiles.
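
The per-status invariants above can be sketched as a small checker. This is a minimal Python illustration, not a normative validator: the `NodeTrace` class is a reduced stand-in for NodeTraceDAG, output References are plain bytes, and the function name `invariant_violations` is hypothetical.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class NodeTraceStatus(IntEnum):
    NODE_OK = 0
    NODE_FAILED = 1
    NODE_SKIPPED = 2


@dataclass
class NodeTrace:
    """Reduced stand-in for NodeTraceDAG (only the fields checked here)."""
    status: NodeTraceStatus
    status_code: int = 0
    output_refs: List[bytes] = field(default_factory=list)


def invariant_violations(trace: NodeTrace) -> List[str]:
    """Return a list of violated invariants; empty means the entry is valid."""
    problems = []
    if trace.status == NodeTraceStatus.NODE_FAILED:
        if trace.status_code == 0:
            problems.append("NODE_FAILED requires non-zero status_code")
        if trace.output_refs:
            problems.append("NODE_FAILED requires empty output_refs")
    else:
        # NODE_OK and NODE_SKIPPED both require status_code 0.
        if trace.status_code != 0:
            problems.append(f"{trace.status.name} requires status_code 0")
        # Only NODE_OK may carry outputs.
        if trace.status == NodeTraceStatus.NODE_SKIPPED and trace.output_refs:
            problems.append("NODE_SKIPPED requires empty output_refs")
    return problems
```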

Relation to run-level ExecutionStatus:

If the run-level status = OK, then every NodeTraceDAG.status MUST be NODE_OK.
If any NodeTraceDAG.status = NODE_FAILED, then the run-level status MUST be RUNTIME_FAILED.
For status ∈ { INVALID_PROGRAM, INVALID_INPUTS }, node statuses MAY be NODE_OK (for nodes that executed before the invalid condition was detected) or NODE_SKIPPED, but MUST NOT be NODE_FAILED (runtime failures always map to RUNTIME_FAILED).
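
A minimal Python sketch of the three rules above, assuming node statuses are supplied as a sequence in trace order. The function name `statuses_consistent` is hypothetical. The third rule (no NODE_FAILED under INVALID_PROGRAM or INVALID_INPUTS) follows from the second, so the code does not test it separately.

```python
from enum import IntEnum
from typing import Sequence


class ExecutionStatus(IntEnum):
    OK = 0
    SCHEME_UNSUPPORTED = 1
    INVALID_PROGRAM = 2
    INVALID_INPUTS = 3
    RUNTIME_FAILED = 4


class NodeTraceStatus(IntEnum):
    NODE_OK = 0
    NODE_FAILED = 1
    NODE_SKIPPED = 2


def statuses_consistent(
    run_status: ExecutionStatus,
    node_statuses: Sequence[NodeTraceStatus],
) -> bool:
    """Check node-level statuses against the run-level status."""
    if run_status == ExecutionStatus.OK:
        # Rule 1: a successful run has only NODE_OK entries.
        return all(s == NodeTraceStatus.NODE_OK for s in node_statuses)
    if NodeTraceStatus.NODE_FAILED in node_statuses:
        # Rule 2: any failed node forces RUNTIME_FAILED at run level.
        return run_status == ExecutionStatus.RUNTIME_FAILED
    # Remaining cases: only NODE_OK / NODE_SKIPPED under a non-OK status.
    return True
```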

3.4 NodeTraceDAG

A node-level trace entry:

NodeTraceDAG {
  node_id:     NodeId
  op_name:     string    // duplicate of Program.nodes[i].op.name
  op_version:  uint32    // duplicate of Program.nodes[i].op.version

  status:      NodeTraceStatus
  status_code: uint32    // 0 = success; non-zero = op-specific failure code for NODE_FAILED

  output_refs: list<Reference>      // artifacts produced by this node (if any) in store space
  diagnostics: list<DiagnosticEntry>
}

Requirements:

node_id MUST correspond to some Node.id in the Program identified by program_ref.
op_name / op_version MUST match the OperationId of that Node as decoded from the Program Artifact. They are a denormalized copy so traces remain understandable even if the Program is not immediately available.
output_refs:
- MUST list, in order, the References obtained when persisting this node’s output Artifacts (via ASL/1-STORE.put) in the Store used for this run.
- MUST be empty if the node failed or was skipped (§3.3); MAY be empty for NODE_OK if the operation logically has no outputs.
diagnostics:
- For NODE_OK, MAY be empty or contain non-fatal diagnostics (e.g. warnings) if the operation registry defines them.
- For NODE_FAILED, SHOULD contain at least one entry describing the failure.
- For NODE_SKIPPED, MAY be empty or contain a short deterministic explanation.

3.5 TraceDAGValue

The top-level trace for a DAG run:

TraceDAGValue {
  pel1_version:  uint16          // MUST be 1 for this version
  scheme_ref:    Reference       // MUST be SchemeRef_DAG_1
  program_ref:   Reference       // Program Artifact used by this run

  status:        ExecutionStatus
  summary:       ExecutionErrorSummary

  exec_result_ref: optional Reference   // Reference to surface ExecutionResult Artifact (if available)

  input_refs:    list<Reference>        // in the same order as Exec_DAG.inputs
  params_ref:    optional Reference     // params Artifact used for this run, if any

  node_traces:   list<NodeTraceDAG>     // node-level traces in canonical node order
}

Constraints:

pel1_version:
- MUST be 1 for traces produced under this version of the spec.
scheme_ref:
- MUST equal SchemeRef_DAG_1 for this scheme.
program_ref:
- MUST be the Reference of the Program Artifact passed to the run.
status and summary:
- MUST match the ExecutionStatus and ExecutionErrorSummary of the run-level ExecutionResultValue produced by Exec_DAG.
- summary.kind and summary.status_code MUST obey the mapping defined for PEL/PROGRAM-DAG/1.
exec_result_ref:
- If the run produced a surface-level ExecutionResult Artifact (as in PEL/1-SURF), this SHOULD be its Reference.
- If no such Artifact exists or it is not persisted, exec_result_ref MUST be absent.
- If a surface persists an ExecutionResult Artifact that includes trace_ref, it MAY still set exec_result_ref to a distinct pre-trace ExecutionResult Artifact for the same run to avoid a circular dependency between Artifacts. In that case, the surface ExecutionResult Artifact that carries trace_ref is the canonical surface result for that run, while exec_result_ref exists solely to link the trace back to an execution result.
input_refs:
- MUST be the list of References corresponding to the inputs passed to Exec_DAG, in the same order.
params_ref:
- If a params Artifact was provided for the run, MUST be that Reference.
- If no params Artifact was provided (or the scheme ignores params), MUST be absent.
node_traces:
- If the Program was successfully decoded and structurally valid and at least one Node was attempted, node_traces MUST contain:
  - exactly one NodeTraceDAG per Node in the Program, and
  - in the canonical node order defined by PEL/PROGRAM-DAG/1.
- For runs that fail before any Node is attempted (e.g. INVALID_PROGRAM due to decode failure, or INVALID_INPUTS detected before the first Node), node_traces MUST be empty, because no node evaluation took place.
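
The node_traces coverage and ordering constraints, together with the op_name / op_version requirement from §3.4, can be sketched as follows. This Python fragment is illustrative only: it assumes the caller already has the Program's Nodes in canonical order as `(node_id, op_name, op_version)` tuples, and the name `check_node_traces` is hypothetical.

```python
from typing import List, Tuple

# (node_id, op_name, op_version)
NodeKey = Tuple[int, str, int]


def check_node_traces(
    program_nodes: List[NodeKey],   # Program Nodes in canonical order
    traced: List[NodeKey],          # same fields taken from node_traces
    any_node_attempted: bool,
) -> List[str]:
    """Return a list of constraint violations; empty means consistent."""
    if not any_node_attempted:
        if traced:
            return ["node_traces must be empty when no Node was attempted"]
        return []
    if len(traced) != len(program_nodes):
        return [f"expected {len(program_nodes)} entries, found {len(traced)}"]
    problems = []
    for index, (expected, actual) in enumerate(zip(program_nodes, traced)):
        if expected[0] != actual[0]:
            problems.append(
                f"entry {index}: node_id {actual[0]} out of canonical order"
            )
        elif expected[1:] != actual[1:]:
            problems.append(
                f"entry {index}: op_name/op_version differ from the Program"
            )
    return problems
```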

4. Trace Semantics for a Single Run

This section defines how a conformant engine or surface SHOULD construct TraceDAGValue for a single PEL/PROGRAM-DAG/1 run, assuming Exec_DAG was actually invoked.

We consider a run characterized by:

program_ref : Reference
program_artifact : Artifact (whose bytes decode to a Program under PEL/PROGRAM-DAG/1 + its encoding profile)
input_refs : list<Reference>
inputs : list<Artifact> (resolved from a Store)
params_ref : optional Reference
params_artifact : optional Artifact
status : ExecutionStatus
summary : ExecutionErrorSummary
Internal records of per-node results:
- node_outputs : NodeId -> list<Artifact> (for nodes that executed successfully),
- node_runtime_errors : NodeId -> (status_code, diagnostics) (for nodes that failed at runtime),
- and a notion of which Nodes were evaluated.

Implementations MAY organize their internal state differently; the following is conceptual.

4.1 Trace for successful runs (`status = OK`)

If status = OK:

Program and ordering
- The Program MUST be structurally valid (per PEL/PROGRAM-DAG/1).
- The engine MUST know the canonical node order (topological order with NodeId tie-breakers).

Node traces

For each Node in canonical order:

NodeTraceDAG {
  node_id     = Node.id
  op_name     = Node.op.name
  op_version  = Node.op.version
  status      = NODE_OK
  status_code = 0
  output_refs = [ R_0, ..., R_(k-1) ]   // refs from store.put on each output Artifact
  diagnostics = D                      // deterministic diagnostics, possibly empty
}

No nodes are NODE_FAILED or NODE_SKIPPED in a run with status = OK.

Top-level fields

TraceDAGValue MUST be:

pel1_version        = 1
scheme_ref          = SchemeRef_DAG_1
program_ref         = program_ref

status              = OK
summary.kind        = NONE
summary.status_code = 0

exec_result_ref     = execution_result_ref (if available, else absent)

input_refs          = input_refs (exact order used in run)
params_ref          = params_ref (if any)

node_traces         = [ NodeTraceDAG in canonical node order ]
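
The node-trace construction for a successful run can be sketched in Python as below. Several parts are assumptions made for illustration: the in-memory `store` dict, the `put` helper, SHA-256 as the digest, and the placeholder `HYPOTHETICAL_HASH_ID` are stand-ins, since real HashId assignments come from HASH/ASL1 and real References are computed over the canonical Artifact encoding rather than raw bytes.

```python
import hashlib
from typing import Dict, List, Tuple

Reference = Tuple[int, bytes]          # (hash_id, digest)
HYPOTHETICAL_HASH_ID = 0x0001          # placeholder, not a registered HashId
NODE_OK = 0


def put(store: Dict[Reference, bytes], artifact_bytes: bytes) -> Reference:
    """Toy stand-in for ASL/1-STORE.put: content-address raw bytes."""
    ref = (HYPOTHETICAL_HASH_ID, hashlib.sha256(artifact_bytes).digest())
    store[ref] = artifact_bytes
    return ref


def ok_node_traces(
    canonical_nodes: List[Tuple[int, str, int]],   # (id, op name, op version)
    node_outputs: Dict[int, List[bytes]],          # NodeId -> output bytes
    store: Dict[Reference, bytes],
) -> List[dict]:
    """Build one NODE_OK entry per Node, in canonical order."""
    traces = []
    for node_id, op_name, op_version in canonical_nodes:
        outputs = node_outputs.get(node_id, [])
        traces.append({
            "node_id": node_id,
            "op_name": op_name,
            "op_version": op_version,
            "status": NODE_OK,
            "status_code": 0,
            # Output order is preserved, so output_refs is deterministic.
            "output_refs": [put(store, out) for out in outputs],
            "diagnostics": [],
        })
    return traces
```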

4.2 Trace for runtime failures (`status = RUNTIME_FAILED`)

If status = RUNTIME_FAILED and at least one Node executed:

Nodes before the first failing node
- For Nodes that executed successfully before the failure:
  - Same as in §4.1: status = NODE_OK, status_code = 0, output_refs as persisted.

Failing node

For the first Node that failed at runtime:

NodeTraceDAG {
  node_id     = Node.id
  op_name     = Node.op.name
  op_version  = Node.op.version
  status      = NODE_FAILED
  status_code = runtime_status_code  // from operation semantics
  output_refs = []                   // no durable outputs recorded
  diagnostics = runtime_diagnostics  // deterministic list
}

Nodes after the failing node

For all Nodes that were not executed because the run terminated:

NodeTraceDAG {
  node_id     = Node.id
  op_name     = Node.op.name
  op_version  = Node.op.version
  status      = NODE_SKIPPED
  status_code = 0
  output_refs = []
  diagnostics = []   // MAY contain a short deterministic message, but SHOULD be empty
}

Top-level fields

TraceDAGValue MUST set:

status              = RUNTIME_FAILED
summary.kind        = RUNTIME
summary.status_code = status_code  // non-zero, matching ExecutionResultValue.summary.status_code

// other fields as in §4.1
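
A minimal Python sketch of the node-status assignment for a runtime failure, assuming Nodes are evaluated sequentially in canonical order and the run stops at the first failing Node. The function name `statuses_for_runtime_failure` is hypothetical, and only the status fields are modelled.

```python
from typing import List, Tuple

NODE_OK, NODE_FAILED, NODE_SKIPPED = 0, 1, 2


def statuses_for_runtime_failure(
    canonical_node_ids: List[int],
    failing_node_id: int,
    runtime_status_code: int,
) -> List[Tuple[int, int, int]]:
    """Return (node_id, status, status_code) triples in canonical order."""
    if runtime_status_code == 0:
        raise ValueError("a failing node must carry a non-zero status_code")
    if failing_node_id not in canonical_node_ids:
        raise ValueError("failing node is not part of the Program")

    result = []
    failure_seen = False
    for node_id in canonical_node_ids:
        if failure_seen:
            # Never executed because the run already terminated.
            result.append((node_id, NODE_SKIPPED, 0))
        elif node_id == failing_node_id:
            result.append((node_id, NODE_FAILED, runtime_status_code))
            failure_seen = True
        else:
            # Executed successfully before the failure.
            result.append((node_id, NODE_OK, 0))
    return result
```

For example, with canonical order `[10, 20, 30]` and Node 20 failing with code 7, Node 10 is NODE_OK, Node 20 is NODE_FAILED with status_code 7, and Node 30 is NODE_SKIPPED.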

4.3 Trace for invalid inputs / invalid program

For status = INVALID_INPUTS or status = INVALID_PROGRAM (as produced by Exec_DAG, not by store-level surfaces):

If no Node is executed at all:
- node_traces MUST be empty.
- status and summary MUST match the run-level ExecutionResultValue.
- It is RECOMMENDED to include at least one diagnostic (in summary or operation-level diagnostics) indicating the error (e.g., which input index was missing, or why the Program was invalid).
If the engine partially evaluates the Program before detecting an invalid condition (allowed only where consistent with PEL/PROGRAM-DAG/1):
- Node traces for executed Nodes MUST be recorded with status = NODE_OK (no runtime failure).
- Nodes that were never reached MUST be marked NODE_SKIPPED.
- NodeTraceDAG entries MUST NOT have status = NODE_FAILED in runs where status ∈ { INVALID_PROGRAM, INVALID_INPUTS }.

In all cases, the mapping from the precise validation failure to node-level traces MUST be deterministic for a given run and MUST be consistent with the run-level status and summary.

5. Determinism

5.1 Determinism contract

For fixed:

program_artifact.bytes,
input_refs + the contents of their referenced Artifacts in the Store,
params_ref + its Artifact contents (if present),
the same operation registries and parameter profiles for all OperationIds referenced in the Program,

then all conformant implementations that:

execute the run under the PEL/PROGRAM-DAG/1 scheme, and
produce a TraceDAGValue under PEL/TRACE-DAG/1

MUST produce identical TraceDAGValue logical values.

This implies:

Identical status and summary.
Identical input_refs, params_ref, and program_ref.
Identical node_traces sequence, with:
- same node_id, op_name, op_version,
- same NodeTraceStatus, status_code,
- identical output_refs (same References, same order),
- identical diagnostics.
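
When checking this contract between two implementations, it can help to locate the first field at which two traces diverge. The following Python helper is a hypothetical debugging aid, not part of this profile; it assumes each trace has been decoded into nested dicts and lists with string keys.

```python
from typing import Any, Optional


def first_difference(a: Any, b: Any, path: str = "trace") -> Optional[str]:
    """Return the path of the first differing field, or None if equal."""
    if isinstance(a, dict) and isinstance(b, dict):
        # Visit keys in sorted order so the report itself is deterministic.
        for key in sorted(set(a) | set(b)):
            if key not in a or key not in b:
                return f"{path}.{key}"
            diff = first_difference(a[key], b[key], f"{path}.{key}")
            if diff is not None:
                return diff
        return None
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return f"{path} (length {len(a)} != {len(b)})"
        for index, (x, y) in enumerate(zip(a, b)):
            diff = first_difference(x, y, f"{path}[{index}]")
            if diff is not None:
                return diff
        return None
    return None if a == b else path
```

A result of None corresponds to identical logical values; any other result names the field that breaks the determinism contract.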

5.2 No ambient environment

Trace construction MUST NOT depend on:

host clocks, random number generators, environment variables,
process IDs, thread IDs, or other non-deterministic identifiers,
non-deterministic logging or scheduling.

Any data that appears in diagnostics or status_code MUST be determined solely by:

the Program value,
input/params Artifacts,
deterministic operation semantics,
and deterministic scheme-level logic.

6. Interaction with PEL/1-SURF (Informative)

A PEL/1 surface that implements this profile typically proceeds as follows for runs where Exec_DAG is invoked:

Resolve inputs
- Use ASL/1-STORE.get to resolve program_ref, input_refs, and params_ref (if present).
- Handle Store-level errors per PEL/1-SURF (mapping to INVALID_INPUTS / INVALID_PROGRAM and not calling Exec_DAG). In these cases, no trace Artifact is produced under this profile.
Run Exec_DAG
- Call:
```
(outputs, exec_result_value) =
  Exec_DAG(program_artifact, input_artifacts, params_artifact?)
```
- Persist outputs and a pre-trace ExecutionResult Artifact (one that does not yet carry trace_ref; see the exec_result_ref constraints in §3.5), yielding output_refs and exec_result_ref.
Construct TraceDAGValue
- Using:
  - program_ref,
  - input_refs,
  - params_ref,
  - exec_result_ref,
  - exec_result_value.status and summary,
  - node-level outputs and errors recorded during execution,
- Construct TraceDAGValue as in §3–§4.

Persist TraceArtifact

Encode TraceDAGValue as an ASL/1 Artifact:

TraceArtifact {
  bytes    = TraceDAGBytes    // per ENC/PEL-TRACE-DAG/1
  type_tag = TYPE_TAG_PEL_TRACE_DAG_1
}

Call ASL/1-STORE.put(TraceArtifact) to obtain trace_ref.

Expose trace_ref
- Include trace_ref in the surface-level ExecutionResult Artifact (as an optional field).
- TGK/1 and higher layers can then traverse from ExecutionResult → trace_ref → TraceDAGValue → per-node output_refs.

This flow is informative; implementations MAY pipeline or optimize, provided the observable TraceDAGValue is unchanged.
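
The persistence order in the flow above, including the pre-trace ExecutionResult that avoids a circular dependency between the trace and the result (§3.5), can be sketched in Python. Everything concrete here is an assumption for illustration: the dict-based store, JSON as a stand-in encoding (the real layout belongs to ENC/PEL-TRACE-DAG/1), SHA-256 hex digests as References, and the names `put` and `persist_run`. The params_ref field is omitted, modelling a run without params.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def put(store: Dict[str, bytes], payload: dict) -> str:
    """Toy content-addressed put: canonical JSON bytes keyed by SHA-256."""
    data = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    ref = hashlib.sha256(data).hexdigest()
    store[ref] = data
    return ref


def persist_run(
    store: Dict[str, bytes],
    scheme_ref: str,
    program_ref: str,
    input_refs: List[str],
    status: int,
    summary: dict,
    node_traces: List[dict],
) -> Tuple[str, str, str]:
    """Return (exec_result_ref, trace_ref, final_result_ref)."""
    # 1. Pre-trace ExecutionResult: carries no trace_ref yet.
    pre_result = {
        "scheme_ref": scheme_ref,
        "program_ref": program_ref,
        "input_refs": input_refs,
        "status": status,
        "summary": summary,
    }
    exec_result_ref = put(store, pre_result)

    # 2. Trace links back to the pre-trace result.
    trace = {
        "pel1_version": 1,
        "scheme_ref": scheme_ref,
        "program_ref": program_ref,
        "status": status,
        "summary": summary,
        "exec_result_ref": exec_result_ref,
        "input_refs": input_refs,
        "node_traces": node_traces,
    }
    trace_ref = put(store, trace)

    # 3. Canonical surface result: same content plus trace_ref.
    final_result_ref = put(store, dict(pre_result, trace_ref=trace_ref))
    return exec_result_ref, trace_ref, final_result_ref
```

Because each Reference depends only on already-persisted content, the three Artifacts can be written in this order without either one needing the other's hash in advance.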

7. Conformance

An implementation is PEL/TRACE-DAG/1–conformant if it:

Implements the TraceDAGValue model
- Produces trace values that satisfy all field constraints in §3.
- Sets pel1_version = 1 and scheme_ref = SchemeRef_DAG_1.
Respects status alignment
- Ensures TraceDAGValue.status and TraceDAGValue.summary always match the run-level ExecutionResultValue produced for the same run.
- Ensures that the rules relating NodeTraceStatus and run-level ExecutionStatus (§3.3, §4) are upheld.
Maintains canonical node order and coverage
- For structurally valid Programs where Exec_DAG evaluated at least one Node, emits node_traces:
  - in the canonical topological order defined by PEL/PROGRAM-DAG/1, and
  - with exactly one NodeTraceDAG per Program Node.
- For runs where no Node is attempted, emits node_traces = [].
Uses NodeTraceStatus correctly
- Uses NODE_OK, NODE_FAILED, NODE_SKIPPED with semantics defined in §3.3 and §4.
- Ensures status_code and output_refs obey the rules for each status.
Ensures determinism
- For fixed inputs and Program, always produces identical TraceDAGValue (across runs, machines, implementations).
- Does not inject host or environment variability into the trace.
Separates concerns
- Does not conflate Store-level errors (e.g., ERR_NOT_FOUND) with node-level runtime failures; Store errors are handled in PEL/1-SURF and MUST NOT produce spurious NodeTraceDAG entries.
- Does not encode provenance graph semantics in the trace; it only provides structured data that TGK/1 can interpret.

8. Security and Privacy Considerations

Information exposure
- TraceDAGValue includes:
  - references to input Artifacts,
  - references to intermediate outputs,
  - per-node diagnostics.
- Anyone with access to the trace and the Store may gain insight into:
  - data flow,
  - intermediate values,
  - failure modes.
- Domains concerned with secrecy MUST treat trace Artifacts as sensitive and control access appropriately (e.g., by Store policy or overlays).
Trace volume
- Large Programs with many Nodes and outputs can produce large traces.
- Implementations SHOULD consider:
  - trace sampling or truncation (but MUST keep it deterministic if used),
  - separate “debug” vs “production” trace policies.
- Any truncation or sampling profile MUST be expressed as a separate spec/profile; this baseline assumes full traces.
Integrity
- Trace integrity is rooted in:
  - HASH/ASL1 (for Artifact identity),
  - Store semantics (ASL/1-STORE),
  - optional certification (CIL/1).
- Tampering with trace Artifacts is detectable via mismatches between:
  - ExecutionResult,
  - trace,
  - Program,
  - and actual Store content.
Non-repudiation
- By itself, TraceDAGValue does not prove that a run occurred in any particular environment; it is just data.
- Non-repudiation requires certification and policy layers (FER/1, FCT/1, CIL/1).

End of PEL/TRACE-DAG/1 v0.2.1 — DAG Execution Trace Profile

Document History

0.2.1 (2025-11-16): Registered as Tier-1 spec and aligned to the Amduat 2.0 substrate baseline.
