chore: bootstrap assistant platform baseline

commit 912f8ebc56

.gitignore (vendored, new file, 28 lines)
@@ -0,0 +1,28 @@
# Python
__pycache__/
*.py[cod]
*.so
.venv/
venv/

# Env/secrets
.env
.env.*
*.key
*.pem

# Local/runtime
logs/
*.log
runs.db

# OS/editor
.DS_Store
.vscode/
.idea/

# Build/temp
build/
dist/
.tmp/
tmp/

MESSAGES_RELEASE_FLOW.md (new file, 27 lines)
@@ -0,0 +1,27 @@
# Messages Release Flow

This flow creates a Nessie tag for `lake.db1.messages`, generates a manifest JSON, and appends a row to `lake.db1.releases_v2`.

## Run on lakehouse-core

```bash
ssh niklas@lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./create-messages-release-via-spark-container.sh'
```

## Custom release name

```bash
ssh niklas@lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./create-messages-release-via-spark-container.sh rel_2026-02-14_messages-v1'
```

## Outputs

- Manifest file written to `./manifests/<release_name>.json`
- Nessie tag `<release_name>` created at the current `main` hash (or reused if already present)
- Registry row appended to `lake.db1.releases_v2`
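Downstream tooling can consume the manifest like any JSON document. The sketch below is illustrative only (field names are taken from the manifest that `create_release_manifest.py` writes; the hash and snapshot values here are fake placeholders):

```python
import json


def summarize_manifest(manifest: dict) -> dict:
    """Pull the key release fields out of a lakehouse-release-manifest/v1 document."""
    table = manifest["tables"][0]
    return {
        "release_name": manifest["release"]["name"],
        "nessie_tag": manifest["nessie"]["ref"]["name"],
        "ref_hash": manifest["nessie"]["ref"]["hash"],
        "table": table["identifier"],
        "snapshot_id": table["current_snapshot_id"],
    }


# Example manifest with the same shape the release flow produces (values are fake).
manifest = json.loads("""
{
  "schema_version": "lakehouse-release-manifest/v1",
  "release": {"name": "rel_2026-02-14_messages-v1"},
  "nessie": {"ref": {"type": "tag", "name": "rel_2026-02-14_messages-v1", "hash": "abc123"}},
  "tables": [{"identifier": "lake.db1.messages", "current_snapshot_id": 42}]
}
""")
```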

## Verify

```bash
ssh niklas@lakehouse-core.rakeroots.lan "docker exec spark /opt/spark/bin/spark-sql --properties-file /opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf --packages 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5' -e \"SELECT release_name, table_identifier, snapshot_id, created_at_utc FROM lake.db1.releases_v2 WHERE table_identifier='lake.db1.messages' ORDER BY created_at_utc DESC LIMIT 10\""
```

MESSAGES_SCHEMA.md (new file, 23 lines)
@@ -0,0 +1,23 @@
# Messages Schema

Creates the Iceberg table `lake.db1.messages` with these ingest fields:

- `thread_id` STRING
- `message_id` STRING
- `sender` STRING
- `channel` STRING
- `sent_at` TIMESTAMP
- `body` STRING
- `metadata_json` STRING
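A row for this schema can be assembled before ingest as a plain dict, with arbitrary metadata serialized into the `metadata_json` string column. This is a minimal sketch (the helper name is hypothetical, not part of the repo):

```python
import json
from datetime import datetime, timezone


def make_message_row(thread_id, message_id, sender, channel, body,
                     metadata=None, sent_at=None):
    """Build a dict matching the lake.db1.messages columns.

    Free-form metadata is stored as a JSON string, matching the
    metadata_json STRING column rather than a nested type.
    """
    return {
        "thread_id": thread_id,
        "message_id": message_id,
        "sender": sender,
        "channel": channel,
        "sent_at": sent_at or datetime.now(timezone.utc),
        "body": body,
        "metadata_json": json.dumps(metadata or {}),
    }
```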

## Run on lakehouse-core

```bash
ssh niklas@lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./create-messages-table-via-spark-container.sh'
```

## Verify

```bash
ssh niklas@lakehouse-core.rakeroots.lan "docker exec spark /opt/spark/bin/spark-sql --properties-file /opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf --packages 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5' -e 'DESCRIBE TABLE lake.db1.messages'"
```

PROJECTOR_USAGE.md (new file, 142 lines)
@@ -0,0 +1,142 @@
# Release Projector

`release_projector.py` rebuilds serving projections (JanusGraph + Elasticsearch) from a lakehouse release manifest.

## What it does

1. Loads a release manifest JSON (or a `releases_v2` row containing `manifest_json`).
2. Resolves the Nessie tag/ref from the manifest (or `--nessie-ref`).
3. Reads the concept Iceberg table at that ref through Spark + Iceberg + Nessie.
4. Upserts each concept into JanusGraph and Elasticsearch.

`release_projector.py` accepts both concept-shaped rows and document-shaped rows. For docs tables, it auto-detects typical columns:

- name: `canonical_name|title|name|subject`
- id: `concept_id|doc_id|document_id|id|uuid`
- summary text: `summary|description|abstract|content|text|body`
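The auto-detection above amounts to a first-match scan over the table's column names. A minimal sketch (illustrative, not the projector's actual code; only the candidate lists are taken from the doc):

```python
# Candidate column names in priority order, mirroring the lists above.
NAME_CANDIDATES = ["canonical_name", "title", "name", "subject"]
ID_CANDIDATES = ["concept_id", "doc_id", "document_id", "id", "uuid"]
TEXT_CANDIDATES = ["summary", "description", "abstract", "content", "text", "body"]


def pick_column(columns, candidates):
    """Return the first candidate present in the table's columns, else None."""
    for cand in candidates:
        if cand in columns:
            return cand
    return None


def detect_doc_columns(columns):
    """Map a docs table's columns onto the projector's name/id/text roles."""
    return {
        "name": pick_column(columns, NAME_CANDIDATES),
        "id": pick_column(columns, ID_CANDIDATES),
        "text": pick_column(columns, TEXT_CANDIDATES),
    }
```

For a table with columns `doc_id`, `title`, `body`, this resolves to `title` as the name, `doc_id` as the id, and `body` as the summary text.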

## Prerequisites

- Python deps: `python-dotenv`, `httpx`, `gremlinpython`, `pyspark`
- Spark/Iceberg/Nessie jars (default package coordinates are baked into the script)
- Network access to:
  - Nessie API (example: `http://lakehouse-core:19120/api/v2`)
  - MinIO S3 endpoint (example: `http://lakehouse-core:9000`)
  - JanusGraph Gremlin endpoint
  - Elasticsearch endpoint

## Recommended isolated env

Do not install projector dependencies into the system Python.

## Preferred: existing spark container on lakehouse-core

This reuses your existing `spark` container and Spark properties file.

Standard command (frozen):

```bash
./run-projector-standard.sh
```

Run by release name (no manifest path):

```bash
./run-projector-standard.sh --release-name rel_2026-02-14_docs-v1
```

Standard dry-run:

```bash
./run-projector-standard.sh --dry-run
```

Copy the files to the host:

```bash
rsync -av --delete /home/niklas/projects/jecio/ lakehouse-core.rakeroots.lan:/tmp/jecio/
```

Run a dry-run projection inside the `spark` container:

```bash
ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-via-spark-container.sh ./manifests/rel_2026-02-14_docs-v1.json lake.db1.docs --dry-run es'
```

Run a publish projection (writes to Janus/ES):

```bash
ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-standard.sh'
```

`run-projector-via-spark-container.sh` uses:

- container: `spark` (override with `SPARK_CONTAINER_NAME`)
- properties file: `/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf` (override with `SPARK_PROPS`)
- Spark packages: Iceberg + Nessie extensions (override with `SPARK_PACKAGES`)
- arg4 `targets`: `es|gremlin|both` (default `both`)
- arg5 `release_name`: optional; if set, loads the manifest from `releases_v2`

Direct projector usage:

```bash
python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets both
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets both
```

Local setup (fallback):

```bash
./setup_local_env.sh .venv-projector
source .venv-projector/bin/activate
```

Remote setup (fallback, venv on `lakehouse-core`):

```bash
scp release_projector.py requirements-projector.txt manifests/rel_2026-02-14_docs-v1.json lakehouse-core.rakeroots.lan:/tmp/
ssh lakehouse-core.rakeroots.lan 'python3 -m venv /tmp/jecio-projector-venv && /tmp/jecio-projector-venv/bin/pip install --upgrade pip && /tmp/jecio-projector-venv/bin/pip install -r /tmp/requirements-projector.txt'
```

## Required env vars (example)

```bash
export NESSIE_URI=http://lakehouse-core:19120/api/v2
export NESSIE_WAREHOUSE=s3a://lakehouse/warehouse
export S3_ENDPOINT=http://lakehouse-core:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin

export GREMLIN_URL=ws://janus.rakeroots.lan:8182/gremlin
export ES_URL=http://janus.rakeroots.lan:9200
export ES_INDEX=concepts
```

## Run

```bash
/tmp/jecio-projector-venv/bin/python /tmp/release_projector.py \
  --manifest-file /tmp/rel_2026-02-14_docs-v1.json \
  --concept-table lake.db1.docs \
  --dry-run
```

Or locally:

```bash
python3 release_projector.py \
  --manifest-file /path/to/release.json \
  --concept-table lake.db1.concepts
```

If the manifest carries a Nessie tag in fields like `nessie.tag`, you can omit `--nessie-ref`.

Dry run:

```bash
python3 release_projector.py \
  --manifest-file /path/to/release.json \
  --concept-table lake.db1.concepts \
  --dry-run
```

connectivity_check.py (new file, 126 lines)
@@ -0,0 +1,126 @@
import os

import requests
from dotenv import load_dotenv

# Optional: only needed for the Gremlin websocket test
try:
    import websocket
    HAS_WEBSOCKET = True
except ImportError:
    HAS_WEBSOCKET = False


def ok(msg):
    print(f"[ OK ] {msg}")


def fail(msg):
    print(f"[FAIL] {msg}")


def load_env():
    load_dotenv()
    ok("Loaded .env file")


def test_http(name, url, path="", method="GET", json_body=None):
    full_url = url.rstrip("/") + path
    try:
        resp = requests.request(
            method,
            full_url,
            json=json_body,
            timeout=5,
        )
        if resp.status_code < 400:
            ok(f"{name} reachable ({resp.status_code}) → {full_url}")
            return True
        else:
            fail(f"{name} error ({resp.status_code}) → {full_url}")
    except Exception as e:
        fail(f"{name} unreachable → {full_url} ({e})")
    return False


def test_gremlin_ws(url):
    if not HAS_WEBSOCKET:
        fail("Gremlin test skipped (websocket-client not installed)")
        return False

    try:
        ws = websocket.create_connection(url, timeout=5)
        ws.close()
        ok(f"Gremlin websocket reachable → {url}")
        return True
    except Exception as e:
        fail(f"Gremlin websocket unreachable → {url} ({e})")
        return False


def main():
    load_env()

    GREMLIN_URL = os.getenv("GREMLIN_URL", "ws://localhost:8182/gremlin")
    ES_URL = os.getenv("ES_URL", "http://localhost:9200")
    ES_INDEX = os.getenv("ES_INDEX", "concepts")
    IPFS_API = os.getenv("IPFS_API", "http://localhost:5001")
    OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434")
    OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.1:8b")
    OLLAMA_EMBED_MODEL = os.getenv("OLLAMA_EMBED_MODEL", "nomic-embed-text")

    print("\n=== Connectivity checks ===\n")

    # Gremlin
    test_gremlin_ws(GREMLIN_URL)

    # Elasticsearch root
    test_http("Elasticsearch", ES_URL)

    # Elasticsearch index existence
    test_http(
        "Elasticsearch index",
        ES_URL,
        path=f"/{ES_INDEX}",
        method="HEAD",
    )

    # IPFS (Kubo)
    test_http(
        "IPFS API",
        IPFS_API,
        path="/api/v0/version",
        method="POST",
    )

    # Ollama base
    test_http(
        "Ollama",
        OLLAMA_URL,
        path="/api/tags",
    )

    # Ollama model availability (best-effort)
    try:
        resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
        models = [m["name"] for m in resp.json().get("models", [])]

        if OLLAMA_MODEL in models:
            ok(f"Ollama model available → {OLLAMA_MODEL}")
        else:
            fail(f"Ollama model NOT found → {OLLAMA_MODEL}")

        if OLLAMA_EMBED_MODEL in models:
            ok(f"Ollama embed model available → {OLLAMA_EMBED_MODEL}")
        else:
            fail(f"Ollama embed model NOT found → {OLLAMA_EMBED_MODEL}")

    except Exception as e:
        fail(f"Ollama model check failed ({e})")

    print("\n=== Done ===\n")


if __name__ == "__main__":
    main()
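The script reads its endpoints from a `.env` file via `python-dotenv`, falling back to the localhost defaults shown in `main()`. A sample `.env` (hostnames here are illustrative, not prescribed by the repo) might look like:

```
GREMLIN_URL=ws://janus.rakeroots.lan:8182/gremlin
ES_URL=http://janus.rakeroots.lan:9200
ES_INDEX=concepts
IPFS_API=http://lakehouse-core:5001
OLLAMA_URL=http://lakehouse-core:11434
OLLAMA_MODEL=llama3.1:8b
OLLAMA_EMBED_MODEL=nomic-embed-text
```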

create-messages-release-via-spark-container.sh (new executable file, 47 lines)
@@ -0,0 +1,47 @@
#!/usr/bin/env bash
set -euo pipefail

RELEASE_NAME="${1:-rel_$(date -u +%Y-%m-%d)_messages-v1}"
TABLE="${MESSAGES_TABLE:-lake.db1.messages}"
MANIFEST_LOCAL="${2:-./manifests/${RELEASE_NAME}.json}"
DESCRIPTION="${RELEASE_DESCRIPTION:-Messages release for ${TABLE}}"
CREATED_BY="${RELEASE_CREATED_BY:-${USER:-unknown}}"
NESSIE_URI="${NESSIE_URI:-http://nessie:19120/api/v2}"
RELEASES_TABLE="${RELEASES_TABLE:-lake.db1.releases_v2}"

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"

SCRIPT_LOCAL="${SCRIPT_LOCAL:-./create_release_manifest.py}"
SCRIPT_REMOTE="/tmp/create_release_manifest.py"
MANIFEST_REMOTE="/tmp/${RELEASE_NAME}.json"

if [[ ! -f "$SCRIPT_LOCAL" ]]; then
  echo "create_release_manifest.py not found at: $SCRIPT_LOCAL" >&2
  exit 1
fi

mkdir -p "$(dirname "$MANIFEST_LOCAL")"

docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-submit \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    "$SCRIPT_REMOTE" \
    --release-name "$RELEASE_NAME" \
    --table "$TABLE" \
    --nessie-uri "$NESSIE_URI" \
    --manifest-out "$MANIFEST_REMOTE" \
    --description "$DESCRIPTION" \
    --created-by "$CREATED_BY" \
    --releases-table "$RELEASES_TABLE"

docker cp "$CONTAINER_NAME":"$MANIFEST_REMOTE" "$MANIFEST_LOCAL"

echo "[DONE] Saved manifest: $MANIFEST_LOCAL"

create-messages-table-via-spark-container.sh (new executable file, 35 lines)
@@ -0,0 +1,35 @@
#!/usr/bin/env bash
set -euo pipefail

# Creates the Iceberg table for assistant message ingest.
# Default table: lake.db1.messages

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
MESSAGES_TABLE="${MESSAGES_TABLE:-lake.db1.messages}"

SQL="
CREATE NAMESPACE IF NOT EXISTS lake.db1;

CREATE TABLE IF NOT EXISTS ${MESSAGES_TABLE} (
  thread_id STRING,
  message_id STRING,
  sender STRING,
  channel STRING,
  sent_at TIMESTAMP,
  body STRING,
  metadata_json STRING
)
USING iceberg
PARTITIONED BY (days(sent_at));
"

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-sql \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    -e "$SQL"

create_release_manifest.py (new file, 279 lines)
@@ -0,0 +1,279 @@
import argparse
import hashlib
import json
import os
import urllib.error
import urllib.parse
import urllib.request
from datetime import datetime, timezone

from pyspark.sql import SparkSession
from pyspark.sql import types as T


def now_iso() -> str:
    return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace('+00:00', 'Z')


def http_json(method: str, url: str, payload: dict | None = None) -> dict:
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read().decode("utf-8")
        return json.loads(body) if body else {}


def get_ref(nessie_uri: str, ref_name: str) -> dict | None:
    try:
        return http_json("GET", f"{nessie_uri.rstrip('/')}/trees/{urllib.parse.quote(ref_name, safe='')}")
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return None
        raise


def extract_ref_hash(ref_obj: dict) -> str:
    # Nessie responses can vary by endpoint/version:
    # - {"type":"BRANCH","name":"main","hash":"..."}
    # - {"reference":{"type":"BRANCH","name":"main","hash":"..."}}
    if isinstance(ref_obj.get("hash"), str) and ref_obj["hash"]:
        return ref_obj["hash"]
    reference = ref_obj.get("reference")
    if isinstance(reference, dict) and isinstance(reference.get("hash"), str) and reference["hash"]:
        return reference["hash"]
    raise KeyError("hash")


def ensure_tag(nessie_uri: str, tag_name: str) -> dict:
    existing = get_ref(nessie_uri, tag_name)
    if existing is not None:
        return existing

    main_ref = http_json("GET", f"{nessie_uri.rstrip('/')}/trees/main")
    payload = {
        "type": "BRANCH",
        "name": "main",
        "hash": extract_ref_hash(main_ref),
    }
    query = urllib.parse.urlencode({"name": tag_name, "type": "TAG"})
    http_json("POST", f"{nessie_uri.rstrip('/')}/trees?{query}", payload)
    created = get_ref(nessie_uri, tag_name)
    if created is None:
        raise RuntimeError(f"Tag creation appeared to succeed but tag '{tag_name}' is not retrievable")
    return created


def create_registry_table_if_missing(spark: SparkSession, releases_table: str) -> None:
    spark.sql(
        f"""
        CREATE TABLE IF NOT EXISTS {releases_table} (
            release_name STRING,
            ref_type STRING,
            ref_name STRING,
            ref_hash STRING,
            created_at_utc STRING,
            ingested_at_utc STRING,
            table_identifier STRING,
            snapshot_id BIGINT,
            metadata_location STRING,
            manifest_sha256 STRING,
            manifest_json STRING
        ) USING iceberg
        """
    )


def _to_utc_datetime(value: str):
    # Accept ISO strings with 'Z' suffix.
    return datetime.fromisoformat(value.replace("Z", "+00:00")).astimezone(timezone.utc)


def _convert_value_for_type(field: T.StructField, value):
    if value is None:
        return None
    dt = field.dataType
    if isinstance(dt, T.StringType):
        return str(value)
    if isinstance(dt, (T.LongType, T.IntegerType, T.ShortType, T.ByteType)):
        return int(value)
    if isinstance(dt, T.BooleanType):
        return bool(value)
    if isinstance(dt, (T.FloatType, T.DoubleType)):
        return float(value)
    if isinstance(dt, T.TimestampType):
        if isinstance(value, datetime):
            return value
        return _to_utc_datetime(str(value))
    if isinstance(dt, T.DateType):
        if isinstance(value, datetime):
            return value.date()
        return _to_utc_datetime(str(value)).date()
    # Leave unsupported/complex types as-is; Spark can still validate and fail clearly.
    return value


def append_registry_row(
    spark: SparkSession,
    releases_table: str,
    release_name: str,
    ref_type: str,
    ref_name: str,
    ref_hash: str,
    created_at_utc: str,
    ingested_at_utc: str,
    table_identifier: str,
    snapshot_id: int,
    metadata_location: str,
    manifest_sha256: str,
    manifest_json: str,
    created_by: str,
    description: str,
) -> None:
    target_schema = spark.table(releases_table).schema
    base_values = {
        "release_name": release_name,
        "ref_type": ref_type,
        "ref_name": ref_name,
        "ref_hash": ref_hash,
        "created_at_utc": created_at_utc,
        "ingested_at_utc": ingested_at_utc,
        "table_identifier": table_identifier,
        "snapshot_id": int(snapshot_id),
        "metadata_location": metadata_location,
        "manifest_sha256": manifest_sha256,
        "manifest_json": manifest_json,
        "created_by": created_by,
        "description": description,
        "release_description": description,
    }

    row_values = []
    missing_required = []
    for field in target_schema.fields:
        name = field.name
        if name in base_values:
            row_values.append(_convert_value_for_type(field, base_values[name]))
            continue
        if field.nullable:
            row_values.append(None)
            continue
        missing_required.append(name)

    if missing_required:
        raise RuntimeError(
            "Cannot append to registry table "
            f"{releases_table}. Missing required columns with no known mapping: {', '.join(missing_required)}"
        )

    df = spark.createDataFrame([tuple(row_values)], schema=target_schema)
    df.writeTo(releases_table).append()


def main() -> None:
    p = argparse.ArgumentParser(description="Create a release tag + manifest + registry row for a table.")
    p.add_argument("--release-name", required=True)
    p.add_argument("--table", default="lake.db1.messages")
    p.add_argument("--nessie-uri", default=os.getenv("NESSIE_URI", "http://nessie:19120/api/v2"))
    p.add_argument("--manifest-out", required=True)
    p.add_argument("--description", default="Messages release")
    p.add_argument("--created-by", default=os.getenv("USER", "unknown"))
    p.add_argument("--releases-table", default=os.getenv("RELEASES_TABLE", "lake.db1.releases_v2"))
    p.add_argument("--skip-registry", action="store_true")
    args = p.parse_args()

    created_at = now_iso()
    tag_ref = ensure_tag(args.nessie_uri, args.release_name)
    ref_hash = extract_ref_hash(tag_ref)

    spark = SparkSession.builder.appName("create-release-manifest").getOrCreate()

    snap_row = spark.sql(
        f"SELECT snapshot_id FROM {args.table}.snapshots ORDER BY committed_at DESC LIMIT 1"
    ).collect()
    if not snap_row:
        raise RuntimeError(f"No snapshots found for table {args.table}")
    snapshot_id = int(snap_row[0]["snapshot_id"])

    meta_row = spark.sql(
        f"SELECT file AS metadata_location FROM {args.table}.metadata_log_entries ORDER BY timestamp DESC LIMIT 1"
    ).collect()
    if not meta_row:
        raise RuntimeError(f"No metadata log entries found for table {args.table}")
    metadata_location = str(meta_row[0]["metadata_location"])

    manifest = {
        "schema_version": "lakehouse-release-manifest/v1",
        "release": {
            "name": args.release_name,
            "created_at_utc": created_at,
            "created_by": args.created_by,
            "description": args.description,
        },
        "nessie": {
            "uri": args.nessie_uri,
            "ref": {
                "type": "tag",
                "name": args.release_name,
                "hash": ref_hash,
            },
        },
        "tables": [
            {
                "identifier": args.table,
                "format": "iceberg",
                "current_snapshot_id": snapshot_id,
                "metadata_location": metadata_location,
            }
        ],
    }

    manifest_json = json.dumps(manifest, ensure_ascii=False, indent=2)
    manifest_sha256 = hashlib.sha256(manifest_json.encode("utf-8")).hexdigest()

    os.makedirs(os.path.dirname(args.manifest_out) or ".", exist_ok=True)
    with open(args.manifest_out, "w", encoding="utf-8") as f:
        f.write(manifest_json)

    if not args.skip_registry:
        create_registry_table_if_missing(spark, args.releases_table)
        append_registry_row(
            spark=spark,
            releases_table=args.releases_table,
            release_name=args.release_name,
            ref_type="tag",
            ref_name=args.release_name,
            ref_hash=ref_hash,
            created_at_utc=created_at,
            ingested_at_utc=now_iso(),
            table_identifier=args.table,
            snapshot_id=snapshot_id,
            metadata_location=metadata_location,
            manifest_sha256=manifest_sha256,
            manifest_json=manifest_json,
            created_by=args.created_by,
            description=args.description,
        )

    print(f"[INFO] release_name={args.release_name}")
    print(f"[INFO] table={args.table}")
    print(f"[INFO] ref_hash={ref_hash}")
    print(f"[INFO] snapshot_id={snapshot_id}")
    print(f"[INFO] manifest_out={args.manifest_out}")
    if args.skip_registry:
        print("[INFO] registry=skipped")
    else:
        print(f"[INFO] registry_table={args.releases_table}")


if __name__ == "__main__":
    main()
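The `manifest_sha256` stored in the registry is the SHA-256 of the exact manifest text written to disk, so a saved manifest can later be re-verified against its registry row. A minimal sketch (the helper names are hypothetical; the serialization settings mirror what the script above uses):

```python
import hashlib
import json


def manifest_sha256_of(manifest: dict) -> str:
    """Serialize the same way create_release_manifest.py does, then hash."""
    text = json.dumps(manifest, ensure_ascii=False, indent=2)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verify_manifest_file(path: str, expected_sha256: str) -> bool:
    """Re-hash a saved manifest file and compare to the registry value."""
    with open(path, encoding="utf-8") as f:
        actual = hashlib.sha256(f.read().encode("utf-8")).hexdigest()
    return actual == expected_sha256
```

Because the hash is taken over the serialized text, any re-serialization with different key order, indentation, or escaping will produce a different digest; verification must compare the file bytes as written.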

docker/projector/Dockerfile (new file, 21 lines)
@@ -0,0 +1,21 @@
FROM python:3.11-slim

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    SPARK_LOCAL_HOSTNAME=localhost \
    SPARK_LOCAL_IP=127.0.0.1

RUN apt-get update \
 && apt-get install -y --no-install-recommends default-jre-headless ca-certificates \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements-projector.txt /app/requirements-projector.txt
RUN pip install --upgrade pip && pip install -r /app/requirements-projector.txt

COPY release_projector.py /app/release_projector.py

ENTRYPOINT ["python", "/app/release_projector.py"]

docker/projector/README.md (new file, 41 lines)
@@ -0,0 +1,41 @@
# Projector Container

Build on `lakehouse-core`:

```bash
docker build -t jecio/release-projector:0.1 -f docker/projector/Dockerfile /tmp/jecio
```

Dry-run:

```bash
docker run --rm --network host \
  -e NESSIE_URI=http://lakehouse-core:19120/api/v2 \
  -e NESSIE_WAREHOUSE=s3a://lakehouse/warehouse \
  -e S3_ENDPOINT=http://lakehouse-core:9000 \
  -e AWS_ACCESS_KEY_ID=minioadmin \
  -e AWS_SECRET_ACCESS_KEY=minioadmin \
  -v /tmp:/work \
  jecio/release-projector:0.1 \
  --manifest-file /work/rel_2026-02-14_docs-v1.json \
  --concept-table lake.db1.docs \
  --dry-run
```

Publish projection:

```bash
docker run --rm --network host \
  -e NESSIE_URI=http://lakehouse-core:19120/api/v2 \
  -e NESSIE_WAREHOUSE=s3a://lakehouse/warehouse \
  -e S3_ENDPOINT=http://lakehouse-core:9000 \
  -e AWS_ACCESS_KEY_ID=minioadmin \
  -e AWS_SECRET_ACCESS_KEY=minioadmin \
  -e GREMLIN_URL=ws://janus.rakeroots.lan:8182/gremlin \
  -e ES_URL=http://janus.rakeroots.lan:9200 \
  -e ES_INDEX=concepts \
  -v /tmp:/work \
  jecio/release-projector:0.1 \
  --manifest-file /work/rel_2026-02-14_docs-v1.json \
  --concept-table lake.db1.docs
```

ingest-message-via-spark-container.sh (new executable file, 56 lines)
@@ -0,0 +1,56 @@
|
||||||
|
#!/usr/bin/env bash
set -euo pipefail

TABLE="${1:-lake.db1.messages}"
THREAD_ID="${2:-}"
MESSAGE_ID="${3:-}"
SENDER="${4:-}"
CHANNEL="${5:-}"
SENT_AT="${6:-}"
BODY_B64="${7:-}"
METADATA_B64="${8:-}"

if [[ -z "$THREAD_ID" || -z "$MESSAGE_ID" || -z "$SENDER" || -z "$CHANNEL" || -z "$BODY_B64" ]]; then
  echo "Usage: $0 <table> <thread_id> <message_id> <sender> <channel> <sent_at_or_empty> <body_b64> <metadata_json_b64>" >&2
  exit 1
fi

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"

BODY="$(printf '%s' "$BODY_B64" | base64 -d)"
METADATA_JSON="{}"
if [[ -n "$METADATA_B64" ]]; then
  METADATA_JSON="$(printf '%s' "$METADATA_B64" | base64 -d)"
fi

sql_escape() {
  printf "%s" "$1" | sed "s/'/''/g"
}

THREAD_ID_ESC="$(sql_escape "$THREAD_ID")"
MESSAGE_ID_ESC="$(sql_escape "$MESSAGE_ID")"
SENDER_ESC="$(sql_escape "$SENDER")"
CHANNEL_ESC="$(sql_escape "$CHANNEL")"
BODY_ESC="$(sql_escape "$BODY")"
METADATA_ESC="$(sql_escape "$METADATA_JSON")"

if [[ -n "$SENT_AT" ]]; then
  SENT_AT_EXPR="TIMESTAMP '$(sql_escape "$SENT_AT")'"
else
  SENT_AT_EXPR="current_timestamp()"
fi

SQL="INSERT INTO ${TABLE} (thread_id, message_id, sender, channel, sent_at, body, metadata_json) VALUES ('${THREAD_ID_ESC}', '${MESSAGE_ID_ESC}', '${SENDER_ESC}', '${CHANNEL_ESC}', ${SENT_AT_EXPR}, '${BODY_ESC}', '${METADATA_ESC}')"

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-sql \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    -e "$SQL"

echo "[DONE] Inserted message_id=${MESSAGE_ID} thread_id=${THREAD_ID} into ${TABLE}"
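The `body_b64` and `metadata_b64` arguments exist so arbitrary text survives shell and SSH quoting; the script decodes them with `base64 -d` before building the INSERT. A minimal sketch of producing them from a caller, assuming hypothetical message content:

```python
import base64

# Hypothetical values a caller might pass to ingest-message-via-spark-container.sh.
body = "Hello from the lakehouse"
metadata = '{"source": "demo"}'

body_b64 = base64.b64encode(body.encode("utf-8")).decode("ascii")
metadata_b64 = base64.b64encode(metadata.encode("utf-8")).decode("ascii")

# Round-trip mirrors the script's `base64 -d` step.
print(base64.b64decode(body_b64).decode("utf-8"))  # → Hello from the lakehouse
```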
60
ingest-messages-batch-via-spark-container.sh
Executable file

@@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -euo pipefail

TABLE="${1:-lake.db1.messages}"
DEDUPE_MODE="${2:-none}"
PAYLOAD_B64="${3:-}"

if [[ -z "$PAYLOAD_B64" ]]; then
  echo "Usage: $0 <table> <dedupe_mode:none|message_id|thread_message> <payload_b64_json_array|@/path/to/payload.json>" >&2
  exit 1
fi

if [[ "$DEDUPE_MODE" != "none" && "$DEDUPE_MODE" != "message_id" && "$DEDUPE_MODE" != "thread_message" ]]; then
  echo "Invalid dedupe_mode: $DEDUPE_MODE (expected none|message_id|thread_message)" >&2
  exit 1
fi

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"

SCRIPT_LOCAL="${SCRIPT_LOCAL:-./ingest_messages_batch.py}"
SCRIPT_REMOTE="/tmp/ingest_messages_batch.py"

if [[ ! -f "$SCRIPT_LOCAL" ]]; then
  echo "ingest_messages_batch.py not found at: $SCRIPT_LOCAL" >&2
  exit 1
fi

docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"

SPARK_ARGS=(
  --table "$TABLE"
  --dedupe-mode "$DEDUPE_MODE"
)

if [[ "${PAYLOAD_B64:0:1}" == "@" ]]; then
  PAYLOAD_FILE_HOST="${PAYLOAD_B64:1}"
  if [[ ! -f "$PAYLOAD_FILE_HOST" ]]; then
    echo "Payload file not found: $PAYLOAD_FILE_HOST" >&2
    exit 1
  fi
  PAYLOAD_FILE_REMOTE="/opt/spark/work-dir/ingest_messages_payload.json"
  docker cp "$PAYLOAD_FILE_HOST" "$CONTAINER_NAME":"$PAYLOAD_FILE_REMOTE"
  # Ensure spark user can read the file regardless of ownership from docker cp.
  docker exec -u 0 "$CONTAINER_NAME" /bin/sh -lc "chmod 644 '$PAYLOAD_FILE_REMOTE' || true"
  SPARK_ARGS+=(--payload-file "$PAYLOAD_FILE_REMOTE")
else
  SPARK_ARGS+=(--payload-b64 "$PAYLOAD_B64")
fi

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-submit \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    "$SCRIPT_REMOTE" \
    "${SPARK_ARGS[@]}"
139
ingest_messages_batch.py
Normal file

@@ -0,0 +1,139 @@
import argparse
import base64
import json
from datetime import datetime, timezone
from typing import Any, Dict, List

from pyspark.sql import SparkSession, types as T


def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def decode_payload(payload_b64: str) -> List[Dict[str, Any]]:
    raw = base64.b64decode(payload_b64.encode("ascii")).decode("utf-8")
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("Payload must decode to a JSON array")
    out: List[Dict[str, Any]] = []
    for i, row in enumerate(data):
        if not isinstance(row, dict):
            raise ValueError(f"Row {i} must be a JSON object")
        out.append(row)
    return out


def normalize_rows(rows: List[Dict[str, Any]]) -> List[tuple]:
    norm: List[tuple] = []
    for i, r in enumerate(rows):
        thread_id = str(r.get("thread_id") or "").strip()
        message_id = str(r.get("message_id") or "").strip()
        sender = str(r.get("sender") or "").strip()
        channel = str(r.get("channel") or "").strip()
        body = str(r.get("body") or "").strip()
        if not thread_id or not message_id or not sender or not channel or not body:
            raise ValueError(
                f"Row {i} missing required fields. "
                "Required: thread_id, message_id, sender, channel, body"
            )

        sent_at_raw = r.get("sent_at")
        sent_at = str(sent_at_raw).strip() if sent_at_raw is not None else ""
        metadata = r.get("metadata", {})
        if not isinstance(metadata, dict):
            metadata = {}
        metadata_json = json.dumps(metadata, ensure_ascii=False, sort_keys=True)
        norm.append((thread_id, message_id, sender, channel, sent_at, body, metadata_json))
    return norm


def main() -> None:
    p = argparse.ArgumentParser(description="Batch ingest messages into Iceberg table")
    p.add_argument("--table", required=True)
    p.add_argument(
        "--dedupe-mode",
        choices=["none", "message_id", "thread_message"],
        default="none",
        help="Optional dedupe strategy against existing target rows",
    )
    p.add_argument("--payload-b64")
    p.add_argument("--payload-file")
    args = p.parse_args()

    if not args.payload_b64 and not args.payload_file:
        raise ValueError("Provide either --payload-b64 or --payload-file")
    if args.payload_b64 and args.payload_file:
        raise ValueError("Provide only one of --payload-b64 or --payload-file")

    if args.payload_file:
        with open(args.payload_file, "r", encoding="utf-8") as f:
            file_data = json.load(f)
        if not isinstance(file_data, list):
            raise ValueError("--payload-file must contain a JSON array")
        rows = normalize_rows(file_data)
    else:
        rows = normalize_rows(decode_payload(args.payload_b64 or ""))
    if not rows:
        print("[INFO] No rows supplied; nothing to ingest.")
        return

    spark = SparkSession.builder.appName("ingest-messages-batch").getOrCreate()

    schema = T.StructType(
        [
            T.StructField("thread_id", T.StringType(), False),
            T.StructField("message_id", T.StringType(), False),
            T.StructField("sender", T.StringType(), False),
            T.StructField("channel", T.StringType(), False),
            T.StructField("sent_at_raw", T.StringType(), True),
            T.StructField("body", T.StringType(), False),
            T.StructField("metadata_json", T.StringType(), False),
        ]
    )
    df = spark.createDataFrame(rows, schema=schema)
    df.createOrReplaceTempView("_batch_messages")

    base_select = """
        SELECT
          b.thread_id,
          b.message_id,
          b.sender,
          b.channel,
          CASE
            WHEN b.sent_at_raw IS NULL OR TRIM(b.sent_at_raw) = '' THEN current_timestamp()
            ELSE CAST(b.sent_at_raw AS TIMESTAMP)
          END AS sent_at,
          b.body,
          b.metadata_json
        FROM _batch_messages b
    """
    if args.dedupe_mode == "none":
        insert_select = base_select
    elif args.dedupe_mode == "message_id":
        insert_select = (
            base_select
            + f" LEFT ANTI JOIN {args.table} t ON b.message_id = t.message_id"
        )
    else:
        insert_select = (
            base_select
            + f" LEFT ANTI JOIN {args.table} t ON b.thread_id = t.thread_id AND b.message_id = t.message_id"
        )

    spark.sql(
        f"""
        INSERT INTO {args.table} (thread_id, message_id, sender, channel, sent_at, body, metadata_json)
        {insert_select}
        """
    )

    print(f"[INFO] rows_in={len(rows)}")
    print(f"[INFO] dedupe_mode={args.dedupe_mode}")
    print(f"[INFO] table={args.table}")
    print(f"[INFO] ingested_at_utc={now_iso()}")
    print(f"[DONE] Batch ingest finished for {args.table}")


if __name__ == "__main__":
    main()
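The batch payload is a base64-encoded JSON array of message objects. A sketch of building one on the caller side (hypothetical field values), mirroring what `decode_payload` and `normalize_rows` expect:

```python
import base64
import json

# One message object; thread_id, message_id, sender, channel, body are required.
rows = [
    {
        "thread_id": "t-001",
        "message_id": "m-001",
        "sender": "alice@example.com",
        "channel": "email-imap",
        "body": "hello",
        "metadata": {"imap_uid": 42},
    }
]
payload_b64 = base64.b64encode(json.dumps(rows).encode("utf-8")).decode("ascii")

# Round-trip mirrors decode_payload() in ingest_messages_batch.py.
decoded = json.loads(base64.b64decode(payload_b64).decode("utf-8"))
print(decoded[0]["message_id"])  # → m-001
```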
42
manifests/rel_2026-02-14_docs-v1.json
Normal file

@@ -0,0 +1,42 @@
{
  "schema_version": "lakehouse-release-manifest/v1",
  "release": {
    "name": "rel_2026-02-14_docs-v1",
    "created_at_utc": "2026-02-14T09:48:38Z",
    "created_by": "niklas",
    "description": "First tagged release for lake.db1.docs"
  },
  "nessie": {
    "uri": "http://lakehouse-core:19120/api/v2",
    "ref": {
      "type": "tag",
      "name": "rel_2026-02-14_docs-v1",
      "hash": "1b16b4c4f6e99d43a27a21712aab319c1840a415f36bc6bebb2c9d2a89f09ef0"
    }
  },
  "warehouse": {
    "bucket": "lakehouse",
    "warehouse_path": "s3a://lakehouse/warehouse",
    "s3_endpoint": "http://lakehouse-core:9000",
    "region": "us-east-1"
  },
  "tables": [
    {
      "identifier": "lake.db1.docs",
      "format": "iceberg",
      "current_snapshot_id": 4212875880010474311,
      "metadata_location": "s3a://lakehouse/warehouse/db1/docs_2693aab9-54ea-43a8-892b-a922fdfc063a/metadata/00001-64f23fb4-2cb3-45c5-9c20-e6c91c9d73ef.metadata.json"
    }
  ],
  "projection": {
    "enabled": false,
    "projection_id": null,
    "targets": []
  },
  "artifacts": {
    "ipfs": {
      "pinned": false,
      "cid": null
    }
  }
}
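A consumer can pin queries to a release by reading the snapshot id out of the manifest's `tables` array. A minimal reader sketch (inlined manifest fragment; the real file lives under `./manifests/`):

```python
import json

# Fragment of a release manifest, shaped like the JSON above.
manifest_text = """
{
  "release": {"name": "rel_2026-02-14_docs-v1"},
  "tables": [
    {"identifier": "lake.db1.docs", "current_snapshot_id": 4212875880010474311}
  ]
}
"""
manifest = json.loads(manifest_text)

# Map table identifier -> snapshot id pinned by this release.
by_table = {t["identifier"]: t["current_snapshot_id"] for t in manifest["tables"]}
print(by_table["lake.db1.docs"])  # → 4212875880010474311
```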
39
query-assistant-actions-via-spark-container.sh
Executable file

@@ -0,0 +1,39 @@
#!/usr/bin/env bash
set -euo pipefail

STATUS="${1:-}"
TASK_TYPE="${2:-}"
RELEASE_NAME="${3:-}"
STEP_ID="${4:-}"
ACTION_TYPE="${5:-}"
LIMIT="${6:-50}"
ACTION_TABLE="${ACTION_TABLE:-lake.db1.assistant_actions}"

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./query_assistant_actions.py}"
SCRIPT_REMOTE="/tmp/query_assistant_actions.py"

if [[ ! -f "$SCRIPT_LOCAL" ]]; then
  echo "query_assistant_actions.py not found at: $SCRIPT_LOCAL" >&2
  exit 1
fi

docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-submit \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    "$SCRIPT_REMOTE" \
    --table "$ACTION_TABLE" \
    --status "$STATUS" \
    --task-type "$TASK_TYPE" \
    --release-name "$RELEASE_NAME" \
    --step-id "$STEP_ID" \
    --action-type "$ACTION_TYPE" \
    --limit "$LIMIT"
35
query-assistant-feedback-via-spark-container.sh
Executable file

@@ -0,0 +1,35 @@
#!/usr/bin/env bash
set -euo pipefail

OUTCOME="${1:-}"
TASK_TYPE="${2:-}"
RELEASE_NAME="${3:-}"
LIMIT="${4:-50}"
FEEDBACK_TABLE="${FEEDBACK_TABLE:-lake.db1.assistant_feedback}"

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./query_assistant_feedback.py}"
SCRIPT_REMOTE="/tmp/query_assistant_feedback.py"

if [[ ! -f "$SCRIPT_LOCAL" ]]; then
  echo "query_assistant_feedback.py not found at: $SCRIPT_LOCAL" >&2
  exit 1
fi

docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-submit \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    "$SCRIPT_REMOTE" \
    --table "$FEEDBACK_TABLE" \
    --outcome "$OUTCOME" \
    --task-type "$TASK_TYPE" \
    --release-name "$RELEASE_NAME" \
    --limit "$LIMIT"
42
query-assistant-metrics-via-spark-container.sh
Executable file

@@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -euo pipefail

TASK_TYPE="${1:-}"
RELEASE_NAME="${2:-}"
OUTCOME="${3:-}"
GROUP_BY="${4:-both}"
LIMIT="${5:-100}"
FEEDBACK_TABLE="${FEEDBACK_TABLE:-lake.db1.assistant_feedback}"

if [[ "$GROUP_BY" != "task_type" && "$GROUP_BY" != "release_name" && "$GROUP_BY" != "both" ]]; then
  echo "Invalid group_by: $GROUP_BY (expected task_type|release_name|both)" >&2
  exit 1
fi

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./query_assistant_metrics.py}"
SCRIPT_REMOTE="/tmp/query_assistant_metrics.py"

if [[ ! -f "$SCRIPT_LOCAL" ]]; then
  echo "query_assistant_metrics.py not found at: $SCRIPT_LOCAL" >&2
  exit 1
fi

docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-submit \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    "$SCRIPT_REMOTE" \
    --table "$FEEDBACK_TABLE" \
    --task-type "$TASK_TYPE" \
    --release-name "$RELEASE_NAME" \
    --outcome "$OUTCOME" \
    --group-by "$GROUP_BY" \
    --limit "$LIMIT"
38
query-imap-checkpoint-via-spark-container.sh
Executable file

@@ -0,0 +1,38 @@
#!/usr/bin/env bash
set -euo pipefail

HOST="${1:-}"
MAILBOX="${2:-}"
USERNAME="${3:-}"
TABLE="${4:-lake.db1.messages}"

if [[ -z "$HOST" || -z "$MAILBOX" || -z "$USERNAME" ]]; then
  echo "Usage: $0 <host> <mailbox> <username> [table]" >&2
  exit 1
fi

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./query_imap_checkpoint.py}"
SCRIPT_REMOTE="/tmp/query_imap_checkpoint.py"

if [[ ! -f "$SCRIPT_LOCAL" ]]; then
  echo "query_imap_checkpoint.py not found at: $SCRIPT_LOCAL" >&2
  exit 1
fi

docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-submit \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    "$SCRIPT_REMOTE" \
    --table "$TABLE" \
    --host "$HOST" \
    --mailbox "$MAILBOX" \
    --username "$USERNAME"
45
query_assistant_actions.py
Normal file

@@ -0,0 +1,45 @@
import argparse
import json
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def main() -> None:
    p = argparse.ArgumentParser(description="Query assistant actions")
    p.add_argument("--table", default=os.getenv("ACTION_TABLE", "lake.db1.assistant_actions"))
    p.add_argument("--status", default="")
    p.add_argument("--task-type", default="")
    p.add_argument("--release-name", default="")
    p.add_argument("--step-id", default="")
    p.add_argument("--action-type", default="")
    p.add_argument("--limit", type=int, default=50)
    args = p.parse_args()

    spark = SparkSession.builder.appName("query-assistant-actions").getOrCreate()
    df = spark.table(args.table)

    if args.status:
        df = df.where(F.col("status") == args.status)
    if args.task_type:
        df = df.where(F.col("task_type") == args.task_type)
    if args.release_name:
        df = df.where(F.col("release_name") == args.release_name)
    if args.step_id:
        df = df.where(F.col("step_id") == args.step_id)
    if args.action_type:
        df = df.where(F.col("action_type") == args.action_type)

    rows = (
        df.orderBy(F.col("created_at_utc").desc_nulls_last())
        .limit(max(1, min(args.limit, 500)))
        .collect()
    )

    out = [r.asDict(recursive=True) for r in rows]
    print(json.dumps(out, ensure_ascii=False))


if __name__ == "__main__":
    main()
43
query_assistant_feedback.py
Normal file

@@ -0,0 +1,43 @@
import argparse
import json
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def main() -> None:
    p = argparse.ArgumentParser(description="Query assistant feedback rows")
    p.add_argument("--table", default=os.getenv("FEEDBACK_TABLE", "lake.db1.assistant_feedback"))
    p.add_argument("--outcome", default="")
    p.add_argument("--task-type", default="")
    p.add_argument("--release-name", default="")
    p.add_argument("--limit", type=int, default=50)
    args = p.parse_args()

    spark = SparkSession.builder.appName("query-assistant-feedback").getOrCreate()
    df = spark.table(args.table)

    if args.outcome:
        df = df.where(F.col("outcome") == args.outcome)
    if args.task_type:
        df = df.where(F.col("task_type") == args.task_type)
    if args.release_name:
        df = df.where(F.col("release_name") == args.release_name)

    rows = (
        df.orderBy(F.col("created_at_utc").desc_nulls_last())
        .limit(max(1, min(args.limit, 500)))
        .collect()
    )

    out = [r.asDict(recursive=True) for r in rows]
    print(json.dumps(out, ensure_ascii=False))


if __name__ == "__main__":
    main()
57
query_assistant_metrics.py
Normal file

@@ -0,0 +1,57 @@
import argparse
import json
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def main() -> None:
    p = argparse.ArgumentParser(description="Query assistant feedback metrics")
    p.add_argument("--table", default=os.getenv("FEEDBACK_TABLE", "lake.db1.assistant_feedback"))
    p.add_argument("--task-type", default="")
    p.add_argument("--release-name", default="")
    p.add_argument("--outcome", default="")
    p.add_argument("--group-by", choices=["task_type", "release_name", "both"], default="both")
    p.add_argument("--limit", type=int, default=100)
    args = p.parse_args()

    spark = SparkSession.builder.appName("query-assistant-metrics").getOrCreate()
    df = spark.table(args.table)

    if args.task_type:
        df = df.where(F.col("task_type") == args.task_type)
    if args.release_name:
        df = df.where(F.col("release_name") == args.release_name)
    if args.outcome:
        df = df.where(F.col("outcome") == args.outcome)

    if args.group_by == "task_type":
        group_cols = [F.col("task_type")]
    elif args.group_by == "release_name":
        group_cols = [F.col("release_name")]
    else:
        group_cols = [F.col("task_type"), F.col("release_name")]

    agg = (
        df.groupBy(*group_cols)
        .agg(
            F.count(F.lit(1)).alias("total"),
            F.sum(F.when(F.col("outcome") == "accepted", F.lit(1)).otherwise(F.lit(0))).alias("accepted"),
            F.sum(F.when(F.col("outcome") == "edited", F.lit(1)).otherwise(F.lit(0))).alias("edited"),
            F.sum(F.when(F.col("outcome") == "rejected", F.lit(1)).otherwise(F.lit(0))).alias("rejected"),
            F.avg(F.col("confidence")).alias("avg_confidence"),
        )
        .withColumn("accept_rate", F.when(F.col("total") > 0, F.col("accepted") / F.col("total")).otherwise(F.lit(0.0)))
        .withColumn("edit_rate", F.when(F.col("total") > 0, F.col("edited") / F.col("total")).otherwise(F.lit(0.0)))
        .withColumn("reject_rate", F.when(F.col("total") > 0, F.col("rejected") / F.col("total")).otherwise(F.lit(0.0)))
        .orderBy(F.col("total").desc(), *[c.asc() for c in group_cols])
        .limit(max(1, min(args.limit, 1000)))
    )

    rows = [r.asDict(recursive=True) for r in agg.collect()]
    print(json.dumps(rows, ensure_ascii=False))


if __name__ == "__main__":
    main()
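The accept/edit/reject rates above are plain ratios of per-outcome counts over the grouped total. The same arithmetic in plain Python (hypothetical outcome values), useful as a sanity check on the Spark output:

```python
from collections import Counter

# Hypothetical feedback outcomes for one (task_type, release_name) group.
outcomes = ["accepted", "accepted", "edited", "rejected", "accepted"]
counts = Counter(outcomes)
total = len(outcomes)

# Mirrors the accept_rate / edit_rate / reject_rate columns above.
accept_rate = counts["accepted"] / total if total else 0.0
edit_rate = counts["edited"] / total if total else 0.0
reject_rate = counts["rejected"] / total if total else 0.0
print(accept_rate, edit_rate, reject_rate)  # → 0.6 0.2 0.2
```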
43
query_imap_checkpoint.py
Normal file

@@ -0,0 +1,43 @@
import argparse
import json
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def main() -> None:
    p = argparse.ArgumentParser(description="Query latest IMAP UID checkpoint from messages table")
    p.add_argument("--table", default=os.getenv("MESSAGES_TABLE", "lake.db1.messages"))
    p.add_argument("--host", required=True)
    p.add_argument("--mailbox", required=True)
    p.add_argument("--username", required=True)
    args = p.parse_args()

    spark = SparkSession.builder.appName("query-imap-checkpoint").getOrCreate()
    df = spark.table(args.table)

    md = F.col("metadata_json")
    uid_col = F.get_json_object(md, "$.imap_uid")
    host_col = F.get_json_object(md, "$.host")
    mailbox_col = F.get_json_object(md, "$.mailbox")
    username_col = F.get_json_object(md, "$.username")

    filtered = (
        df.where(F.col("channel") == "email-imap")
        .where(host_col == args.host)
        .where(mailbox_col == args.mailbox)
        .where((username_col == args.username) | username_col.isNull() | (username_col == ""))
        .where(uid_col.isNotNull())
    )

    row = filtered.select(F.max(uid_col.cast("long")).alias("max_uid")).collect()
    max_uid = None
    if row and row[0]["max_uid"] is not None:
        max_uid = int(row[0]["max_uid"])

    print(json.dumps({"max_uid": max_uid}, ensure_ascii=False))


if __name__ == "__main__":
    main()
60
record-assistant-action-via-spark-container.sh
Executable file

@@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -euo pipefail

ACTION_TABLE="${ACTION_TABLE:-lake.db1.assistant_actions}"
ACTION_ID="${1:-}"
CREATED_AT_UTC="${2:-}"
TASK_TYPE="${3:-}"
RELEASE_NAME="${4:-}"
OBJECTIVE_B64="${5:-}"
STEP_ID="${6:-}"
STEP_TITLE_B64="${7:-}"
ACTION_TYPE="${8:-}"
REQUIRES_APPROVAL="${9:-false}"
APPROVED="${10:-false}"
STATUS="${11:-}"
OUTPUT_B64="${12:-}"
ERROR_B64="${13:-}"

if [[ -z "$ACTION_ID" || -z "$CREATED_AT_UTC" || -z "$TASK_TYPE" || -z "$STEP_ID" || -z "$ACTION_TYPE" || -z "$STATUS" ]]; then
  echo "Usage: $0 <action_id> <created_at_utc> <task_type> <release_name> <objective_b64> <step_id> <step_title_b64> <action_type> <requires_approval> <approved> <status> <output_b64> <error_b64>" >&2
  exit 1
fi

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./write_assistant_action.py}"
SCRIPT_REMOTE="/tmp/write_assistant_action.py"

if [[ ! -f "$SCRIPT_LOCAL" ]]; then
  echo "write_assistant_action.py not found at: $SCRIPT_LOCAL" >&2
  exit 1
fi

docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-submit \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    "$SCRIPT_REMOTE" \
    --table "$ACTION_TABLE" \
    --action-id "$ACTION_ID" \
    --created-at-utc "$CREATED_AT_UTC" \
    --task-type "$TASK_TYPE" \
    --release-name "$RELEASE_NAME" \
    --objective-b64 "$OBJECTIVE_B64" \
    --step-id "$STEP_ID" \
    --step-title-b64 "$STEP_TITLE_B64" \
    --action-type "$ACTION_TYPE" \
    --requires-approval "$REQUIRES_APPROVAL" \
    --approved "$APPROVED" \
    --status "$STATUS" \
    --output-b64 "$OUTPUT_B64" \
    --error-b64 "$ERROR_B64"

echo "[DONE] Recorded assistant action ${ACTION_ID} into ${ACTION_TABLE}"
58
record-assistant-feedback-via-spark-container.sh
Executable file

@@ -0,0 +1,58 @@
#!/usr/bin/env bash
set -euo pipefail

FEEDBACK_TABLE="${FEEDBACK_TABLE:-lake.db1.assistant_feedback}"
FEEDBACK_ID="${1:-}"
CREATED_AT_UTC="${2:-}"
OUTCOME="${3:-}"
TASK_TYPE="${4:-}"
RELEASE_NAME="${5:-}"
CONFIDENCE="${6:-0}"
NEEDS_REVIEW="${7:-true}"
GOAL_B64="${8:-}"
DRAFT_B64="${9:-}"
FINAL_B64="${10:-}"
SOURCES_B64="${11:-}"
NOTES_B64="${12:-}"

if [[ -z "$FEEDBACK_ID" || -z "$CREATED_AT_UTC" || -z "$OUTCOME" || -z "$TASK_TYPE" ]]; then
  echo "Usage: $0 <feedback_id> <created_at_utc> <outcome> <task_type> <release_name> <confidence> <needs_review> <goal_b64> <draft_b64> <final_b64> <sources_b64> <notes_b64>" >&2
  exit 1
fi

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./write_assistant_feedback.py}"
SCRIPT_REMOTE="/tmp/write_assistant_feedback.py"

if [[ ! -f "$SCRIPT_LOCAL" ]]; then
  echo "write_assistant_feedback.py not found at: $SCRIPT_LOCAL" >&2
  exit 1
fi

docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-submit \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    "$SCRIPT_REMOTE" \
    --table "$FEEDBACK_TABLE" \
    --feedback-id "$FEEDBACK_ID" \
    --created-at-utc "$CREATED_AT_UTC" \
    --outcome "$OUTCOME" \
    --task-type "$TASK_TYPE" \
    --release-name "$RELEASE_NAME" \
    --confidence "$CONFIDENCE" \
    --needs-review "$NEEDS_REVIEW" \
    --goal-b64 "$GOAL_B64" \
    --draft-b64 "$DRAFT_B64" \
    --final-b64 "$FINAL_B64" \
    --sources-b64 "$SOURCES_B64" \
    --notes-b64 "$NOTES_B64"

echo "[DONE] Recorded assistant feedback ${FEEDBACK_ID} into ${FEEDBACK_TABLE}"
|
||||||
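The twelve positional arguments end in five `*_b64` fields, which carry free-form text through `docker exec` and spark-submit argv without quoting hazards. A minimal Python sketch of how a hypothetical caller might build the invocation (the argument values here are illustrative, not part of this repo):

```python
import base64
import shlex


def b64_arg(text: str) -> str:
    # Encode free-form text so it survives shell quoting and argv splitting.
    # An empty string stays empty, matching the script's optional fields.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


argv = [
    "./record-assistant-feedback-via-spark-container.sh",
    "fb_0001", "2026-02-14T12:00:00+00:00", "accepted", "summarize",
    "rel_2026-02-14_messages-v1", "0.9", "false",
    b64_arg("Summarize the thread"),    # goal_b64
    b64_arg("draft text"),              # draft_b64
    b64_arg("final text"),              # final_b64
    b64_arg('["lake.db1.messages"]'),   # sources_b64
    b64_arg(""),                        # notes_b64 (empty stays empty)
]
cmd = " ".join(shlex.quote(a) for a in argv)
print(cmd)
```

The base64 alphabet contains no quotes or whitespace, so the encoded fields never need further escaping on the way into the container.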
67 record-run-event-via-spark-container.sh Executable file
@@ -0,0 +1,67 @@
#!/usr/bin/env bash
set -euo pipefail

# Args:
#   1 run_id
#   2 event_type
#   3 event_at_utc
#   4 detail_json_b64
RUN_ID="${1:-}"
EVENT_TYPE="${2:-}"
EVENT_AT_UTC="${3:-}"
DETAIL_JSON_B64="${4:-}"

if [[ -z "$RUN_ID" || -z "$EVENT_TYPE" || -z "$EVENT_AT_UTC" ]]; then
  echo "usage: $0 <run_id> <event_type> <event_at_utc> <detail_json_b64>" >&2
  exit 1
fi

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
RUN_EVENTS_TABLE="${RUN_EVENTS_TABLE:-lake.db1.run_events}"

decode_b64() {
  local s="$1"
  if [[ -z "$s" ]]; then
    printf ""
    return
  fi
  printf '%s' "$s" | base64 -d
}

escape_sql() {
  sed "s/'/''/g"
}

DETAIL_JSON="$(decode_b64 "$DETAIL_JSON_B64" | escape_sql)"
RUN_ID_ESC="$(printf '%s' "$RUN_ID" | escape_sql)"
EVENT_TYPE_ESC="$(printf '%s' "$EVENT_TYPE" | escape_sql)"
EVENT_AT_ESC="$(printf '%s' "$EVENT_AT_UTC" | escape_sql)"

SQL="
CREATE TABLE IF NOT EXISTS ${RUN_EVENTS_TABLE} (
  run_id STRING,
  event_type STRING,
  event_at_utc STRING,
  detail_json STRING,
  ingested_at_utc STRING
) USING iceberg;

INSERT INTO ${RUN_EVENTS_TABLE} VALUES (
  '${RUN_ID_ESC}',
  '${EVENT_TYPE_ESC}',
  '${EVENT_AT_ESC}',
  '${DETAIL_JSON}',
  '${EVENT_AT_ESC}'
);
"

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-sql \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    -e "$SQL"
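Both recorder scripts build their `INSERT` statement by base64-decoding the free-form fields and doubling single quotes (`sed "s/'/''/g"`) so the values are safe inside SQL string literals. A Python mirror of that pipeline, useful for reasoning about what the generated statement looks like:

```python
import base64


def decode_b64(s: str) -> str:
    # Mirrors the script's decode_b64: empty input stays empty.
    return base64.b64decode(s).decode("utf-8") if s else ""


def escape_sql(s: str) -> str:
    # Mirrors escape_sql: double every single quote for a SQL string literal.
    return s.replace("'", "''")


detail_b64 = base64.b64encode(b'{"msg": "it\'s done"}').decode("ascii")
detail = escape_sql(decode_b64(detail_b64))
insert_sql = (
    "INSERT INTO lake.db1.run_events VALUES "
    f"('run_1', 'finished', '2026-02-14T12:00:00Z', '{detail}', '2026-02-14T12:00:00Z');"
)
print(insert_sql)
```

Quote-doubling is the only escaping the scripts perform, which is adequate for Spark SQL string literals but assumes callers do not embed other control sequences.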
92 record-run-via-spark-container.sh Executable file
@@ -0,0 +1,92 @@
#!/usr/bin/env bash
set -euo pipefail

# Args:
#   1 run_id
#   2 run_type
#   3 status
#   4 started_at_utc
#   5 finished_at_utc (or empty)
#   6 actor
#   7 input_json_b64
#   8 output_json_b64
#   9 error_text_b64
RUN_ID="${1:-}"
RUN_TYPE="${2:-}"
STATUS="${3:-}"
STARTED_AT_UTC="${4:-}"
FINISHED_AT_UTC="${5:-}"
ACTOR="${6:-}"
INPUT_JSON_B64="${7:-}"
OUTPUT_JSON_B64="${8:-}"
ERROR_TEXT_B64="${9:-}"

if [[ -z "$RUN_ID" || -z "$RUN_TYPE" || -z "$STATUS" || -z "$STARTED_AT_UTC" ]]; then
  echo "usage: $0 <run_id> <run_type> <status> <started_at_utc> <finished_at_utc> <actor> <input_json_b64> <output_json_b64> <error_text_b64>" >&2
  exit 1
fi

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
RUNS_TABLE="${RUNS_TABLE:-lake.db1.runs}"

decode_b64() {
  local s="$1"
  if [[ -z "$s" ]]; then
    printf ""
    return
  fi
  printf '%s' "$s" | base64 -d
}

escape_sql() {
  sed "s/'/''/g"
}

INPUT_JSON="$(decode_b64 "$INPUT_JSON_B64" | escape_sql)"
OUTPUT_JSON="$(decode_b64 "$OUTPUT_JSON_B64" | escape_sql)"
ERROR_TEXT="$(decode_b64 "$ERROR_TEXT_B64" | escape_sql)"
RUN_ID_ESC="$(printf '%s' "$RUN_ID" | escape_sql)"
RUN_TYPE_ESC="$(printf '%s' "$RUN_TYPE" | escape_sql)"
STATUS_ESC="$(printf '%s' "$STATUS" | escape_sql)"
STARTED_ESC="$(printf '%s' "$STARTED_AT_UTC" | escape_sql)"
FINISHED_ESC="$(printf '%s' "$FINISHED_AT_UTC" | escape_sql)"
ACTOR_ESC="$(printf '%s' "$ACTOR" | escape_sql)"

SQL="
CREATE TABLE IF NOT EXISTS ${RUNS_TABLE} (
  run_id STRING,
  run_type STRING,
  status STRING,
  started_at_utc STRING,
  finished_at_utc STRING,
  actor STRING,
  input_json STRING,
  output_json STRING,
  error_text STRING,
  ingested_at_utc STRING
) USING iceberg;

INSERT INTO ${RUNS_TABLE} VALUES (
  '${RUN_ID_ESC}',
  '${RUN_TYPE_ESC}',
  '${STATUS_ESC}',
  '${STARTED_ESC}',
  '${FINISHED_ESC}',
  '${ACTOR_ESC}',
  '${INPUT_JSON}',
  '${OUTPUT_JSON}',
  '${ERROR_TEXT}',
  '${STARTED_ESC}'
);
"

docker exec \
  -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-sql \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    -e "$SQL"
607 release_projector.py Normal file
@@ -0,0 +1,607 @@
import argparse
import hashlib
import json
import os
import urllib.error
import urllib.request
from datetime import date, datetime, timezone
from typing import Any, Dict, List, Optional

try:
    from dotenv import load_dotenv
except Exception:
    load_dotenv = None


DEFAULT_SPARK_PACKAGES = (
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,"
    "org.apache.iceberg:iceberg-aws-bundle:1.10.1,"
    "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5"
)


def utc_now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def parse_json_maybe(value: Any, expected_type: type, fallback: Any) -> Any:
    if value is None:
        return fallback
    if isinstance(value, expected_type):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
            if isinstance(parsed, expected_type):
                return parsed
        except Exception:
            return fallback
    return fallback


def first_str(row: Dict[str, Any], keys: List[str]) -> Optional[str]:
    for key in keys:
        val = row.get(key)
        if isinstance(val, str) and val.strip():
            return val.strip()
    return None


def to_iso(value: Any) -> Optional[str]:
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, date):
        return datetime.combine(value, datetime.min.time(), timezone.utc).isoformat()
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def make_fingerprint(name: str, kind: Optional[str], external_ids: Dict[str, str]) -> str:
    norm = (name or "").strip().lower()
    kind_norm = (kind or "").strip().lower()
    ext = "|".join(f"{k}:{v}".lower() for k, v in sorted(external_ids.items()))
    raw = f"{norm}|{kind_norm}|{ext}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
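`make_fingerprint` lowercases the name and kind and sorts external IDs, so logically identical concepts hash to the same value regardless of casing, surrounding whitespace, or key order (note that external ID values are lowercased too). A self-contained mirror of the function demonstrating that:

```python
import hashlib


def make_fingerprint(name, kind, external_ids):
    # Mirror of release_projector.make_fingerprint: case-insensitive over
    # name/kind and order-insensitive over external IDs.
    norm = (name or "").strip().lower()
    kind_norm = (kind or "").strip().lower()
    ext = "|".join(f"{k}:{v}".lower() for k, v in sorted(external_ids.items()))
    raw = f"{norm}|{kind_norm}|{ext}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


a = make_fingerprint("Alpha Doc", "document", {"wiki": "W1", "jira": "J9"})
b = make_fingerprint("  alpha doc ", "Document", {"jira": "J9", "wiki": "W1"})
print(a == b)
```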
def load_manifest(path: str) -> Dict[str, Any]:
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)

    if isinstance(raw, dict):
        manifest_json = raw.get("manifest_json")
        if isinstance(manifest_json, str):
            try:
                parsed = json.loads(manifest_json)
                if isinstance(parsed, dict):
                    return parsed
            except Exception:
                pass
        return raw

    if isinstance(raw, list) and raw and isinstance(raw[0], dict):
        manifest_json = raw[0].get("manifest_json")
        if isinstance(manifest_json, str):
            parsed = json.loads(manifest_json)
            if isinstance(parsed, dict):
                return parsed

    raise ValueError("Manifest file must contain a manifest object or releases_v2 row with manifest_json.")


def infer_manifest_ref(manifest: Dict[str, Any]) -> Optional[str]:
    nessie = manifest.get("nessie")
    if isinstance(nessie, dict):
        ref_obj = nessie.get("ref")
        if isinstance(ref_obj, dict):
            ref_name = ref_obj.get("name")
            if isinstance(ref_name, str) and ref_name.strip():
                return ref_name.strip()
        tag = nessie.get("tag")
        if isinstance(tag, str) and tag.strip():
            return tag.strip()

    release_obj = manifest.get("release")
    if isinstance(release_obj, dict):
        release_name = release_obj.get("name")
        if isinstance(release_name, str) and release_name.strip():
            return release_name.strip()

    for key in ("nessie_tag", "tag", "release_name"):
        val = manifest.get(key)
        if isinstance(val, str) and val.strip():
            return val.strip()

    return None


def extract_table_identifiers(manifest: Dict[str, Any]) -> List[str]:
    out: List[str] = []
    tables = manifest.get("tables")
    if isinstance(tables, list):
        for t in tables:
            if not isinstance(t, dict):
                continue
            ident = t.get("table_identifier") or t.get("identifier") or t.get("table")
            if isinstance(ident, str) and ident.strip():
                out.append(ident.strip())

    if out:
        return out

    rows = manifest.get("rows")
    if isinstance(rows, list):
        for row in rows:
            if not isinstance(row, dict):
                continue
            ident = row.get("table_identifier")
            if isinstance(ident, str) and ident.strip():
                out.append(ident.strip())

    return out


def infer_concept_table(tables: List[str]) -> Optional[str]:
    for t in tables:
        lower = t.lower()
        if "concept" in lower:
            return t
    return tables[0] if tables else None
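`infer_manifest_ref` resolves the Nessie ref with a fixed precedence: `nessie.ref.name`, then `nessie.tag`, then `release.name`, then the flat keys `nessie_tag`, `tag`, `release_name`. A condensed mirror showing that precedence on sample manifests:

```python
def infer_manifest_ref(manifest):
    # Condensed mirror of release_projector.infer_manifest_ref.
    nessie = manifest.get("nessie")
    if isinstance(nessie, dict):
        ref_obj = nessie.get("ref")
        if isinstance(ref_obj, dict):
            name = ref_obj.get("name")
            if isinstance(name, str) and name.strip():
                return name.strip()
        tag = nessie.get("tag")
        if isinstance(tag, str) and tag.strip():
            return tag.strip()
    release = manifest.get("release")
    if isinstance(release, dict):
        name = release.get("name")
        if isinstance(name, str) and name.strip():
            return name.strip()
    for key in ("nessie_tag", "tag", "release_name"):
        val = manifest.get(key)
        if isinstance(val, str) and val.strip():
            return val.strip()
    return None


# nessie.ref.name wins over release.name:
print(infer_manifest_ref({"nessie": {"ref": {"name": "rel_tag"}}, "release": {"name": "other"}}))
```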
def load_manifest_from_registry(
    spark: Any,
    catalog: str,
    release_name: str,
    releases_table: Optional[str] = None,
) -> Dict[str, Any]:
    from pyspark.sql import functions as F

    table = releases_table or os.getenv("RELEASES_TABLE", "db1.releases_v2")
    if table.count(".") == 1:
        table = f"{catalog}.{table}"

    row = (
        spark.table(table)
        .where(F.col("release_name") == release_name)
        .orderBy(F.col("ingested_at_utc").desc_nulls_last())
        .select("manifest_json")
        .limit(1)
        .collect()
    )
    if not row:
        raise ValueError(f"Release '{release_name}' not found in registry table {table}.")

    manifest_json = row[0]["manifest_json"]
    if not isinstance(manifest_json, str) or not manifest_json.strip():
        raise ValueError(f"Release '{release_name}' has empty manifest_json in {table}.")

    manifest = json.loads(manifest_json)
    if not isinstance(manifest, dict):
        raise ValueError(f"Release '{release_name}' manifest_json is not a JSON object.")
    return manifest


def build_spark(ref: str):
    try:
        from pyspark.sql import SparkSession
    except Exception as e:
        raise RuntimeError(
            "pyspark is not installed. Install it or run this with spark-submit."
        ) from e

    catalog = os.getenv("SPARK_CATALOG", "lake")

    builder = (
        SparkSession.builder.appName("release-projector")
        .config(
            "spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        )
        .config("spark.jars.packages", os.getenv("SPARK_PACKAGES", DEFAULT_SPARK_PACKAGES))
        .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog")
        .config(f"spark.sql.catalog.{catalog}.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
        .config(f"spark.sql.catalog.{catalog}.uri", os.getenv("NESSIE_URI", "http://lakehouse-core:19120/api/v2"))
        .config(f"spark.sql.catalog.{catalog}.ref", ref)
        .config(
            f"spark.sql.catalog.{catalog}.warehouse",
            os.getenv("NESSIE_WAREHOUSE", "s3a://lakehouse/warehouse"),
        )
        .config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.hadoop.fs.s3a.endpoint", os.getenv("S3_ENDPOINT", "http://lakehouse-core:9000"))
        .config("spark.hadoop.fs.s3a.path.style.access", os.getenv("S3_PATH_STYLE", "true"))
        .config(
            "spark.hadoop.fs.s3a.access.key",
            os.getenv("AWS_ACCESS_KEY_ID", os.getenv("MINIO_ROOT_USER", "minioadmin")),
        )
        .config(
            "spark.hadoop.fs.s3a.secret.key",
            os.getenv("AWS_SECRET_ACCESS_KEY", os.getenv("MINIO_ROOT_PASSWORD", "minioadmin")),
        )
    )

    spark_master = os.getenv("SPARK_MASTER")
    if spark_master:
        builder = builder.master(spark_master)

    return builder.getOrCreate(), catalog
def ensure_es_index(es_url: str, es_index: str) -> None:
    mapping = {
        "mappings": {
            "properties": {
                "concept_id": {"type": "keyword"},
                "concept_type": {"type": "keyword"},
                "display_name": {"type": "text"},
                "description": {"type": "text"},
                "text": {"type": "text"},
                "source_table": {"type": "keyword"},
                "source_pk": {"type": "keyword"},
                "release_name": {"type": "keyword"},
                "ref_hash": {"type": "keyword"},
                "attributes_json": {"type": "text"},
                "canonical_name": {"type": "text"},
                "kind": {"type": "keyword"},
                "aliases": {"type": "text"},
                "tags": {"type": "keyword"},
                "summary": {"type": "text"},
                "latest_cid": {"type": "keyword"},
                "fingerprint": {"type": "keyword"},
                "created_at": {"type": "date"},
                "updated_at": {"type": "date"},
            }
        }
    }
    url = f"{es_url.rstrip('/')}/{es_index}"
    req_get = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(req_get, timeout=30) as resp:
            if 200 <= resp.status < 300:
                return
    except urllib.error.HTTPError as e:
        if e.code != 404:
            raise

    body = json.dumps(mapping).encode("utf-8")
    req_put = urllib.request.Request(url, data=body, method="PUT")
    req_put.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req_put, timeout=30) as resp:
        if resp.status >= 400:
            raise RuntimeError(f"Failed to create ES index {es_index}: HTTP {resp.status}")


def es_upsert(es_url: str, es_index: str, doc: Dict[str, Any]) -> None:
    url = f"{es_url.rstrip('/')}/{es_index}/_doc/{doc['concept_id']}"
    body = json.dumps(doc, default=str).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="PUT")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=30) as resp:
        if resp.status >= 400:
            raise RuntimeError(f"Failed ES upsert for {doc['concept_id']}: HTTP {resp.status}")


def gremlin_upsert(gremlin_url: str, concept: Dict[str, Any]) -> None:
    from gremlin_python.driver import client as gremlin_client
    from gremlin_python.driver.serializer import GraphSONSerializersV3d0

    created_at = concept.get("created_at") or utc_now_iso()
    updated_at = concept.get("updated_at") or utc_now_iso()

    query = """
    g.V().hasLabel('Concept').has('concept_id', concept_id).fold()
      .coalesce(
        unfold(),
        addV('Concept').property('concept_id', concept_id).property('created_at', created_at)
      )
      .property('canonical_name', canonical_name)
      .property('kind', kind)
      .property('concept_type', concept_type)
      .property('display_name', display_name)
      .property('description', description)
      .property('text', text)
      .property('source_table', source_table)
      .property('source_pk', source_pk)
      .property('release_name', release_name)
      .property('ref_hash', ref_hash)
      .property('attributes_json', attributes_json)
      .property('aliases', aliases_json)
      .property('external_ids', external_ids_json)
      .property('tags', tags_json)
      .property('fingerprint', fingerprint)
      .property('latest_cid', latest_cid)
      .property('summary', summary)
      .property('updated_at', updated_at)
      .values('concept_id')
    """

    c = gremlin_client.Client(
        gremlin_url,
        "g",
        message_serializer=GraphSONSerializersV3d0(),
    )
    try:
        c.submit(
            query,
            {
                "concept_id": concept["concept_id"],
                "canonical_name": concept.get("canonical_name") or "",
                "kind": concept.get("kind") or "",
                "concept_type": concept.get("concept_type") or "",
                "display_name": concept.get("display_name") or "",
                "description": concept.get("description") or "",
                "text": concept.get("text") or "",
                "source_table": concept.get("source_table") or "",
                "source_pk": concept.get("source_pk") or "",
                "release_name": concept.get("release_name") or "",
                "ref_hash": concept.get("ref_hash") or "",
                "attributes_json": concept.get("attributes_json") or "{}",
                "aliases_json": json.dumps(concept.get("aliases", []), ensure_ascii=False),
                "external_ids_json": json.dumps(concept.get("external_ids", {}), ensure_ascii=False),
                "tags_json": json.dumps(concept.get("tags", []), ensure_ascii=False),
                "fingerprint": concept["fingerprint"],
                "latest_cid": concept.get("latest_cid") or "",
                "summary": concept.get("summary") or "",
                "created_at": created_at,
                "updated_at": updated_at,
            },
        ).all().result()
    finally:
        c.close()
def _infer_concept_type(row: Dict[str, Any], source_table: Optional[str]) -> str:
    explicit = first_str(row, ["concept_type", "kind", "type"])
    if explicit:
        return explicit.lower()
    lower_table = (source_table or "").lower()
    if "messages" in lower_table:
        return "message"
    if "docs" in lower_table or "documents" in lower_table:
        return "document"
    if "message_id" in row:
        return "message"
    if "doc_id" in row or "document_id" in row:
        return "document"
    return "entity"


def _source_pk(row: Dict[str, Any]) -> Optional[str]:
    return first_str(row, ["source_pk", "message_id", "doc_id", "document_id", "id", "uuid"])


def row_to_concept(
    row: Dict[str, Any],
    source_table: Optional[str],
    release_name: Optional[str],
    ref_hash: Optional[str],
) -> Optional[Dict[str, Any]]:
    concept_type = _infer_concept_type(row, source_table)
    source_pk = _source_pk(row)
    display_name = first_str(
        row,
        [
            "display_name",
            "canonical_name",
            "title",
            "name",
            "subject",
            "doc_name",
            "document_name",
        ],
    )
    if not display_name and source_pk:
        display_name = f"{concept_type}:{source_pk}"
    if not display_name:
        display_name = first_str(row, ["body", "text", "content"])
        if display_name:
            display_name = display_name[:120]
    if not display_name:
        return None

    external_ids = parse_json_maybe(row.get("external_ids"), dict, {})
    aliases = parse_json_maybe(row.get("aliases"), list, [])
    tags = parse_json_maybe(row.get("tags"), list, [])

    kind = first_str(row, ["kind", "type", "doc_type", "document_type"]) or concept_type

    concept_id = first_str(row, ["concept_id", "doc_id", "document_id", "id", "uuid"])
    if not concept_id and source_pk:
        concept_id = f"{concept_type}:{source_pk}"
    if not isinstance(concept_id, str) or not concept_id.strip():
        concept_id = hashlib.sha256(
            f"{concept_type}|{display_name}|{json.dumps(external_ids, sort_keys=True)}".encode("utf-8")
        ).hexdigest()

    description = first_str(row, ["description", "summary", "abstract"])
    if not description:
        body = first_str(row, ["content", "text", "body"])
        if body:
            description = body[:512]

    text = first_str(row, ["text", "content", "body"])
    if not text:
        text = description

    # Keep typed attributes stable and searchable without exploding ES mapping.
    attributes_obj = row

    return {
        "concept_id": concept_id,
        "concept_type": concept_type,
        "display_name": display_name,
        "description": description,
        "text": text,
        "source_table": source_table,
        "source_pk": source_pk,
        "release_name": release_name,
        "ref_hash": ref_hash,
        "attributes_json": json.dumps(attributes_obj, ensure_ascii=False, default=str, sort_keys=True),
        "canonical_name": display_name,
        "kind": kind,
        "aliases": aliases,
        "external_ids": external_ids,
        "tags": tags,
        "latest_cid": first_str(row, ["latest_cid", "cid", "ipfs_cid"]),
        "summary": description,
        "created_at": to_iso(row.get("created_at")) or utc_now_iso(),
        "updated_at": to_iso(row.get("updated_at")) or utc_now_iso(),
        "fingerprint": make_fingerprint(display_name, concept_type, external_ids),
    }
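`row_to_concept` resolves `concept_id` through a three-step fallback: an explicit ID column, then `<concept_type>:<source_pk>`, and finally a SHA-256 over the type, display name, and sorted external IDs. A small sketch of just that chain (the helper name is illustrative, not part of the module):

```python
import hashlib
import json


def fallback_concept_id(concept_type, display_name, external_ids,
                        explicit=None, source_pk=None):
    # Sketch of the concept_id fallback chain in row_to_concept.
    if isinstance(explicit, str) and explicit.strip():
        return explicit.strip()
    if source_pk:
        return f"{concept_type}:{source_pk}"
    # Last resort: deterministic hash so reruns produce the same ID.
    return hashlib.sha256(
        f"{concept_type}|{display_name}|{json.dumps(external_ids, sort_keys=True)}".encode("utf-8")
    ).hexdigest()


print(fallback_concept_id("message", "Hello", {}, source_pk="m-42"))
```

The hash fallback keeps upserts idempotent in both ES and the graph even when a source row carries no usable key.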
def project_release(
    manifest_file: Optional[str],
    release_name: Optional[str],
    concept_table: Optional[str],
    nessie_ref: Optional[str],
    releases_ref: Optional[str],
    dry_run: bool,
    targets: str,
) -> None:
    if not manifest_file and not release_name:
        raise ValueError("Provide either --manifest-file or --release-name.")

    manifest: Optional[Dict[str, Any]] = load_manifest(manifest_file) if manifest_file else None

    # Release-name mode: lookup manifest on registry ref (usually main), then project on release tag.
    if manifest is None and release_name:
        registry_ref = releases_ref or os.getenv("RELEASES_REF", "main")
        spark, catalog = build_spark(registry_ref)
        manifest = load_manifest_from_registry(spark, catalog, release_name)
        ref = nessie_ref or infer_manifest_ref(manifest) or release_name
        if ref != registry_ref:
            spark.stop()
            spark, catalog = build_spark(ref)
    else:
        ref = nessie_ref or (infer_manifest_ref(manifest) if manifest else None) or release_name
        if not ref:
            raise ValueError("Unable to infer Nessie ref/tag; pass --nessie-ref explicitly.")
        spark, catalog = build_spark(ref)

    table_identifiers: List[str] = extract_table_identifiers(manifest) if manifest else []
    table = concept_table or (infer_concept_table(table_identifiers) if manifest else None)
    if not table:
        raise ValueError("Unable to infer concept table; pass --concept-table explicitly.")

    if table.count(".") == 1:
        table = f"{catalog}.{table}"

    print(f"[INFO] Using Nessie ref/tag: {ref}")
    print(f"[INFO] Reading table: {table}")

    release_name_effective = None
    ref_hash = None
    if manifest:
        rel = manifest.get("release")
        if isinstance(rel, dict):
            rel_name = rel.get("name")
            if isinstance(rel_name, str) and rel_name.strip():
                release_name_effective = rel_name.strip()
        nes = manifest.get("nessie")
        if isinstance(nes, dict):
            ref_obj = nes.get("ref")
            if isinstance(ref_obj, dict):
                h = ref_obj.get("hash")
                if isinstance(h, str) and h.strip():
                    ref_hash = h.strip()
    if not release_name_effective and release_name and isinstance(release_name, str) and release_name.strip():
        release_name_effective = release_name.strip()

    df = spark.table(table)
    rows = [r.asDict(recursive=True) for r in df.collect()]
    concepts = [c for c in (row_to_concept(r, table, release_name_effective, ref_hash) for r in rows) if c]

    print(f"[INFO] Read {len(rows)} rows, {len(concepts)} valid concepts")
    print("[STEP] spark_read_done")
    if dry_run:
        print("[INFO] Dry-run enabled. No writes performed.")
        return

    use_es = targets in ("both", "es")
    use_gremlin = targets in ("both", "gremlin")
    print(f"[INFO] Projection targets: {targets}")

    gremlin_url = os.getenv("GREMLIN_URL", "ws://localhost:8182/gremlin")
    es_url = os.getenv("ES_URL", "http://localhost:9200")
    es_index = os.getenv("ES_INDEX", "concepts")

    if use_es:
        ensure_es_index(es_url, es_index)

    success = 0
    failures = 0
    gremlin_missing = False
    es_missing = False
    for concept in concepts:
        try:
            wrote_any = False
            if use_gremlin and not gremlin_missing:
                try:
                    gremlin_upsert(gremlin_url, concept)
                    wrote_any = True
                except ModuleNotFoundError as e:
                    gremlin_missing = True
                    print(f"[WARN] Gremlin dependency missing ({e}). Continuing with ES only.")
                except Exception as e:
                    print(f"[WARN] Gremlin upsert failed for {concept.get('concept_id')}: {e}")

            if use_es and not es_missing:
                try:
                    es_upsert(es_url, es_index, concept)
                    wrote_any = True
                except ModuleNotFoundError as e:
                    es_missing = True
                    print(f"[WARN] ES dependency missing ({e}). Continuing with Gremlin only.")
                except Exception as e:
                    print(f"[WARN] ES upsert failed for {concept.get('concept_id')}: {e}")

            if wrote_any:
                success += 1
            else:
                failures += 1
                print(f"[WARN] No projection target succeeded for {concept.get('concept_id')}")
        except Exception as e:
            failures += 1
            print(f"[WARN] Failed concept {concept.get('concept_id')}: {e}")

    print("[STEP] projection_done")
    print(f"[DONE] Projected {success} concepts ({failures} failed)")
def parse_args() -> argparse.Namespace:
|
||||||
|
p = argparse.ArgumentParser(description="Project a lakehouse release into JanusGraph + Elasticsearch.")
|
||||||
|
p.add_argument("--manifest-file", help="Path to release manifest JSON")
|
||||||
|
p.add_argument("--release-name", help="Release name to load from releases_v2 registry")
|
||||||
|
p.add_argument("--concept-table", help="Full Iceberg table identifier holding concepts")
|
||||||
|
p.add_argument("--nessie-ref", help="Nessie branch/tag to read from (defaults to manifest tag)")
|
||||||
|
p.add_argument("--releases-ref", help="Nessie ref used to read releases_v2 (default: main)")
|
||||||
|
p.add_argument(
|
||||||
|
"--targets",
|
||||||
|
choices=["es", "gremlin", "both"],
|
||||||
|
default="both",
|
||||||
|
help="Projection targets to write (default: both)",
|
||||||
|
)
|
||||||
|
p.add_argument("--dry-run", action="store_true", help="Read and validate only")
|
||||||
|
return p.parse_args()
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
if load_dotenv is not None:
|
||||||
|
load_dotenv()
|
||||||
|
args = parse_args()
|
||||||
|
project_release(
|
||||||
|
manifest_file=args.manifest_file,
|
||||||
|
release_name=args.release_name,
|
||||||
|
concept_table=args.concept_table,
|
||||||
|
nessie_ref=args.nessie_ref,
|
||||||
|
releases_ref=args.releases_ref,
|
||||||
|
dry_run=args.dry_run,
|
||||||
|
targets=args.targets,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
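The per-concept loop above fans out to each projection target independently: a concept counts as a success if at least one target write lands, and a missing optional dependency disables that target for the rest of the run instead of aborting. A minimal, self-contained sketch of that bookkeeping (the `writers` mapping and stub writer functions are illustrative, not the script's actual API):

```python
# Sketch of the projector's fan-out bookkeeping, under the assumption that
# each target exposes a write(concept) callable. A concept succeeds when any
# target write succeeds; a ModuleNotFoundError disables that target for the
# remainder of the run; other per-target failures are non-fatal.
def project(concepts, writers):
    success = failures = 0
    disabled = set()
    for concept in concepts:
        wrote_any = False
        for name, write in writers.items():
            if name in disabled:
                continue
            try:
                write(concept)
                wrote_any = True
            except ModuleNotFoundError:
                disabled.add(name)  # e.g. client library not installed
            except Exception:
                pass  # per-target failure does not fail the concept
        if wrote_any:
            success += 1
        else:
            failures += 1
    return success, failures
```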
8
requirements-app.txt
Normal file
@@ -0,0 +1,8 @@
fastapi>=0.115,<1.0
uvicorn[standard]>=0.32,<1.0
pydantic>=2.9,<3.0
httpx>=0.28,<1.0
gremlinpython>=3.7,<4.0
python-dotenv>=1.0,<2.0
requests>=2.32,<3.0
websocket-client>=1.8,<2.0
4
requirements-projector.txt
Normal file
@@ -0,0 +1,4 @@
pyspark==3.5.8
python-dotenv>=1.0,<2.0
httpx>=0.28,<1.0
gremlinpython>=3.7,<4.0
67
run-projector-standard.sh
Executable file
@@ -0,0 +1,67 @@
#!/usr/bin/env bash
set -euo pipefail

# Canonical projector command for lakehouse-core.
# Usage:
#   ./run-projector-standard.sh                # publish (both targets)
#   ./run-projector-standard.sh --dry-run      # validate only
#   ./run-projector-standard.sh --targets es   # ES-only publish
#   ./run-projector-standard.sh --release-name rel_2026-02-14_docs-v1

MANIFEST_FILE="${MANIFEST_FILE:-./manifests/rel_2026-02-14_docs-v1.json}"
CONCEPT_TABLE="${CONCEPT_TABLE:-lake.db1.docs}"
TARGETS="${TARGETS:-both}"
RELEASE_NAME="${RELEASE_NAME:-}"
MODE=""

while [[ $# -gt 0 ]]; do
  case "$1" in
    --dry-run)
      MODE="--dry-run"
      shift
      ;;
    --targets)
      TARGETS="${2:-}"
      if [[ -z "$TARGETS" ]]; then
        echo "--targets requires one of: es|gremlin|both" >&2
        exit 1
      fi
      shift 2
      ;;
    --manifest-file)
      MANIFEST_FILE="${2:-}"
      if [[ -z "$MANIFEST_FILE" ]]; then
        echo "--manifest-file requires a value" >&2
        exit 1
      fi
      shift 2
      ;;
    --release-name)
      RELEASE_NAME="${2:-}"
      if [[ -z "$RELEASE_NAME" ]]; then
        echo "--release-name requires a value" >&2
        exit 1
      fi
      shift 2
      ;;
    --concept-table)
      CONCEPT_TABLE="${2:-}"
      if [[ -z "$CONCEPT_TABLE" ]]; then
        echo "--concept-table requires a value" >&2
        exit 1
      fi
      shift 2
      ;;
    *)
      echo "Unknown argument: $1" >&2
      exit 1
      ;;
  esac
done

if [[ "$TARGETS" != "es" && "$TARGETS" != "gremlin" && "$TARGETS" != "both" ]]; then
  echo "Invalid --targets value: $TARGETS (expected es|gremlin|both)" >&2
  exit 1
fi

./run-projector-via-spark-container.sh "$MANIFEST_FILE" "$CONCEPT_TABLE" "$MODE" "$TARGETS" "$RELEASE_NAME"
63
run-projector-via-spark-container.sh
Executable file
@@ -0,0 +1,63 @@
#!/usr/bin/env bash
set -euo pipefail

MANIFEST_FILE="${1:-/tmp/rel_2026-02-14_docs-v1.json}"
CONCEPT_TABLE="${2:-lake.db1.docs}"
MODE="${3:-}"
TARGETS="${4:-both}"
RELEASE_NAME="${5:-${RELEASE_NAME:-}}"

CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"

SCRIPT_LOCAL="${SCRIPT_LOCAL:-./release_projector.py}"
SCRIPT_REMOTE="/tmp/release_projector.py"
MANIFEST_REMOTE="/tmp/$(basename "$MANIFEST_FILE")"

if [[ ! -f "$SCRIPT_LOCAL" ]]; then
  echo "release_projector.py not found at: $SCRIPT_LOCAL" >&2
  exit 1
fi

if [[ -z "$RELEASE_NAME" && ! -f "$MANIFEST_FILE" ]]; then
  echo "manifest file not found: $MANIFEST_FILE (or provide release name arg5)" >&2
  exit 1
fi

docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"
if [[ -f "$MANIFEST_FILE" ]]; then
  docker cp "$MANIFEST_FILE" "$CONTAINER_NAME":"$MANIFEST_REMOTE"
fi

ARGS=(
  "$SCRIPT_REMOTE"
  "--concept-table" "$CONCEPT_TABLE"
  "--targets" "$TARGETS"
)

if [[ -n "$RELEASE_NAME" ]]; then
  ARGS+=("--release-name" "$RELEASE_NAME")
else
  ARGS+=("--manifest-file" "$MANIFEST_REMOTE")
fi

if [[ -n "$MODE" ]]; then
  ARGS+=("$MODE")
fi

docker exec -e AWS_REGION="${AWS_REGION:-us-east-1}" \
  -e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
  -e NESSIE_URI="${NESSIE_URI:-http://lakehouse-core:19120/api/v2}" \
  -e NESSIE_WAREHOUSE="${NESSIE_WAREHOUSE:-s3a://lakehouse/warehouse}" \
  -e S3_ENDPOINT="${S3_ENDPOINT:-http://lakehouse-core:9000}" \
  -e AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID:-minioadmin}" \
  -e AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY:-minioadmin}" \
  -e GREMLIN_URL="${GREMLIN_URL:-ws://janus.rakeroots.lan:8182/gremlin}" \
  -e ES_URL="${ES_URL:-http://janus.rakeroots.lan:9200}" \
  -e ES_INDEX="${ES_INDEX:-concepts}" \
  "$CONTAINER_NAME" \
  /opt/spark/bin/spark-submit \
    --properties-file "$SPARK_PROPS" \
    --packages "$PACKAGES" \
    "${ARGS[@]}"
11
setup_local_env.sh
Executable file
@@ -0,0 +1,11 @@
#!/usr/bin/env bash
set -euo pipefail

VENV_DIR="${1:-.venv}"

python3 -m venv "$VENV_DIR"
"$VENV_DIR/bin/pip" install --upgrade pip
"$VENV_DIR/bin/pip" install -r requirements-app.txt -r requirements-projector.txt

echo "Environment ready: $VENV_DIR"
echo "Activate with: source $VENV_DIR/bin/activate"
215
ui/assets/app.js
Normal file
@@ -0,0 +1,215 @@
function getConfig() {
  return {
    apiKey: document.getElementById("apiKey").value.trim(),
    releaseName: document.getElementById("releaseName").value.trim(),
  };
}

function saveConfig() {
  const cfg = getConfig();
  cfg.chatSessionId = document.getElementById("chatSessionId").value.trim();
  localStorage.setItem("assistant_ui_cfg", JSON.stringify(cfg));
}

function loadConfig() {
  try {
    const raw = localStorage.getItem("assistant_ui_cfg");
    if (!raw) return;
    const cfg = JSON.parse(raw);
    document.getElementById("apiKey").value = cfg.apiKey || "";
    document.getElementById("releaseName").value = cfg.releaseName || "";
    document.getElementById("chatSessionId").value = cfg.chatSessionId || "main";
  } catch (_) {}
}

async function apiGet(path, params) {
  const cfg = getConfig();
  const url = new URL(path, window.location.origin);
  Object.entries(params || {}).forEach(([k, v]) => {
    if (v !== null && v !== undefined && String(v).length > 0) url.searchParams.set(k, String(v));
  });
  const r = await fetch(url, {
    headers: { "X-Admin-Api-Key": cfg.apiKey },
  });
  if (!r.ok) throw new Error(await r.text());
  return r.json();
}

async function apiPost(path, payload) {
  const cfg = getConfig();
  const r = await fetch(path, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Admin-Api-Key": cfg.apiKey,
    },
    body: JSON.stringify(payload),
  });
  if (!r.ok) throw new Error(await r.text());
  return r.json();
}

function renderRows(target, rows, formatter) {
  target.innerHTML = "";
  if (!rows || rows.length === 0) {
    target.innerHTML = '<div class="row">No rows.</div>';
    return;
  }
  rows.forEach((row) => {
    const el = document.createElement("div");
    el.className = "row";
    el.innerHTML = formatter(row);
    target.appendChild(el);
  });
}

async function loadInbox() {
  const cfg = getConfig();
  const q = document.getElementById("inboxQuery").value.trim();
  const out = document.getElementById("inboxResults");
  out.innerHTML = '<div class="row">Loading...</div>';
  try {
    const data = await apiGet("/assistant/inbox", { release_name: cfg.releaseName, q, limit: 20 });
    renderRows(out, data.rows || [], (r) => {
      const text = (r.text || r.summary || r.description || "").slice(0, 280);
      return `
        <div><strong>${r.display_name || r.concept_id || "message"}</strong></div>
        <div>${text || "(no text)"}</div>
        <div class="meta">${r.source_pk || ""} | ${r.release_name || ""}</div>
      `;
    });
  } catch (e) {
    out.innerHTML = `<div class="row">Error: ${String(e)}</div>`;
  }
}

async function loadTasks() {
  const cfg = getConfig();
  const onlyPending = document.getElementById("onlyPending").checked;
  const out = document.getElementById("taskResults");
  out.innerHTML = '<div class="row">Loading...</div>';
  try {
    const data = await apiGet("/assistant/tasks", {
      release_name: cfg.releaseName,
      only_pending: onlyPending,
      limit: 30,
    });
    renderRows(out, data.rows || [], (r) => {
      const safeTodo = (r.todo || "").replace(/"/g, "&quot;");
      return `
        <div><strong>${r.todo || "(empty task)"}</strong></div>
        <div class="meta">status=${r.status} | due=${r.due_hint || "-"} | who=${r.who || "-"}</div>
        <div class="meta">source=${r.source_pk || ""} | release=${r.release_name || ""}</div>
        <div style="margin-top:6px"><button data-goal="${safeTodo}" class="use-goal">Use as goal</button></div>
      `;
    });
    document.querySelectorAll(".use-goal").forEach((btn) => {
      btn.addEventListener("click", () => {
        const goal = btn.getAttribute("data-goal") || "";
        document.getElementById("goalText").value = goal;
      });
    });
  } catch (e) {
    out.innerHTML = `<div class="row">Error: ${String(e)}</div>`;
  }
}

async function makeDraft() {
  const cfg = getConfig();
  const goal = document.getElementById("goalText").value.trim();
  const recipient = document.getElementById("recipient").value.trim();
  const out = document.getElementById("draftOutput");
  if (!goal) {
    out.textContent = "Provide goal text first.";
    return;
  }
  out.textContent = "Generating...";
  try {
    const data = await apiPost("/assistant/draft", {
      task_type: "message",
      goal,
      recipient: recipient || null,
      tone: "friendly-professional",
      constraints: ["keep it concise"],
      release_name: cfg.releaseName || null,
      max_sources: 5,
    });
    const sourceLine = (data.sources || []).map((s) => s.concept_id).filter(Boolean).slice(0, 5).join(", ");
    out.textContent = `${data.draft || ""}\n\nconfidence=${data.confidence}\nneeds_review=${data.needs_review}\nsources=${sourceLine}`;
  } catch (e) {
    out.textContent = `Error: ${String(e)}`;
  }
}

async function saveLearn() {
  const cfg = getConfig();
  const title = document.getElementById("learnTitle").value.trim();
  const tags = document.getElementById("learnTags").value
    .split(",")
    .map((x) => x.trim())
    .filter(Boolean);
  const text = document.getElementById("learnText").value.trim();
  const out = document.getElementById("learnOutput");
  if (!text) {
    out.textContent = "Provide note text first.";
    return;
  }
  out.textContent = "Saving...";
  try {
    const data = await apiPost("/assistant/learn", {
      text,
      title: title || null,
      tags,
      release_name: cfg.releaseName || null,
    });
    out.textContent = `saved=${data.stored}\nconcept_id=${data.concept_id}\ntitle=${data.title}`;
    document.getElementById("learnText").value = "";
  } catch (e) {
    out.textContent = `Error: ${String(e)}`;
  }
}

function appendChat(role, text, meta) {
  const target = document.getElementById("chatTranscript");
  const el = document.createElement("div");
  el.className = "row";
  el.innerHTML = `
    <div><strong>${role}</strong></div>
    <div>${(text || "").replace(/\n/g, "<br/>")}</div>
    ${meta ? `<div class="meta">${meta}</div>` : ""}
  `;
  target.prepend(el);
}

async function sendChat() {
  const cfg = getConfig();
  const sessionInput = document.getElementById("chatSessionId");
  const session_id = (sessionInput.value || "main").trim();
  sessionInput.value = session_id;
  const messageEl = document.getElementById("chatMessage");
  const message = messageEl.value.trim();
  if (!message) return;
  appendChat("user", message, `session=${session_id}`);
  messageEl.value = "";
  try {
    const data = await apiPost("/assistant/chat", {
      session_id,
      message,
      release_name: cfg.releaseName || null,
      max_sources: 6,
    });
    const sourceLine = (data.sources || []).map((s) => s.concept_id).filter(Boolean).slice(0, 4).join(", ");
    appendChat("assistant", data.answer || "", `confidence=${data.confidence} | sources=${sourceLine || "-"}`);
  } catch (e) {
    appendChat("assistant", `Error: ${String(e)}`, "");
  }
}

document.getElementById("saveConfig").addEventListener("click", saveConfig);
document.getElementById("loadInbox").addEventListener("click", loadInbox);
document.getElementById("loadTasks").addEventListener("click", loadTasks);
document.getElementById("makeDraft").addEventListener("click", makeDraft);
document.getElementById("saveLearn").addEventListener("click", saveLearn);
document.getElementById("sendChat").addEventListener("click", sendChat);

loadConfig();
124
ui/assets/styles.css
Normal file
@@ -0,0 +1,124 @@
:root {
  --bg: #f2f4f5;
  --panel: #ffffff;
  --ink: #182126;
  --muted: #5c6770;
  --line: #dde4e8;
  --accent: #0f766e;
}

* {
  box-sizing: border-box;
}

body {
  margin: 0;
  font-family: "IBM Plex Sans", "Segoe UI", sans-serif;
  color: var(--ink);
  background: linear-gradient(165deg, #e9eff2 0%, #f8fafb 100%);
}

.layout {
  max-width: 1100px;
  margin: 0 auto;
  padding: 18px;
  display: grid;
  gap: 14px;
}

.topbar {
  background: var(--panel);
  border: 1px solid var(--line);
  border-radius: 10px;
  padding: 12px;
  display: flex;
  justify-content: space-between;
  align-items: center;
  gap: 12px;
}

.topbar h1,
.panel h2 {
  margin: 0;
  font-size: 18px;
}

.panel {
  background: var(--panel);
  border: 1px solid var(--line);
  border-radius: 10px;
  padding: 12px;
}

.panel-header {
  display: flex;
  justify-content: space-between;
  align-items: center;
  gap: 12px;
  margin-bottom: 8px;
}

.controls {
  display: flex;
  gap: 8px;
  align-items: center;
  flex-wrap: wrap;
}

input,
textarea,
button {
  font: inherit;
}

input,
textarea {
  border: 1px solid var(--line);
  border-radius: 7px;
  padding: 8px;
  background: #fff;
}

button {
  border: 1px solid #0d5f59;
  background: var(--accent);
  color: #fff;
  border-radius: 7px;
  padding: 8px 10px;
  cursor: pointer;
}

button:hover {
  filter: brightness(0.95);
}

.list {
  display: grid;
  gap: 8px;
}

.row {
  border: 1px solid var(--line);
  border-radius: 8px;
  padding: 8px;
}

.row .meta {
  color: var(--muted);
  font-size: 12px;
  margin-top: 4px;
}

.output {
  white-space: pre-wrap;
  border: 1px solid var(--line);
  border-radius: 8px;
  padding: 10px;
  min-height: 96px;
  background: #fbfdfe;
}

#chatTranscript {
  max-height: 360px;
  overflow: auto;
}
82
ui/index.html
Normal file
@@ -0,0 +1,82 @@
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1" />
  <title>Jecio Assistant Console</title>
  <link rel="stylesheet" href="/ui/assets/styles.css" />
</head>
<body>
  <main class="layout">
    <header class="topbar">
      <h1>Assistant Console</h1>
      <div class="controls">
        <input id="apiKey" type="password" placeholder="X-Admin-Api-Key" />
        <input id="releaseName" type="text" placeholder="release_name (optional)" />
        <button id="saveConfig">Save</button>
      </div>
    </header>

    <section class="panel">
      <div class="panel-header">
        <h2>Inbox</h2>
        <div class="controls">
          <input id="inboxQuery" type="text" placeholder="Search text (optional)" />
          <button id="loadInbox">Load Inbox</button>
        </div>
      </div>
      <div id="inboxResults" class="list"></div>
    </section>

    <section class="panel">
      <div class="panel-header">
        <h2>Pending Tasks</h2>
        <div class="controls">
          <label><input id="onlyPending" type="checkbox" checked /> Only pending</label>
          <button id="loadTasks">Load Tasks</button>
        </div>
      </div>
      <div id="taskResults" class="list"></div>
    </section>

    <section class="panel">
      <div class="panel-header">
        <h2>Draft</h2>
        <div class="controls">
          <input id="recipient" type="text" placeholder="Recipient (optional)" />
          <button id="makeDraft">Draft From Goal</button>
        </div>
      </div>
      <textarea id="goalText" rows="3" placeholder="Goal text (or click 'Use as goal' from a task)"></textarea>
      <pre id="draftOutput" class="output"></pre>
    </section>

    <section class="panel">
      <div class="panel-header">
        <h2>Learn</h2>
        <div class="controls">
          <input id="learnTitle" type="text" placeholder="Title (optional)" />
          <input id="learnTags" type="text" placeholder="tags comma-separated (optional)" />
          <button id="saveLearn">Save Note</button>
        </div>
      </div>
      <textarea id="learnText" rows="3" placeholder="Knowledge note you want the assistant to remember"></textarea>
      <pre id="learnOutput" class="output"></pre>
    </section>

    <section class="panel">
      <div class="panel-header">
        <h2>Chat</h2>
        <div class="controls">
          <input id="chatSessionId" type="text" placeholder="session_id (default: main)" />
          <button id="sendChat">Send</button>
        </div>
      </div>
      <textarea id="chatMessage" rows="2" placeholder="Ask the assistant..."></textarea>
      <div id="chatTranscript" class="list"></div>
    </section>
  </main>

  <script src="/ui/assets/app.js"></script>
</body>
</html>
106
write_assistant_action.py
Normal file
@@ -0,0 +1,106 @@
import argparse
import json
import base64

from pyspark.sql import SparkSession, types as T


def d(s: str) -> str:
    if not s:
        return ""
    return base64.b64decode(s.encode("ascii")).decode("utf-8")


def main() -> None:
    p = argparse.ArgumentParser(description="Write assistant action row via Spark DataFrame")
    p.add_argument("--table", required=True)
    p.add_argument("--action-id", required=True)
    p.add_argument("--created-at-utc", required=True)
    p.add_argument("--task-type", required=True)
    p.add_argument("--release-name", default="")
    p.add_argument("--objective-b64", default="")
    p.add_argument("--step-id", required=True)
    p.add_argument("--step-title-b64", default="")
    p.add_argument("--action-type", required=True)
    p.add_argument("--requires-approval", default="false")
    p.add_argument("--approved", default="false")
    p.add_argument("--status", required=True)
    p.add_argument("--output-b64", default="")
    p.add_argument("--error-b64", default="")
    args = p.parse_args()

    requires_approval = str(args.requires_approval).lower() == "true"
    approved = str(args.approved).lower() == "true"
    objective = d(args.objective_b64)
    step_title = d(args.step_title_b64)
    output_json = d(args.output_b64)
    error_text = d(args.error_b64)
    if not output_json:
        output_json = "{}"
    try:
        json.loads(output_json)
    except Exception:
        output_json = "{}"

    spark = SparkSession.builder.appName("write-assistant-action").getOrCreate()
    spark.sql(
        f"""
        CREATE TABLE IF NOT EXISTS {args.table} (
            action_id STRING,
            created_at_utc STRING,
            task_type STRING,
            release_name STRING,
            objective STRING,
            step_id STRING,
            step_title STRING,
            action_type STRING,
            requires_approval BOOLEAN,
            approved BOOLEAN,
            status STRING,
            output_json STRING,
            error_text STRING
        ) USING iceberg
        """
    )

    schema = T.StructType(
        [
            T.StructField("action_id", T.StringType(), False),
            T.StructField("created_at_utc", T.StringType(), False),
            T.StructField("task_type", T.StringType(), False),
            T.StructField("release_name", T.StringType(), True),
            T.StructField("objective", T.StringType(), True),
            T.StructField("step_id", T.StringType(), False),
            T.StructField("step_title", T.StringType(), True),
            T.StructField("action_type", T.StringType(), False),
            T.StructField("requires_approval", T.BooleanType(), False),
            T.StructField("approved", T.BooleanType(), False),
            T.StructField("status", T.StringType(), False),
            T.StructField("output_json", T.StringType(), True),
            T.StructField("error_text", T.StringType(), True),
        ]
    )
    row = [
        (
            args.action_id,
            args.created_at_utc,
            args.task_type,
            args.release_name or "",
            objective,
            args.step_id,
            step_title,
            args.action_type,
            requires_approval,
            approved,
            args.status,
            output_json,
            error_text,
        )
    ]
    df = spark.createDataFrame(row, schema=schema)
    df.writeTo(args.table).append()
    print(f"[DONE] Recorded assistant action {args.action_id} into {args.table}")


if __name__ == "__main__":
    main()
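The writer above decodes `--objective-b64`, `--step-title-b64`, `--output-b64`, and `--error-b64` through its `d()` helper, which implies the caller base64-encodes free text so newlines and quotes survive the trip through spark-submit's argv. A minimal sketch of the encoding counterpart (the helper name `e` is illustrative; only `d` appears in the source):

```python
import base64


def e(s: str) -> str:
    # Caller-side counterpart (illustrative) to the writer's d() helper:
    # base64-encode free text so it passes safely as a shell/argv argument.
    if not s:
        return ""
    return base64.b64encode(s.encode("utf-8")).decode("ascii")


def d(s: str) -> str:
    # Writer-side decode, as in write_assistant_action.py.
    if not s:
        return ""
    return base64.b64decode(s.encode("ascii")).decode("utf-8")
```

Round-tripping through `e`/`d` preserves multi-line, quoted text exactly, which plain argv interpolation would not guarantee.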
103
write_assistant_feedback.py
Normal file
@@ -0,0 +1,103 @@
import argparse
import base64
import json

from pyspark.sql import SparkSession, types as T


def d(s: str) -> str:
    if not s:
        return ""
    return base64.b64decode(s.encode("ascii")).decode("utf-8")


def main() -> None:
    p = argparse.ArgumentParser(description="Write assistant feedback row via Spark DataFrame")
    p.add_argument("--table", required=True)
    p.add_argument("--feedback-id", required=True)
    p.add_argument("--created-at-utc", required=True)
    p.add_argument("--outcome", required=True)
    p.add_argument("--task-type", required=True)
    p.add_argument("--release-name", default="")
    p.add_argument("--confidence", type=float, default=0.0)
    p.add_argument("--needs-review", default="true")
    p.add_argument("--goal-b64", default="")
    p.add_argument("--draft-b64", default="")
    p.add_argument("--final-b64", default="")
    p.add_argument("--sources-b64", default="")
    p.add_argument("--notes-b64", default="")
    args = p.parse_args()

    needs_review = str(args.needs_review).lower() == "true"
    goal = d(args.goal_b64)
    draft_text = d(args.draft_b64)
    final_text = d(args.final_b64)
    sources_json = d(args.sources_b64)
    notes = d(args.notes_b64)
    if not sources_json:
        sources_json = "[]"
    # Validate JSON shape but keep raw string in table.
    try:
        json.loads(sources_json)
    except Exception:
        sources_json = "[]"

    spark = SparkSession.builder.appName("write-assistant-feedback").getOrCreate()
    spark.sql(
        f"""
        CREATE TABLE IF NOT EXISTS {args.table} (
            feedback_id STRING,
            created_at_utc STRING,
            outcome STRING,
            task_type STRING,
            release_name STRING,
            confidence DOUBLE,
            needs_review BOOLEAN,
            goal STRING,
            draft_text STRING,
            final_text STRING,
            sources_json STRING,
            notes STRING
        ) USING iceberg
        """
    )

    schema = T.StructType(
        [
            T.StructField("feedback_id", T.StringType(), False),
            T.StructField("created_at_utc", T.StringType(), False),
            T.StructField("outcome", T.StringType(), False),
            T.StructField("task_type", T.StringType(), False),
            T.StructField("release_name", T.StringType(), True),
            T.StructField("confidence", T.DoubleType(), True),
            T.StructField("needs_review", T.BooleanType(), False),
            T.StructField("goal", T.StringType(), True),
            T.StructField("draft_text", T.StringType(), True),
            T.StructField("final_text", T.StringType(), True),
            T.StructField("sources_json", T.StringType(), True),
            T.StructField("notes", T.StringType(), True),
        ]
    )
    row = [
        (
            args.feedback_id,
            args.created_at_utc,
            args.outcome,
            args.task_type,
            args.release_name or "",
            float(args.confidence),
            needs_review,
            goal,
            draft_text,
            final_text,
            sources_json,
            notes,
        )
    ]
    df = spark.createDataFrame(row, schema=schema)
    df.writeTo(args.table).append()
    print(f"[DONE] Recorded assistant feedback {args.feedback_id} into {args.table}")


if __name__ == "__main__":
    main()
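Both writer scripts apply the same defensive pattern to pass-through JSON fields (`output_json`, `sources_json`): parse to validate, keep the raw string if it parses, and substitute an empty JSON value otherwise so a malformed payload never fails the row write. A small sketch of that pattern, factored into a helper (the function name is illustrative, not from the source):

```python
import json


def safe_json(raw: str, empty: str) -> str:
    # Keep the raw string when it is valid JSON; otherwise fall back to an
    # empty JSON value ("[]" or "{}") so the table write still succeeds.
    if not raw:
        return empty
    try:
        json.loads(raw)
        return raw
    except Exception:
        return empty
```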