chore: bootstrap assistant platform baseline

Carl Niklas Rydberg 2026-02-14 21:10:26 +01:00
commit 912f8ebc56
38 changed files with 6302 additions and 0 deletions

28
.gitignore vendored Normal file

@@ -0,0 +1,28 @@
# Python
__pycache__/
*.py[cod]
*.so
.venv/
venv/
# Env/secrets
.env
.env.*
*.key
*.pem
# Local/runtime
logs/
*.log
runs.db
# OS/editor
.DS_Store
.vscode/
.idea/
# Build/temp
build/
dist/
.tmp/
tmp/

27
MESSAGES_RELEASE_FLOW.md Normal file

@@ -0,0 +1,27 @@
# Messages Release Flow
This flow creates a Nessie tag for `lake.db1.messages`, generates a manifest JSON, and appends a row to `lake.db1.releases_v2`.
## Run on lakehouse-core
```bash
ssh niklas@lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./create-messages-release-via-spark-container.sh'
```
## Custom release name
```bash
ssh niklas@lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./create-messages-release-via-spark-container.sh rel_2026-02-14_messages-v1'
```
## Outputs
- Manifest file written to `./manifests/<release_name>.json`
- Nessie tag `<release_name>` created at current `main` hash (or reused if already present)
- Registry row appended to `lake.db1.releases_v2`
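As a quick sanity check on the outputs above, the generated manifest can be inspected with a few lines of Python (field names follow the `lakehouse-release-manifest/v1` layout used in this repo; the inline manifest here is a minimal illustrative stand-in for the file under `./manifests/`):

```python
import json

# Minimal manifest shaped like lakehouse-release-manifest/v1 (values illustrative).
manifest_json = json.dumps({
    "schema_version": "lakehouse-release-manifest/v1",
    "release": {"name": "rel_2026-02-14_messages-v1"},
    "nessie": {"ref": {"type": "tag", "name": "rel_2026-02-14_messages-v1", "hash": "abc123"}},
    "tables": [{"identifier": "lake.db1.messages", "current_snapshot_id": 1}],
})

manifest = json.loads(manifest_json)
release_name = manifest["release"]["name"]
tag = manifest["nessie"]["ref"]["name"]
# The Nessie tag is named after the release.
assert tag == release_name
```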
## Verify
```bash
ssh niklas@lakehouse-core.rakeroots.lan "docker exec spark /opt/spark/bin/spark-sql --properties-file /opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf --packages 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5' -e \"SELECT release_name, table_identifier, snapshot_id, created_at_utc FROM lake.db1.releases_v2 WHERE table_identifier='lake.db1.messages' ORDER BY created_at_utc DESC LIMIT 10\""
```

23
MESSAGES_SCHEMA.md Normal file

@@ -0,0 +1,23 @@
# Messages Schema
Creates the Iceberg table `lake.db1.messages` with the following ingest fields:
- `thread_id` STRING
- `message_id` STRING
- `sender` STRING
- `channel` STRING
- `sent_at` TIMESTAMP
- `body` STRING
- `metadata_json` STRING
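For reference, a single message shaped for this schema might look like the sketch below (values are illustrative; note that `metadata_json` is a JSON string, not a nested object):

```python
import json
from datetime import datetime, timezone

# Illustrative message matching the lake.db1.messages ingest fields.
message = {
    "thread_id": "thread-001",
    "message_id": "msg-0001",
    "sender": "niklas",
    "channel": "cli",
    "sent_at": datetime(2026, 2, 14, 20, 0, tzinfo=timezone.utc).isoformat(),
    "body": "Hello, lakehouse.",
    # Stored as a JSON string in the metadata_json column.
    "metadata_json": json.dumps({"source": "example"}),
}
print(json.dumps(message, indent=2))
```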
## Run on lakehouse-core
```bash
ssh niklas@lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./create-messages-table-via-spark-container.sh'
```
## Verify
```bash
ssh niklas@lakehouse-core.rakeroots.lan "docker exec spark /opt/spark/bin/spark-sql --properties-file /opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf --packages 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5' -e 'DESCRIBE TABLE lake.db1.messages'"
```

142
PROJECTOR_USAGE.md Normal file

@@ -0,0 +1,142 @@
# Release Projector
`release_projector.py` rebuilds serving projections (JanusGraph + Elasticsearch) from a lakehouse release manifest.
## What it does
1. Loads a release manifest JSON (or a `releases_v2` row containing `manifest_json`).
2. Resolves Nessie tag/ref from the manifest (or `--nessie-ref`).
3. Reads the concept Iceberg table from that ref through Spark + Iceberg + Nessie.
4. Upserts each concept into JanusGraph and Elasticsearch.
`release_projector.py` now accepts both concept-shaped rows and document-shaped rows.
For docs tables, it auto-detects typical columns:
- name: `canonical_name|title|name|subject`
- id: `concept_id|doc_id|document_id|id|uuid`
- summary text: `summary|description|abstract|content|text|body`
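The auto-detection described above amounts to a first-match scan over candidate column names. A minimal sketch (the helper name is hypothetical; the priority lists are taken from this doc):

```python
NAME_CANDIDATES = ["canonical_name", "title", "name", "subject"]
ID_CANDIDATES = ["concept_id", "doc_id", "document_id", "id", "uuid"]
SUMMARY_CANDIDATES = ["summary", "description", "abstract", "content", "text", "body"]

def pick_column(columns, candidates):
    """Return the first candidate present in the table's columns, else None."""
    cols = set(columns)
    for cand in candidates:
        if cand in cols:
            return cand
    return None

# A docs-shaped table resolves like so:
columns = ["doc_id", "title", "content", "created_at"]
assert pick_column(columns, NAME_CANDIDATES) == "title"
assert pick_column(columns, ID_CANDIDATES) == "doc_id"
assert pick_column(columns, SUMMARY_CANDIDATES) == "content"
```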
## Prerequisites
- Python deps: `python-dotenv`, `httpx`, `gremlinpython`, `pyspark`
- Spark/Iceberg/Nessie jars (default package coordinates are baked into the script)
- Network access to:
- Nessie API (example: `http://lakehouse-core:19120/api/v2`)
- MinIO S3 endpoint (example: `http://lakehouse-core:9000`)
- JanusGraph Gremlin endpoint
- Elasticsearch endpoint
## Recommended isolated env
Do not install the projector dependencies into the system Python; use the Spark container (preferred, below) or a local virtual environment instead.
## Preferred: existing spark container on lakehouse-core
This reuses your existing `spark` container and Spark properties file.
Standard command (frozen):
```bash
./run-projector-standard.sh
```
Run by release name (no manifest path):
```bash
./run-projector-standard.sh --release-name rel_2026-02-14_docs-v1
```
Standard dry-run:
```bash
./run-projector-standard.sh --dry-run
```
Copy the project files to `lakehouse-core`:
```bash
rsync -av --delete /home/niklas/projects/jecio/ lakehouse-core.rakeroots.lan:/tmp/jecio/
```
Run dry-run projection inside `spark` container:
```bash
ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-via-spark-container.sh ./manifests/rel_2026-02-14_docs-v1.json lake.db1.docs --dry-run es'
```
Run publish projection (writes Janus/ES):
```bash
ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-standard.sh'
```
`run-projector-via-spark-container.sh` uses:
- container: `spark` (override with `SPARK_CONTAINER_NAME`)
- properties file: `/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf` (override with `SPARK_PROPS`)
- Spark packages: Iceberg + Nessie extensions (override with `SPARK_PACKAGES`)
- arg4 `targets`: `es|gremlin|both` (default `both`)
- arg5 `release_name`: optional; if set, loads manifest from `releases_v2`
Direct projector usage:
```bash
python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets both
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets both
```
Local setup (fallback):
```bash
./setup_local_env.sh .venv-projector
source .venv-projector/bin/activate
```
Remote setup (fallback, venv on `lakehouse-core`):
```bash
scp release_projector.py requirements-projector.txt manifests/rel_2026-02-14_docs-v1.json lakehouse-core.rakeroots.lan:/tmp/
ssh lakehouse-core.rakeroots.lan 'python3 -m venv /tmp/jecio-projector-venv && /tmp/jecio-projector-venv/bin/pip install --upgrade pip && /tmp/jecio-projector-venv/bin/pip install -r /tmp/requirements-projector.txt'
```
## Required env vars (example)
```bash
export NESSIE_URI=http://lakehouse-core:19120/api/v2
export NESSIE_WAREHOUSE=s3a://lakehouse/warehouse
export S3_ENDPOINT=http://lakehouse-core:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export GREMLIN_URL=ws://janus.rakeroots.lan:8182/gremlin
export ES_URL=http://janus.rakeroots.lan:9200
export ES_INDEX=concepts
```
## Run
```bash
/tmp/jecio-projector-venv/bin/python /tmp/release_projector.py \
--manifest-file /tmp/rel_2026-02-14_docs-v1.json \
--concept-table lake.db1.docs \
--dry-run
```
Or local:
```bash
python3 release_projector.py \
--manifest-file /path/to/release.json \
--concept-table lake.db1.concepts
```
If the manifest carries a Nessie tag (for example under `nessie.ref`), you can omit `--nessie-ref`.
Dry run:
```bash
python3 release_projector.py \
--manifest-file /path/to/release.json \
--concept-table lake.db1.concepts \
--dry-run
```
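Since `create_release_manifest.py` records a `manifest_sha256` over the exact manifest text in `releases_v2`, a manifest file can be checked against the registry before projecting. A sketch (the manifest text here is illustrative; compare the result with the `manifest_sha256` column of the release row):

```python
import hashlib

def manifest_sha256(manifest_text: str) -> str:
    # Matches create_release_manifest.py: SHA-256 over the UTF-8 manifest text.
    return hashlib.sha256(manifest_text.encode("utf-8")).hexdigest()

manifest_text = '{"schema_version": "lakehouse-release-manifest/v1"}'
digest = manifest_sha256(manifest_text)
# Compare `digest` with the manifest_sha256 value stored in releases_v2.
print(digest)
```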

3227
app.py Normal file

File diff suppressed because it is too large

126
connectivity_check.py Normal file

@@ -0,0 +1,126 @@
import os
import sys
import requests
from dotenv import load_dotenv
# Optional: only needed for Gremlin websocket test
try:
import websocket
HAS_WEBSOCKET = True
except ImportError:
HAS_WEBSOCKET = False
def ok(msg):
print(f"[ OK ] {msg}")
def fail(msg):
print(f"[FAIL] {msg}")
def load_env():
load_dotenv()
ok("Loaded .env file")
def test_http(name, url, path="", method="GET", json_body=None):
full_url = url.rstrip("/") + path
try:
resp = requests.request(
method,
full_url,
json=json_body,
timeout=5,
)
if resp.status_code < 400:
ok(f"{name} reachable ({resp.status_code}) → {full_url}")
return True
else:
fail(f"{name} error ({resp.status_code}) → {full_url}")
except Exception as e:
fail(f"{name} unreachable → {full_url} ({e})")
return False
def test_gremlin_ws(url):
if not HAS_WEBSOCKET:
fail("Gremlin test skipped (websocket-client not installed)")
return False
try:
ws = websocket.create_connection(url, timeout=5)
ws.close()
ok(f"Gremlin websocket reachable → {url}")
return True
except Exception as e:
fail(f"Gremlin websocket unreachable → {url} ({e})")
return False
def main():
load_env()
GREMLIN_URL = os.getenv("GREMLIN_URL", "ws://localhost:8182/gremlin")
ES_URL = os.getenv("ES_URL", "http://localhost:9200")
ES_INDEX = os.getenv("ES_INDEX", "concepts")
IPFS_API = os.getenv("IPFS_API", "http://localhost:5001")
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.1:8b")
OLLAMA_EMBED_MODEL = os.getenv("OLLAMA_EMBED_MODEL", "nomic-embed-text")
print("\n=== Connectivity checks ===\n")
# Gremlin
test_gremlin_ws(GREMLIN_URL)
# Elasticsearch root
test_http("Elasticsearch", ES_URL)
# Elasticsearch index existence
test_http(
"Elasticsearch index",
ES_URL,
path=f"/{ES_INDEX}",
method="HEAD",
)
# IPFS (Kubo)
test_http(
"IPFS API",
IPFS_API,
path="/api/v0/version",
method="POST",
)
# Ollama base
test_http(
"Ollama",
OLLAMA_URL,
path="/api/tags",
)
# Ollama model availability (best-effort)
try:
resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
models = [m["name"] for m in resp.json().get("models", [])]
if OLLAMA_MODEL in models:
ok(f"Ollama model available → {OLLAMA_MODEL}")
else:
fail(f"Ollama model NOT found → {OLLAMA_MODEL}")
if OLLAMA_EMBED_MODEL in models:
ok(f"Ollama embed model available → {OLLAMA_EMBED_MODEL}")
else:
fail(f"Ollama embed model NOT found → {OLLAMA_EMBED_MODEL}")
except Exception as e:
fail(f"Ollama model check failed ({e})")
print("\n=== Done ===\n")
if __name__ == "__main__":
main()

@@ -0,0 +1,47 @@
#!/usr/bin/env bash
set -euo pipefail
RELEASE_NAME="${1:-rel_$(date -u +%Y-%m-%d)_messages-v1}"
TABLE="${MESSAGES_TABLE:-lake.db1.messages}"
MANIFEST_LOCAL="${2:-./manifests/${RELEASE_NAME}.json}"
DESCRIPTION="${RELEASE_DESCRIPTION:-Messages release for ${TABLE}}"
CREATED_BY="${RELEASE_CREATED_BY:-${USER:-unknown}}"
NESSIE_URI="${NESSIE_URI:-http://nessie:19120/api/v2}"
RELEASES_TABLE="${RELEASES_TABLE:-lake.db1.releases_v2}"
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./create_release_manifest.py}"
SCRIPT_REMOTE="/tmp/create_release_manifest.py"
MANIFEST_REMOTE="/tmp/${RELEASE_NAME}.json"
if [[ ! -f "$SCRIPT_LOCAL" ]]; then
echo "create_release_manifest.py not found at: $SCRIPT_LOCAL" >&2
exit 1
fi
mkdir -p "$(dirname "$MANIFEST_LOCAL")"
docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-submit \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
"$SCRIPT_REMOTE" \
--release-name "$RELEASE_NAME" \
--table "$TABLE" \
--nessie-uri "$NESSIE_URI" \
--manifest-out "$MANIFEST_REMOTE" \
--description "$DESCRIPTION" \
--created-by "$CREATED_BY" \
--releases-table "$RELEASES_TABLE"
docker cp "$CONTAINER_NAME":"$MANIFEST_REMOTE" "$MANIFEST_LOCAL"
echo "[DONE] Saved manifest: $MANIFEST_LOCAL"

@@ -0,0 +1,35 @@
#!/usr/bin/env bash
set -euo pipefail
# Creates Iceberg table for assistant message ingest.
# Default table: lake.db1.messages
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
MESSAGES_TABLE="${MESSAGES_TABLE:-lake.db1.messages}"
SQL="
CREATE NAMESPACE IF NOT EXISTS lake.db1;
CREATE TABLE IF NOT EXISTS ${MESSAGES_TABLE} (
thread_id STRING,
message_id STRING,
sender STRING,
channel STRING,
sent_at TIMESTAMP,
body STRING,
metadata_json STRING
)
USING iceberg
PARTITIONED BY (days(sent_at));
"
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-sql \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
-e "$SQL"

279
create_release_manifest.py Normal file

@@ -0,0 +1,279 @@
import argparse
import hashlib
import json
import os
import urllib.error
import urllib.parse
import urllib.request
from datetime import datetime, timezone
from pyspark.sql import SparkSession
from pyspark.sql import types as T
def now_iso() -> str:
return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace('+00:00', 'Z')
def http_json(method: str, url: str, payload: dict | None = None) -> dict:
data = json.dumps(payload).encode("utf-8") if payload is not None else None
req = urllib.request.Request(url, data=data, method=method)
req.add_header("Content-Type", "application/json")
with urllib.request.urlopen(req, timeout=30) as resp:
body = resp.read().decode("utf-8")
return json.loads(body) if body else {}
def get_ref(nessie_uri: str, ref_name: str) -> dict | None:
try:
return http_json("GET", f"{nessie_uri.rstrip('/')}/trees/{urllib.parse.quote(ref_name, safe='')}")
except urllib.error.HTTPError as e:
if e.code == 404:
return None
raise
def extract_ref_hash(ref_obj: dict) -> str:
# Nessie responses can vary by endpoint/version:
# - {"type":"BRANCH","name":"main","hash":"..."}
# - {"reference":{"type":"BRANCH","name":"main","hash":"..."}}
if isinstance(ref_obj.get("hash"), str) and ref_obj["hash"]:
return ref_obj["hash"]
reference = ref_obj.get("reference")
if isinstance(reference, dict) and isinstance(reference.get("hash"), str) and reference["hash"]:
return reference["hash"]
raise KeyError("hash")
def ensure_tag(nessie_uri: str, tag_name: str) -> dict:
existing = get_ref(nessie_uri, tag_name)
if existing is not None:
return existing
main_ref = http_json("GET", f"{nessie_uri.rstrip('/')}/trees/main")
payload = {
"type": "BRANCH",
"name": "main",
"hash": extract_ref_hash(main_ref),
}
query = urllib.parse.urlencode({"name": tag_name, "type": "TAG"})
http_json("POST", f"{nessie_uri.rstrip('/')}/trees?{query}", payload)
created = get_ref(nessie_uri, tag_name)
if created is None:
raise RuntimeError(f"Tag creation appeared to succeed but tag '{tag_name}' is not retrievable")
return created
def create_registry_table_if_missing(spark: SparkSession, releases_table: str) -> None:
spark.sql(
f"""
CREATE TABLE IF NOT EXISTS {releases_table} (
release_name STRING,
ref_type STRING,
ref_name STRING,
ref_hash STRING,
created_at_utc STRING,
ingested_at_utc STRING,
table_identifier STRING,
snapshot_id BIGINT,
metadata_location STRING,
manifest_sha256 STRING,
manifest_json STRING
) USING iceberg
"""
)
def _to_utc_datetime(value: str):
# Accept ISO strings with 'Z' suffix.
return datetime.fromisoformat(value.replace("Z", "+00:00")).astimezone(timezone.utc)
def _convert_value_for_type(field: T.StructField, value):
if value is None:
return None
dt = field.dataType
if isinstance(dt, T.StringType):
return str(value)
if isinstance(dt, T.LongType):
return int(value)
if isinstance(dt, T.IntegerType):
return int(value)
if isinstance(dt, T.ShortType):
return int(value)
if isinstance(dt, T.ByteType):
return int(value)
if isinstance(dt, T.BooleanType):
return bool(value)
if isinstance(dt, T.FloatType):
return float(value)
if isinstance(dt, T.DoubleType):
return float(value)
if isinstance(dt, T.TimestampType):
if isinstance(value, datetime):
return value
return _to_utc_datetime(str(value))
if isinstance(dt, T.DateType):
if isinstance(value, datetime):
return value.date()
return _to_utc_datetime(str(value)).date()
# Leave unsupported/complex types as-is; Spark can still validate and fail clearly.
return value
def append_registry_row(
spark: SparkSession,
releases_table: str,
release_name: str,
ref_type: str,
ref_name: str,
ref_hash: str,
created_at_utc: str,
ingested_at_utc: str,
table_identifier: str,
snapshot_id: int,
metadata_location: str,
manifest_sha256: str,
manifest_json: str,
created_by: str,
description: str,
) -> None:
target_schema = spark.table(releases_table).schema
base_values = {
"release_name": release_name,
"ref_type": ref_type,
"ref_name": ref_name,
"ref_hash": ref_hash,
"created_at_utc": created_at_utc,
"ingested_at_utc": ingested_at_utc,
"table_identifier": table_identifier,
"snapshot_id": int(snapshot_id),
"metadata_location": metadata_location,
"manifest_sha256": manifest_sha256,
"manifest_json": manifest_json,
"created_by": created_by,
"description": description,
"release_description": description,
}
row_values = []
missing_required = []
for field in target_schema.fields:
name = field.name
if name in base_values:
value = _convert_value_for_type(field, base_values[name])
row_values.append(value)
continue
if field.nullable:
row_values.append(None)
continue
missing_required.append(name)
if missing_required:
raise RuntimeError(
"Cannot append to registry table "
f"{releases_table}. Missing required columns with no known mapping: {', '.join(missing_required)}"
)
df = spark.createDataFrame([tuple(row_values)], schema=target_schema)
df.writeTo(releases_table).append()
def main() -> None:
p = argparse.ArgumentParser(description="Create a release tag + manifest + registry row for a table.")
p.add_argument("--release-name", required=True)
p.add_argument("--table", default="lake.db1.messages")
p.add_argument("--nessie-uri", default=os.getenv("NESSIE_URI", "http://nessie:19120/api/v2"))
p.add_argument("--manifest-out", required=True)
p.add_argument("--description", default="Messages release")
p.add_argument("--created-by", default=os.getenv("USER", "unknown"))
p.add_argument("--releases-table", default=os.getenv("RELEASES_TABLE", "lake.db1.releases_v2"))
p.add_argument("--skip-registry", action="store_true")
args = p.parse_args()
created_at = now_iso()
tag_ref = ensure_tag(args.nessie_uri, args.release_name)
ref_hash = extract_ref_hash(tag_ref)
spark = SparkSession.builder.appName("create-release-manifest").getOrCreate()
snap_row = spark.sql(
f"SELECT snapshot_id FROM {args.table}.snapshots ORDER BY committed_at DESC LIMIT 1"
).collect()
if not snap_row:
raise RuntimeError(f"No snapshots found for table {args.table}")
snapshot_id = int(snap_row[0]["snapshot_id"])
meta_row = spark.sql(
f"SELECT file AS metadata_location FROM {args.table}.metadata_log_entries ORDER BY timestamp DESC LIMIT 1"
).collect()
if not meta_row:
raise RuntimeError(f"No metadata log entries found for table {args.table}")
metadata_location = str(meta_row[0]["metadata_location"])
manifest = {
"schema_version": "lakehouse-release-manifest/v1",
"release": {
"name": args.release_name,
"created_at_utc": created_at,
"created_by": args.created_by,
"description": args.description,
},
"nessie": {
"uri": args.nessie_uri,
"ref": {
"type": "tag",
"name": args.release_name,
"hash": ref_hash,
},
},
"tables": [
{
"identifier": args.table,
"format": "iceberg",
"current_snapshot_id": snapshot_id,
"metadata_location": metadata_location,
}
],
}
manifest_json = json.dumps(manifest, ensure_ascii=False, indent=2)
manifest_sha256 = hashlib.sha256(manifest_json.encode("utf-8")).hexdigest()
os.makedirs(os.path.dirname(args.manifest_out) or ".", exist_ok=True)
with open(args.manifest_out, "w", encoding="utf-8") as f:
f.write(manifest_json)
if not args.skip_registry:
create_registry_table_if_missing(spark, args.releases_table)
append_registry_row(
spark=spark,
releases_table=args.releases_table,
release_name=args.release_name,
ref_type="tag",
ref_name=args.release_name,
ref_hash=ref_hash,
created_at_utc=created_at,
ingested_at_utc=now_iso(),
table_identifier=args.table,
snapshot_id=snapshot_id,
metadata_location=metadata_location,
manifest_sha256=manifest_sha256,
manifest_json=manifest_json,
created_by=args.created_by,
description=args.description,
)
print(f"[INFO] release_name={args.release_name}")
print(f"[INFO] table={args.table}")
print(f"[INFO] ref_hash={ref_hash}")
print(f"[INFO] snapshot_id={snapshot_id}")
print(f"[INFO] manifest_out={args.manifest_out}")
if args.skip_registry:
print("[INFO] registry=skipped")
else:
print(f"[INFO] registry_table={args.releases_table}")
if __name__ == "__main__":
main()

@@ -0,0 +1,21 @@
FROM python:3.11-slim
ENV DEBIAN_FRONTEND=noninteractive \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1 \
SPARK_LOCAL_HOSTNAME=localhost \
SPARK_LOCAL_IP=127.0.0.1
RUN apt-get update \
&& apt-get install -y --no-install-recommends default-jre-headless ca-certificates \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements-projector.txt /app/requirements-projector.txt
RUN pip install --upgrade pip && pip install -r /app/requirements-projector.txt
COPY release_projector.py /app/release_projector.py
ENTRYPOINT ["python", "/app/release_projector.py"]

@@ -0,0 +1,41 @@
# Projector Container
Build on `lakehouse-core`:
```bash
docker build -t jecio/release-projector:0.1 -f docker/projector/Dockerfile /tmp/jecio
```
Dry-run:
```bash
docker run --rm --network host \
-e NESSIE_URI=http://lakehouse-core:19120/api/v2 \
-e NESSIE_WAREHOUSE=s3a://lakehouse/warehouse \
-e S3_ENDPOINT=http://lakehouse-core:9000 \
-e AWS_ACCESS_KEY_ID=minioadmin \
-e AWS_SECRET_ACCESS_KEY=minioadmin \
-v /tmp:/work \
jecio/release-projector:0.1 \
--manifest-file /work/rel_2026-02-14_docs-v1.json \
--concept-table lake.db1.docs \
--dry-run
```
Publish projection:
```bash
docker run --rm --network host \
-e NESSIE_URI=http://lakehouse-core:19120/api/v2 \
-e NESSIE_WAREHOUSE=s3a://lakehouse/warehouse \
-e S3_ENDPOINT=http://lakehouse-core:9000 \
-e AWS_ACCESS_KEY_ID=minioadmin \
-e AWS_SECRET_ACCESS_KEY=minioadmin \
-e GREMLIN_URL=ws://janus.rakeroots.lan:8182/gremlin \
-e ES_URL=http://janus.rakeroots.lan:9200 \
-e ES_INDEX=concepts \
-v /tmp:/work \
jecio/release-projector:0.1 \
--manifest-file /work/rel_2026-02-14_docs-v1.json \
--concept-table lake.db1.docs
```

@@ -0,0 +1,56 @@
#!/usr/bin/env bash
set -euo pipefail
TABLE="${1:-lake.db1.messages}"
THREAD_ID="${2:-}"
MESSAGE_ID="${3:-}"
SENDER="${4:-}"
CHANNEL="${5:-}"
SENT_AT="${6:-}"
BODY_B64="${7:-}"
METADATA_B64="${8:-}"
if [[ -z "$THREAD_ID" || -z "$MESSAGE_ID" || -z "$SENDER" || -z "$CHANNEL" || -z "$BODY_B64" ]]; then
echo "Usage: $0 <table> <thread_id> <message_id> <sender> <channel> <sent_at_or_empty> <body_b64> <metadata_json_b64>" >&2
exit 1
fi
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
BODY="$(printf '%s' "$BODY_B64" | base64 -d)"
METADATA_JSON="{}"
if [[ -n "$METADATA_B64" ]]; then
METADATA_JSON="$(printf '%s' "$METADATA_B64" | base64 -d)"
fi
sql_escape() {
printf "%s" "$1" | sed "s/'/''/g"
}
THREAD_ID_ESC="$(sql_escape "$THREAD_ID")"
MESSAGE_ID_ESC="$(sql_escape "$MESSAGE_ID")"
SENDER_ESC="$(sql_escape "$SENDER")"
CHANNEL_ESC="$(sql_escape "$CHANNEL")"
BODY_ESC="$(sql_escape "$BODY")"
METADATA_ESC="$(sql_escape "$METADATA_JSON")"
if [[ -n "$SENT_AT" ]]; then
SENT_AT_EXPR="TIMESTAMP '$(sql_escape "$SENT_AT")'"
else
SENT_AT_EXPR="current_timestamp()"
fi
SQL="INSERT INTO ${TABLE} (thread_id, message_id, sender, channel, sent_at, body, metadata_json) VALUES ('${THREAD_ID_ESC}', '${MESSAGE_ID_ESC}', '${SENDER_ESC}', '${CHANNEL_ESC}', ${SENT_AT_EXPR}, '${BODY_ESC}', '${METADATA_ESC}')"
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-sql \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
-e "$SQL"
echo "[DONE] Inserted message_id=${MESSAGE_ID} thread_id=${THREAD_ID} into ${TABLE}"
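The wrapper above expects the message body and metadata pre-encoded as base64 so they survive shell quoting. Building those two arguments from Python might look like this (values are illustrative):

```python
import base64
import json

body = "Hello from the assistant."
metadata = {"source": "example", "lang": "en"}

# These become the base64 positional args consumed by the insert script.
body_b64 = base64.b64encode(body.encode("utf-8")).decode("ascii")
metadata_b64 = base64.b64encode(
    json.dumps(metadata, sort_keys=True).encode("utf-8")
).decode("ascii")

print(body_b64)
print(metadata_b64)
```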

@@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -euo pipefail
TABLE="${1:-lake.db1.messages}"
DEDUPE_MODE="${2:-none}"
PAYLOAD_B64="${3:-}"
if [[ -z "$PAYLOAD_B64" ]]; then
echo "Usage: $0 <table> <dedupe_mode:none|message_id|thread_message> <payload_b64_json_array|@/path/to/payload.json>" >&2
exit 1
fi
if [[ "$DEDUPE_MODE" != "none" && "$DEDUPE_MODE" != "message_id" && "$DEDUPE_MODE" != "thread_message" ]]; then
echo "Invalid dedupe_mode: $DEDUPE_MODE (expected none|message_id|thread_message)" >&2
exit 1
fi
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./ingest_messages_batch.py}"
SCRIPT_REMOTE="/tmp/ingest_messages_batch.py"
if [[ ! -f "$SCRIPT_LOCAL" ]]; then
echo "ingest_messages_batch.py not found at: $SCRIPT_LOCAL" >&2
exit 1
fi
docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"
SPARK_ARGS=(
--table "$TABLE"
--dedupe-mode "$DEDUPE_MODE"
)
if [[ "${PAYLOAD_B64:0:1}" == "@" ]]; then
PAYLOAD_FILE_HOST="${PAYLOAD_B64:1}"
if [[ ! -f "$PAYLOAD_FILE_HOST" ]]; then
echo "Payload file not found: $PAYLOAD_FILE_HOST" >&2
exit 1
fi
PAYLOAD_FILE_REMOTE="/opt/spark/work-dir/ingest_messages_payload.json"
docker cp "$PAYLOAD_FILE_HOST" "$CONTAINER_NAME":"$PAYLOAD_FILE_REMOTE"
# Ensure spark user can read the file regardless of ownership from docker cp.
docker exec -u 0 "$CONTAINER_NAME" /bin/sh -lc "chmod 644 '$PAYLOAD_FILE_REMOTE' || true"
SPARK_ARGS+=(--payload-file "$PAYLOAD_FILE_REMOTE")
else
SPARK_ARGS+=(--payload-b64 "$PAYLOAD_B64")
fi
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-submit \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
"$SCRIPT_REMOTE" \
"${SPARK_ARGS[@]}"
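For the base64 form of the payload, the argument must decode to a JSON array of objects carrying the required fields (`thread_id`, `message_id`, `sender`, `channel`, `body`; `sent_at` and `metadata` are optional). A sketch of building it (rows are illustrative):

```python
import base64
import json

rows = [
    {
        "thread_id": "thread-001",
        "message_id": "msg-0001",
        "sender": "niklas",
        "channel": "cli",
        "body": "first message",
        "metadata": {"source": "example"},  # optional; defaults to {}
    },
    {
        "thread_id": "thread-001",
        "message_id": "msg-0002",
        "sender": "assistant",
        "channel": "cli",
        "body": "second message",
    },
]

payload_b64 = base64.b64encode(json.dumps(rows).encode("utf-8")).decode("ascii")
# Round-trips through the decode step in ingest_messages_batch.py:
decoded = json.loads(base64.b64decode(payload_b64))
assert decoded == rows
```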

139
ingest_messages_batch.py Normal file

@@ -0,0 +1,139 @@
import argparse
import base64
import json
from datetime import datetime, timezone
from typing import Any, Dict, List
from pyspark.sql import SparkSession, types as T
def now_iso() -> str:
return datetime.now(timezone.utc).isoformat()
def decode_payload(payload_b64: str) -> List[Dict[str, Any]]:
raw = base64.b64decode(payload_b64.encode("ascii")).decode("utf-8")
data = json.loads(raw)
if not isinstance(data, list):
raise ValueError("Payload must decode to a JSON array")
out: List[Dict[str, Any]] = []
for i, row in enumerate(data):
if not isinstance(row, dict):
raise ValueError(f"Row {i} must be a JSON object")
out.append(row)
return out
def normalize_rows(rows: List[Dict[str, Any]]) -> List[tuple]:
norm: List[tuple] = []
for i, r in enumerate(rows):
thread_id = str(r.get("thread_id") or "").strip()
message_id = str(r.get("message_id") or "").strip()
sender = str(r.get("sender") or "").strip()
channel = str(r.get("channel") or "").strip()
body = str(r.get("body") or "").strip()
if not thread_id or not message_id or not sender or not channel or not body:
raise ValueError(
f"Row {i} missing required fields. "
"Required: thread_id, message_id, sender, channel, body"
)
sent_at_raw = r.get("sent_at")
sent_at = str(sent_at_raw).strip() if sent_at_raw is not None else ""
metadata = r.get("metadata", {})
if not isinstance(metadata, dict):
metadata = {}
metadata_json = json.dumps(metadata, ensure_ascii=False, sort_keys=True)
norm.append((thread_id, message_id, sender, channel, sent_at, body, metadata_json))
return norm
def main() -> None:
p = argparse.ArgumentParser(description="Batch ingest messages into Iceberg table")
p.add_argument("--table", required=True)
p.add_argument(
"--dedupe-mode",
choices=["none", "message_id", "thread_message"],
default="none",
help="Optional dedupe strategy against existing target rows",
)
p.add_argument("--payload-b64")
p.add_argument("--payload-file")
args = p.parse_args()
if not args.payload_b64 and not args.payload_file:
raise ValueError("Provide either --payload-b64 or --payload-file")
if args.payload_b64 and args.payload_file:
raise ValueError("Provide only one of --payload-b64 or --payload-file")
if args.payload_file:
with open(args.payload_file, "r", encoding="utf-8") as f:
file_data = json.load(f)
if not isinstance(file_data, list):
raise ValueError("--payload-file must contain a JSON array")
rows = normalize_rows(file_data)
else:
rows = normalize_rows(decode_payload(args.payload_b64 or ""))
if not rows:
print("[INFO] No rows supplied; nothing to ingest.")
return
spark = SparkSession.builder.appName("ingest-messages-batch").getOrCreate()
schema = T.StructType(
[
T.StructField("thread_id", T.StringType(), False),
T.StructField("message_id", T.StringType(), False),
T.StructField("sender", T.StringType(), False),
T.StructField("channel", T.StringType(), False),
T.StructField("sent_at_raw", T.StringType(), True),
T.StructField("body", T.StringType(), False),
T.StructField("metadata_json", T.StringType(), False),
]
)
df = spark.createDataFrame(rows, schema=schema)
df.createOrReplaceTempView("_batch_messages")
base_select = """
SELECT
b.thread_id,
b.message_id,
b.sender,
b.channel,
CASE
WHEN b.sent_at_raw IS NULL OR TRIM(b.sent_at_raw) = '' THEN current_timestamp()
ELSE CAST(b.sent_at_raw AS TIMESTAMP)
END AS sent_at,
b.body,
b.metadata_json
FROM _batch_messages b
"""
if args.dedupe_mode == "none":
insert_select = base_select
elif args.dedupe_mode == "message_id":
insert_select = (
base_select
+ f" LEFT ANTI JOIN {args.table} t ON b.message_id = t.message_id"
)
else:
insert_select = (
base_select
+ f" LEFT ANTI JOIN {args.table} t ON b.thread_id = t.thread_id AND b.message_id = t.message_id"
)
spark.sql(
f"""
INSERT INTO {args.table} (thread_id, message_id, sender, channel, sent_at, body, metadata_json)
{insert_select}
"""
)
print(f"[INFO] rows_in={len(rows)}")
print(f"[INFO] dedupe_mode={args.dedupe_mode}")
print(f"[INFO] table={args.table}")
print(f"[INFO] ingested_at_utc={now_iso()}")
print(f"[DONE] Batch ingest finished for {args.table}")
if __name__ == "__main__":
main()

@@ -0,0 +1,42 @@
{
"schema_version": "lakehouse-release-manifest/v1",
"release": {
"name": "rel_2026-02-14_docs-v1",
"created_at_utc": "2026-02-14T09:48:38Z",
"created_by": "niklas",
"description": "First tagged release for lake.db1.docs"
},
"nessie": {
"uri": "http://lakehouse-core:19120/api/v2",
"ref": {
"type": "tag",
"name": "rel_2026-02-14_docs-v1",
"hash": "1b16b4c4f6e99d43a27a21712aab319c1840a415f36bc6bebb2c9d2a89f09ef0"
}
},
"warehouse": {
"bucket": "lakehouse",
"warehouse_path": "s3a://lakehouse/warehouse",
"s3_endpoint": "http://lakehouse-core:9000",
"region": "us-east-1"
},
"tables": [
{
"identifier": "lake.db1.docs",
"format": "iceberg",
"current_snapshot_id": 4212875880010474311,
"metadata_location": "s3a://lakehouse/warehouse/db1/docs_2693aab9-54ea-43a8-892b-a922fdfc063a/metadata/00001-64f23fb4-2cb3-45c5-9c20-e6c91c9d73ef.metadata.json"
}
],
"projection": {
"enabled": false,
"projection_id": null,
"targets": []
},
"artifacts": {
"ipfs": {
"pinned": false,
"cid": null
}
}
}

@@ -0,0 +1,39 @@
#!/usr/bin/env bash
set -euo pipefail
STATUS="${1:-}"
TASK_TYPE="${2:-}"
RELEASE_NAME="${3:-}"
STEP_ID="${4:-}"
ACTION_TYPE="${5:-}"
LIMIT="${6:-50}"
ACTION_TABLE="${ACTION_TABLE:-lake.db1.assistant_actions}"
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./query_assistant_actions.py}"
SCRIPT_REMOTE="/tmp/query_assistant_actions.py"
if [[ ! -f "$SCRIPT_LOCAL" ]]; then
echo "query_assistant_actions.py not found at: $SCRIPT_LOCAL" >&2
exit 1
fi
docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-submit \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
"$SCRIPT_REMOTE" \
--table "$ACTION_TABLE" \
--status "$STATUS" \
--task-type "$TASK_TYPE" \
--release-name "$RELEASE_NAME" \
--step-id "$STEP_ID" \
--action-type "$ACTION_TYPE" \
--limit "$LIMIT"


@ -0,0 +1,35 @@
#!/usr/bin/env bash
set -euo pipefail
OUTCOME="${1:-}"
TASK_TYPE="${2:-}"
RELEASE_NAME="${3:-}"
LIMIT="${4:-50}"
FEEDBACK_TABLE="${FEEDBACK_TABLE:-lake.db1.assistant_feedback}"
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./query_assistant_feedback.py}"
SCRIPT_REMOTE="/tmp/query_assistant_feedback.py"
if [[ ! -f "$SCRIPT_LOCAL" ]]; then
echo "query_assistant_feedback.py not found at: $SCRIPT_LOCAL" >&2
exit 1
fi
docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-submit \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
"$SCRIPT_REMOTE" \
--table "$FEEDBACK_TABLE" \
--outcome "$OUTCOME" \
--task-type "$TASK_TYPE" \
--release-name "$RELEASE_NAME" \
--limit "$LIMIT"


@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -euo pipefail
TASK_TYPE="${1:-}"
RELEASE_NAME="${2:-}"
OUTCOME="${3:-}"
GROUP_BY="${4:-both}"
LIMIT="${5:-100}"
FEEDBACK_TABLE="${FEEDBACK_TABLE:-lake.db1.assistant_feedback}"
if [[ "$GROUP_BY" != "task_type" && "$GROUP_BY" != "release_name" && "$GROUP_BY" != "both" ]]; then
echo "Invalid group_by: $GROUP_BY (expected task_type|release_name|both)" >&2
exit 1
fi
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./query_assistant_metrics.py}"
SCRIPT_REMOTE="/tmp/query_assistant_metrics.py"
if [[ ! -f "$SCRIPT_LOCAL" ]]; then
echo "query_assistant_metrics.py not found at: $SCRIPT_LOCAL" >&2
exit 1
fi
docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-submit \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
"$SCRIPT_REMOTE" \
--table "$FEEDBACK_TABLE" \
--task-type "$TASK_TYPE" \
--release-name "$RELEASE_NAME" \
--outcome "$OUTCOME" \
--group-by "$GROUP_BY" \
--limit "$LIMIT"


@ -0,0 +1,38 @@
#!/usr/bin/env bash
set -euo pipefail
HOST="${1:-}"
MAILBOX="${2:-}"
USERNAME="${3:-}"
TABLE="${4:-lake.db1.messages}"
if [[ -z "$HOST" || -z "$MAILBOX" || -z "$USERNAME" ]]; then
echo "Usage: $0 <host> <mailbox> <username> [table]" >&2
exit 1
fi
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./query_imap_checkpoint.py}"
SCRIPT_REMOTE="/tmp/query_imap_checkpoint.py"
if [[ ! -f "$SCRIPT_LOCAL" ]]; then
echo "query_imap_checkpoint.py not found at: $SCRIPT_LOCAL" >&2
exit 1
fi
docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-submit \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
"$SCRIPT_REMOTE" \
--table "$TABLE" \
--host "$HOST" \
--mailbox "$MAILBOX" \
--username "$USERNAME"


@ -0,0 +1,45 @@
import argparse
import json
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
def main() -> None:
p = argparse.ArgumentParser(description="Query assistant actions")
p.add_argument("--table", default=os.getenv("ACTION_TABLE", "lake.db1.assistant_actions"))
p.add_argument("--status", default="")
p.add_argument("--task-type", default="")
p.add_argument("--release-name", default="")
p.add_argument("--step-id", default="")
p.add_argument("--action-type", default="")
p.add_argument("--limit", type=int, default=50)
args = p.parse_args()
spark = SparkSession.builder.appName("query-assistant-actions").getOrCreate()
df = spark.table(args.table)
if args.status:
df = df.where(F.col("status") == args.status)
if args.task_type:
df = df.where(F.col("task_type") == args.task_type)
if args.release_name:
df = df.where(F.col("release_name") == args.release_name)
if args.step_id:
df = df.where(F.col("step_id") == args.step_id)
if args.action_type:
df = df.where(F.col("action_type") == args.action_type)
rows = (
df.orderBy(F.col("created_at_utc").desc_nulls_last())
.limit(max(1, min(args.limit, 500)))
.collect()
)
out = [r.asDict(recursive=True) for r in rows]
print(json.dumps(out, ensure_ascii=False))
if __name__ == "__main__":
main()
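Two conventions in the query scripts are worth calling out: an empty-string CLI value means "no filter", and the limit is clamped to `[1, 500]` before `collect()`. A pure-Python sketch of both (helper names ours, no Spark required):

```python
def clamp_limit(limit: int, lo: int = 1, hi: int = 500) -> int:
    # Mirrors max(1, min(args.limit, 500)) in the query scripts.
    return max(lo, min(limit, hi))


def matches(row: dict, filters: dict) -> bool:
    # Empty-string filter values mean "no constraint", matching the CLI defaults.
    return all(not v or row.get(k) == v for k, v in filters.items())


rows = [
    {"status": "done", "task_type": "summarize"},
    {"status": "failed", "task_type": "summarize"},
]
hits = [r for r in rows if matches(r, {"status": "done", "task_type": ""})]
```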


@ -0,0 +1,43 @@
import argparse
import json
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
def main() -> None:
p = argparse.ArgumentParser(description="Query assistant feedback rows")
p.add_argument("--table", default=os.getenv("FEEDBACK_TABLE", "lake.db1.assistant_feedback"))
p.add_argument("--outcome", default="")
p.add_argument("--task-type", default="")
p.add_argument("--release-name", default="")
p.add_argument("--limit", type=int, default=50)
args = p.parse_args()
spark = SparkSession.builder.appName("query-assistant-feedback").getOrCreate()
df = spark.table(args.table)
if args.outcome:
df = df.where(F.col("outcome") == args.outcome)
if args.task_type:
df = df.where(F.col("task_type") == args.task_type)
if args.release_name:
df = df.where(F.col("release_name") == args.release_name)
rows = (
df.orderBy(F.col("created_at_utc").desc_nulls_last())
.limit(max(1, min(args.limit, 500)))
.collect()
)
    out = [r.asDict(recursive=True) for r in rows]
print(json.dumps(out, ensure_ascii=False))
if __name__ == "__main__":
main()


@ -0,0 +1,57 @@
import argparse
import json
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
def main() -> None:
p = argparse.ArgumentParser(description="Query assistant feedback metrics")
p.add_argument("--table", default=os.getenv("FEEDBACK_TABLE", "lake.db1.assistant_feedback"))
p.add_argument("--task-type", default="")
p.add_argument("--release-name", default="")
p.add_argument("--outcome", default="")
p.add_argument("--group-by", choices=["task_type", "release_name", "both"], default="both")
p.add_argument("--limit", type=int, default=100)
args = p.parse_args()
spark = SparkSession.builder.appName("query-assistant-metrics").getOrCreate()
df = spark.table(args.table)
if args.task_type:
df = df.where(F.col("task_type") == args.task_type)
if args.release_name:
df = df.where(F.col("release_name") == args.release_name)
if args.outcome:
df = df.where(F.col("outcome") == args.outcome)
if args.group_by == "task_type":
group_cols = [F.col("task_type")]
elif args.group_by == "release_name":
group_cols = [F.col("release_name")]
else:
group_cols = [F.col("task_type"), F.col("release_name")]
agg = (
df.groupBy(*group_cols)
.agg(
F.count(F.lit(1)).alias("total"),
F.sum(F.when(F.col("outcome") == "accepted", F.lit(1)).otherwise(F.lit(0))).alias("accepted"),
F.sum(F.when(F.col("outcome") == "edited", F.lit(1)).otherwise(F.lit(0))).alias("edited"),
F.sum(F.when(F.col("outcome") == "rejected", F.lit(1)).otherwise(F.lit(0))).alias("rejected"),
F.avg(F.col("confidence")).alias("avg_confidence"),
)
.withColumn("accept_rate", F.when(F.col("total") > 0, F.col("accepted") / F.col("total")).otherwise(F.lit(0.0)))
.withColumn("edit_rate", F.when(F.col("total") > 0, F.col("edited") / F.col("total")).otherwise(F.lit(0.0)))
.withColumn("reject_rate", F.when(F.col("total") > 0, F.col("rejected") / F.col("total")).otherwise(F.lit(0.0)))
.orderBy(F.col("total").desc(), *[c.asc() for c in group_cols])
.limit(max(1, min(args.limit, 1000)))
)
rows = [r.asDict(recursive=True) for r in agg.collect()]
print(json.dumps(rows, ensure_ascii=False))
if __name__ == "__main__":
main()
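The aggregation above is just per-group outcome counts plus derived rates. A pure-Python equivalent over plain dicts (a sketch of the same logic without Spark; names ours) makes the math easy to check:

```python
from collections import defaultdict


def feedback_metrics(rows, group_by=("task_type", "release_name")):
    # Per-group counts of accepted/edited/rejected outcomes, plus rates,
    # mirroring the Spark groupBy/agg above.
    groups = defaultdict(lambda: {"total": 0, "accepted": 0, "edited": 0, "rejected": 0})
    for r in rows:
        g = groups[tuple(r.get(k) for k in group_by)]
        g["total"] += 1
        if r.get("outcome") in ("accepted", "edited", "rejected"):
            g[r["outcome"]] += 1
    for g in groups.values():
        t = g["total"]
        g["accept_rate"] = g["accepted"] / t if t else 0.0
        g["edit_rate"] = g["edited"] / t if t else 0.0
        g["reject_rate"] = g["rejected"] / t if t else 0.0
    return dict(groups)


metrics = feedback_metrics([
    {"task_type": "reply", "release_name": "r1", "outcome": "accepted"},
    {"task_type": "reply", "release_name": "r1", "outcome": "rejected"},
])
```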

43
query_imap_checkpoint.py Normal file

@ -0,0 +1,43 @@
import argparse
import json
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
def main() -> None:
p = argparse.ArgumentParser(description="Query latest IMAP UID checkpoint from messages table")
p.add_argument("--table", default=os.getenv("MESSAGES_TABLE", "lake.db1.messages"))
p.add_argument("--host", required=True)
p.add_argument("--mailbox", required=True)
p.add_argument("--username", required=True)
args = p.parse_args()
spark = SparkSession.builder.appName("query-imap-checkpoint").getOrCreate()
df = spark.table(args.table)
md = F.col("metadata_json")
uid_col = F.get_json_object(md, "$.imap_uid")
host_col = F.get_json_object(md, "$.host")
mailbox_col = F.get_json_object(md, "$.mailbox")
username_col = F.get_json_object(md, "$.username")
filtered = (
df.where(F.col("channel") == "email-imap")
.where(host_col == args.host)
.where(mailbox_col == args.mailbox)
.where((username_col == args.username) | username_col.isNull() | (username_col == ""))
.where(uid_col.isNotNull())
)
row = filtered.select(F.max(uid_col.cast("long")).alias("max_uid")).collect()
max_uid = None
if row and row[0]["max_uid"] is not None:
max_uid = int(row[0]["max_uid"])
print(json.dumps({"max_uid": max_uid}, ensure_ascii=False))
if __name__ == "__main__":
main()
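The checkpoint query keys on JSON fields inside `metadata_json` (`host`, `mailbox`, `username`, `imap_uid`) and treats a null or empty `username` as a match, so older rows without that field still count. The same logic in plain Python (a sketch; the function name is ours):

```python
import json


def max_imap_uid(rows, host, mailbox, username):
    # Scan metadata_json strings and return the highest imap_uid for the
    # given account, or None if no matching rows exist. A missing or empty
    # username in the metadata matches any caller, like the Spark filter.
    best = None
    for r in rows:
        if r.get("channel") != "email-imap":
            continue
        md = json.loads(r.get("metadata_json") or "{}")
        if md.get("host") != host or md.get("mailbox") != mailbox:
            continue
        if md.get("username") not in (username, None, ""):
            continue
        uid = md.get("imap_uid")
        if uid is not None:
            best = max(best or 0, int(uid))
    return best


rows = [
    {"channel": "email-imap",
     "metadata_json": json.dumps({"host": "mail", "mailbox": "INBOX",
                                  "username": "nik", "imap_uid": 41})},
    {"channel": "email-imap",
     "metadata_json": json.dumps({"host": "mail", "mailbox": "INBOX",
                                  "imap_uid": "42"})},
]
```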


@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -euo pipefail
ACTION_TABLE="${ACTION_TABLE:-lake.db1.assistant_actions}"
ACTION_ID="${1:-}"
CREATED_AT_UTC="${2:-}"
TASK_TYPE="${3:-}"
RELEASE_NAME="${4:-}"
OBJECTIVE_B64="${5:-}"
STEP_ID="${6:-}"
STEP_TITLE_B64="${7:-}"
ACTION_TYPE="${8:-}"
REQUIRES_APPROVAL="${9:-false}"
APPROVED="${10:-false}"
STATUS="${11:-}"
OUTPUT_B64="${12:-}"
ERROR_B64="${13:-}"
if [[ -z "$ACTION_ID" || -z "$CREATED_AT_UTC" || -z "$TASK_TYPE" || -z "$STEP_ID" || -z "$ACTION_TYPE" || -z "$STATUS" ]]; then
echo "Usage: $0 <action_id> <created_at_utc> <task_type> <release_name> <objective_b64> <step_id> <step_title_b64> <action_type> <requires_approval> <approved> <status> <output_b64> <error_b64>" >&2
exit 1
fi
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./write_assistant_action.py}"
SCRIPT_REMOTE="/tmp/write_assistant_action.py"
if [[ ! -f "$SCRIPT_LOCAL" ]]; then
echo "write_assistant_action.py not found at: $SCRIPT_LOCAL" >&2
exit 1
fi
docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-submit \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
"$SCRIPT_REMOTE" \
--table "$ACTION_TABLE" \
--action-id "$ACTION_ID" \
--created-at-utc "$CREATED_AT_UTC" \
--task-type "$TASK_TYPE" \
--release-name "$RELEASE_NAME" \
--objective-b64 "$OBJECTIVE_B64" \
--step-id "$STEP_ID" \
--step-title-b64 "$STEP_TITLE_B64" \
--action-type "$ACTION_TYPE" \
--requires-approval "$REQUIRES_APPROVAL" \
--approved "$APPROVED" \
--status "$STATUS" \
--output-b64 "$OUTPUT_B64" \
--error-b64 "$ERROR_B64"
echo "[DONE] Recorded assistant action ${ACTION_ID} into ${ACTION_TABLE}"
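Free-text arguments to these wrappers (`objective_b64`, `step_title_b64`, `output_b64`, ...) are base64-encoded by the caller so they survive the ssh and `docker exec` quoting layers intact. A minimal caller-side sketch (helper names ours):

```python
import base64


def b64_arg(text: str) -> str:
    # Encode a free-text CLI argument the way the *_b64 wrapper scripts expect.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_b64_arg(arg: str) -> str:
    # Decode, treating an empty argument as an empty string (as the
    # shell-side decode_b64 helper does).
    return base64.b64decode(arg).decode("utf-8") if arg else ""


objective = b64_arg("Summarize this week's messages; flag anything urgent.")
```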


@ -0,0 +1,58 @@
#!/usr/bin/env bash
set -euo pipefail
FEEDBACK_TABLE="${FEEDBACK_TABLE:-lake.db1.assistant_feedback}"
FEEDBACK_ID="${1:-}"
CREATED_AT_UTC="${2:-}"
OUTCOME="${3:-}"
TASK_TYPE="${4:-}"
RELEASE_NAME="${5:-}"
CONFIDENCE="${6:-0}"
NEEDS_REVIEW="${7:-true}"
GOAL_B64="${8:-}"
DRAFT_B64="${9:-}"
FINAL_B64="${10:-}"
SOURCES_B64="${11:-}"
NOTES_B64="${12:-}"
if [[ -z "$FEEDBACK_ID" || -z "$CREATED_AT_UTC" || -z "$OUTCOME" || -z "$TASK_TYPE" ]]; then
echo "Usage: $0 <feedback_id> <created_at_utc> <outcome> <task_type> <release_name> <confidence> <needs_review> <goal_b64> <draft_b64> <final_b64> <sources_b64> <notes_b64>" >&2
exit 1
fi
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./write_assistant_feedback.py}"
SCRIPT_REMOTE="/tmp/write_assistant_feedback.py"
if [[ ! -f "$SCRIPT_LOCAL" ]]; then
echo "write_assistant_feedback.py not found at: $SCRIPT_LOCAL" >&2
exit 1
fi
docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-submit \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
"$SCRIPT_REMOTE" \
--table "$FEEDBACK_TABLE" \
--feedback-id "$FEEDBACK_ID" \
--created-at-utc "$CREATED_AT_UTC" \
--outcome "$OUTCOME" \
--task-type "$TASK_TYPE" \
--release-name "$RELEASE_NAME" \
--confidence "$CONFIDENCE" \
--needs-review "$NEEDS_REVIEW" \
--goal-b64 "$GOAL_B64" \
--draft-b64 "$DRAFT_B64" \
--final-b64 "$FINAL_B64" \
--sources-b64 "$SOURCES_B64" \
--notes-b64 "$NOTES_B64"
echo "[DONE] Recorded assistant feedback ${FEEDBACK_ID} into ${FEEDBACK_TABLE}"


@ -0,0 +1,67 @@
#!/usr/bin/env bash
set -euo pipefail
# Args:
# 1 run_id
# 2 event_type
# 3 event_at_utc
# 4 detail_json_b64
RUN_ID="${1:-}"
EVENT_TYPE="${2:-}"
EVENT_AT_UTC="${3:-}"
DETAIL_JSON_B64="${4:-}"
if [[ -z "$RUN_ID" || -z "$EVENT_TYPE" || -z "$EVENT_AT_UTC" ]]; then
echo "usage: $0 <run_id> <event_type> <event_at_utc> <detail_json_b64>" >&2
exit 1
fi
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
RUN_EVENTS_TABLE="${RUN_EVENTS_TABLE:-lake.db1.run_events}"
decode_b64() {
local s="$1"
if [[ -z "$s" ]]; then
printf ""
return
fi
printf '%s' "$s" | base64 -d
}
escape_sql() {
sed "s/'/''/g"
}
DETAIL_JSON="$(decode_b64 "$DETAIL_JSON_B64" | escape_sql)"
RUN_ID_ESC="$(printf '%s' "$RUN_ID" | escape_sql)"
EVENT_TYPE_ESC="$(printf '%s' "$EVENT_TYPE" | escape_sql)"
EVENT_AT_ESC="$(printf '%s' "$EVENT_AT_UTC" | escape_sql)"
SQL="
CREATE TABLE IF NOT EXISTS ${RUN_EVENTS_TABLE} (
run_id STRING,
event_type STRING,
event_at_utc STRING,
detail_json STRING,
ingested_at_utc STRING
) USING iceberg;
INSERT INTO ${RUN_EVENTS_TABLE} VALUES (
'${RUN_ID_ESC}',
'${EVENT_TYPE_ESC}',
'${EVENT_AT_ESC}',
'${DETAIL_JSON}',
'${EVENT_AT_ESC}'
);
"
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-sql \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
-e "$SQL"
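The `escape_sql` helper above doubles single quotes, which is the standard escaping for SQL string literals; the values are still interpolated into the statement text, so this relies on Spark SQL treating `''` inside a quoted literal as a literal quote. The same escaping in Python (name ours; a sketch, not a substitute for real parameter binding where the client supports it):

```python
def sql_quote(value: str) -> str:
    # Double any single quotes and wrap in single quotes, matching the
    # shell escape_sql() above. Adequate for Spark SQL string literals.
    return "'" + value.replace("'", "''") + "'"


stmt = (
    "INSERT INTO lake.db1.run_events VALUES ("
    + sql_quote("run-1") + ", " + sql_quote("it's done") + ")"
)
```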


@ -0,0 +1,92 @@
#!/usr/bin/env bash
set -euo pipefail
# Args:
# 1 run_id
# 2 run_type
# 3 status
# 4 started_at_utc
# 5 finished_at_utc (or empty)
# 6 actor
# 7 input_json_b64
# 8 output_json_b64
# 9 error_text_b64
RUN_ID="${1:-}"
RUN_TYPE="${2:-}"
STATUS="${3:-}"
STARTED_AT_UTC="${4:-}"
FINISHED_AT_UTC="${5:-}"
ACTOR="${6:-}"
INPUT_JSON_B64="${7:-}"
OUTPUT_JSON_B64="${8:-}"
ERROR_TEXT_B64="${9:-}"
if [[ -z "$RUN_ID" || -z "$RUN_TYPE" || -z "$STATUS" || -z "$STARTED_AT_UTC" ]]; then
echo "usage: $0 <run_id> <run_type> <status> <started_at_utc> <finished_at_utc> <actor> <input_json_b64> <output_json_b64> <error_text_b64>" >&2
exit 1
fi
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
RUNS_TABLE="${RUNS_TABLE:-lake.db1.runs}"
decode_b64() {
local s="$1"
if [[ -z "$s" ]]; then
printf ""
return
fi
printf '%s' "$s" | base64 -d
}
escape_sql() {
sed "s/'/''/g"
}
INPUT_JSON="$(decode_b64 "$INPUT_JSON_B64" | escape_sql)"
OUTPUT_JSON="$(decode_b64 "$OUTPUT_JSON_B64" | escape_sql)"
ERROR_TEXT="$(decode_b64 "$ERROR_TEXT_B64" | escape_sql)"
RUN_ID_ESC="$(printf '%s' "$RUN_ID" | escape_sql)"
RUN_TYPE_ESC="$(printf '%s' "$RUN_TYPE" | escape_sql)"
STATUS_ESC="$(printf '%s' "$STATUS" | escape_sql)"
STARTED_ESC="$(printf '%s' "$STARTED_AT_UTC" | escape_sql)"
FINISHED_ESC="$(printf '%s' "$FINISHED_AT_UTC" | escape_sql)"
ACTOR_ESC="$(printf '%s' "$ACTOR" | escape_sql)"
SQL="
CREATE TABLE IF NOT EXISTS ${RUNS_TABLE} (
run_id STRING,
run_type STRING,
status STRING,
started_at_utc STRING,
finished_at_utc STRING,
actor STRING,
input_json STRING,
output_json STRING,
error_text STRING,
ingested_at_utc STRING
) USING iceberg;
INSERT INTO ${RUNS_TABLE} VALUES (
'${RUN_ID_ESC}',
'${RUN_TYPE_ESC}',
'${STATUS_ESC}',
'${STARTED_ESC}',
'${FINISHED_ESC}',
'${ACTOR_ESC}',
'${INPUT_JSON}',
'${OUTPUT_JSON}',
'${ERROR_TEXT}',
'${STARTED_ESC}'
);
"
docker exec \
-e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-sql \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
-e "$SQL"

607
release_projector.py Normal file

@ -0,0 +1,607 @@
import argparse
import hashlib
import json
import os
import urllib.error
import urllib.request
from datetime import date, datetime, timezone
from typing import Any, Dict, List, Optional
try:
from dotenv import load_dotenv
except Exception:
load_dotenv = None
DEFAULT_SPARK_PACKAGES = (
"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,"
"org.apache.iceberg:iceberg-aws-bundle:1.10.1,"
"org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5"
)
def utc_now_iso() -> str:
return datetime.now(timezone.utc).isoformat()
def parse_json_maybe(value: Any, expected_type: type, fallback: Any) -> Any:
if value is None:
return fallback
if isinstance(value, expected_type):
return value
if isinstance(value, str):
try:
parsed = json.loads(value)
if isinstance(parsed, expected_type):
return parsed
except Exception:
return fallback
return fallback
def first_str(row: Dict[str, Any], keys: List[str]) -> Optional[str]:
for key in keys:
val = row.get(key)
if isinstance(val, str) and val.strip():
return val.strip()
return None
def to_iso(value: Any) -> Optional[str]:
if isinstance(value, datetime):
return value.isoformat()
if isinstance(value, date):
return datetime.combine(value, datetime.min.time(), timezone.utc).isoformat()
if isinstance(value, str) and value.strip():
return value.strip()
return None
def make_fingerprint(name: str, kind: Optional[str], external_ids: Dict[str, str]) -> str:
norm = (name or "").strip().lower()
kind_norm = (kind or "").strip().lower()
ext = "|".join(f"{k}:{v}".lower() for k, v in sorted(external_ids.items()))
raw = f"{norm}|{kind_norm}|{ext}"
return hashlib.sha256(raw.encode("utf-8")).hexdigest()
def load_manifest(path: str) -> Dict[str, Any]:
with open(path, "r", encoding="utf-8") as f:
raw = json.load(f)
if isinstance(raw, dict):
manifest_json = raw.get("manifest_json")
if isinstance(manifest_json, str):
try:
parsed = json.loads(manifest_json)
if isinstance(parsed, dict):
return parsed
except Exception:
pass
return raw
if isinstance(raw, list) and raw and isinstance(raw[0], dict):
manifest_json = raw[0].get("manifest_json")
if isinstance(manifest_json, str):
parsed = json.loads(manifest_json)
if isinstance(parsed, dict):
return parsed
raise ValueError("Manifest file must contain a manifest object or releases_v2 row with manifest_json.")
def infer_manifest_ref(manifest: Dict[str, Any]) -> Optional[str]:
nessie = manifest.get("nessie")
if isinstance(nessie, dict):
ref_obj = nessie.get("ref")
if isinstance(ref_obj, dict):
ref_name = ref_obj.get("name")
if isinstance(ref_name, str) and ref_name.strip():
return ref_name.strip()
tag = nessie.get("tag")
if isinstance(tag, str) and tag.strip():
return tag.strip()
release_obj = manifest.get("release")
if isinstance(release_obj, dict):
release_name = release_obj.get("name")
if isinstance(release_name, str) and release_name.strip():
return release_name.strip()
for key in ("nessie_tag", "tag", "release_name"):
val = manifest.get(key)
if isinstance(val, str) and val.strip():
return val.strip()
return None
def extract_table_identifiers(manifest: Dict[str, Any]) -> List[str]:
out: List[str] = []
tables = manifest.get("tables")
if isinstance(tables, list):
for t in tables:
if not isinstance(t, dict):
continue
ident = t.get("table_identifier") or t.get("identifier") or t.get("table")
if isinstance(ident, str) and ident.strip():
out.append(ident.strip())
if out:
return out
rows = manifest.get("rows")
if isinstance(rows, list):
for row in rows:
if not isinstance(row, dict):
continue
ident = row.get("table_identifier")
if isinstance(ident, str) and ident.strip():
out.append(ident.strip())
return out
def infer_concept_table(tables: List[str]) -> Optional[str]:
for t in tables:
lower = t.lower()
if "concept" in lower:
return t
return tables[0] if tables else None
def load_manifest_from_registry(
spark: Any,
catalog: str,
release_name: str,
releases_table: Optional[str] = None,
) -> Dict[str, Any]:
from pyspark.sql import functions as F
table = releases_table or os.getenv("RELEASES_TABLE", "db1.releases_v2")
if table.count(".") == 1:
table = f"{catalog}.{table}"
row = (
spark.table(table)
.where(F.col("release_name") == release_name)
.orderBy(F.col("ingested_at_utc").desc_nulls_last())
.select("manifest_json")
.limit(1)
.collect()
)
if not row:
raise ValueError(f"Release '{release_name}' not found in registry table {table}.")
manifest_json = row[0]["manifest_json"]
if not isinstance(manifest_json, str) or not manifest_json.strip():
raise ValueError(f"Release '{release_name}' has empty manifest_json in {table}.")
manifest = json.loads(manifest_json)
if not isinstance(manifest, dict):
raise ValueError(f"Release '{release_name}' manifest_json is not a JSON object.")
return manifest
def build_spark(ref: str):
try:
from pyspark.sql import SparkSession
except Exception as e:
raise RuntimeError(
"pyspark is not installed. Install it or run this with spark-submit."
) from e
catalog = os.getenv("SPARK_CATALOG", "lake")
builder = (
SparkSession.builder.appName("release-projector")
.config(
"spark.sql.extensions",
"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
"org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
)
.config("spark.jars.packages", os.getenv("SPARK_PACKAGES", DEFAULT_SPARK_PACKAGES))
.config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog")
.config(f"spark.sql.catalog.{catalog}.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
.config(f"spark.sql.catalog.{catalog}.uri", os.getenv("NESSIE_URI", "http://lakehouse-core:19120/api/v2"))
.config(f"spark.sql.catalog.{catalog}.ref", ref)
.config(
f"spark.sql.catalog.{catalog}.warehouse",
os.getenv("NESSIE_WAREHOUSE", "s3a://lakehouse/warehouse"),
)
.config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
.config("spark.hadoop.fs.s3a.endpoint", os.getenv("S3_ENDPOINT", "http://lakehouse-core:9000"))
.config("spark.hadoop.fs.s3a.path.style.access", os.getenv("S3_PATH_STYLE", "true"))
.config(
"spark.hadoop.fs.s3a.access.key",
os.getenv("AWS_ACCESS_KEY_ID", os.getenv("MINIO_ROOT_USER", "minioadmin")),
)
.config(
"spark.hadoop.fs.s3a.secret.key",
os.getenv("AWS_SECRET_ACCESS_KEY", os.getenv("MINIO_ROOT_PASSWORD", "minioadmin")),
)
)
spark_master = os.getenv("SPARK_MASTER")
if spark_master:
builder = builder.master(spark_master)
return builder.getOrCreate(), catalog
def ensure_es_index(es_url: str, es_index: str) -> None:
mapping = {
"mappings": {
"properties": {
"concept_id": {"type": "keyword"},
"concept_type": {"type": "keyword"},
"display_name": {"type": "text"},
"description": {"type": "text"},
"text": {"type": "text"},
"source_table": {"type": "keyword"},
"source_pk": {"type": "keyword"},
"release_name": {"type": "keyword"},
"ref_hash": {"type": "keyword"},
"attributes_json": {"type": "text"},
"canonical_name": {"type": "text"},
"kind": {"type": "keyword"},
"aliases": {"type": "text"},
"tags": {"type": "keyword"},
"summary": {"type": "text"},
"latest_cid": {"type": "keyword"},
"fingerprint": {"type": "keyword"},
"created_at": {"type": "date"},
"updated_at": {"type": "date"},
}
}
}
url = f"{es_url.rstrip('/')}/{es_index}"
req_get = urllib.request.Request(url, method="GET")
try:
with urllib.request.urlopen(req_get, timeout=30) as resp:
if 200 <= resp.status < 300:
return
except urllib.error.HTTPError as e:
if e.code != 404:
raise
body = json.dumps(mapping).encode("utf-8")
req_put = urllib.request.Request(url, data=body, method="PUT")
req_put.add_header("Content-Type", "application/json")
with urllib.request.urlopen(req_put, timeout=30) as resp:
if resp.status >= 400:
raise RuntimeError(f"Failed to create ES index {es_index}: HTTP {resp.status}")
def es_upsert(es_url: str, es_index: str, doc: Dict[str, Any]) -> None:
url = f"{es_url.rstrip('/')}/{es_index}/_doc/{doc['concept_id']}"
body = json.dumps(doc, default=str).encode("utf-8")
req = urllib.request.Request(url, data=body, method="PUT")
req.add_header("Content-Type", "application/json")
with urllib.request.urlopen(req, timeout=30) as resp:
if resp.status >= 400:
raise RuntimeError(f"Failed ES upsert for {doc['concept_id']}: HTTP {resp.status}")
def gremlin_upsert(gremlin_url: str, concept: Dict[str, Any]) -> None:
from gremlin_python.driver import client as gremlin_client
from gremlin_python.driver.serializer import GraphSONSerializersV3d0
created_at = concept.get("created_at") or utc_now_iso()
updated_at = concept.get("updated_at") or utc_now_iso()
query = """
g.V().hasLabel('Concept').has('concept_id', concept_id).fold()
.coalesce(
unfold(),
addV('Concept').property('concept_id', concept_id).property('created_at', created_at)
)
.property('canonical_name', canonical_name)
.property('kind', kind)
.property('concept_type', concept_type)
.property('display_name', display_name)
.property('description', description)
.property('text', text)
.property('source_table', source_table)
.property('source_pk', source_pk)
.property('release_name', release_name)
.property('ref_hash', ref_hash)
.property('attributes_json', attributes_json)
.property('aliases', aliases_json)
.property('external_ids', external_ids_json)
.property('tags', tags_json)
.property('fingerprint', fingerprint)
.property('latest_cid', latest_cid)
.property('summary', summary)
.property('updated_at', updated_at)
.values('concept_id')
"""
c = gremlin_client.Client(
gremlin_url,
"g",
message_serializer=GraphSONSerializersV3d0(),
)
try:
c.submit(
query,
{
"concept_id": concept["concept_id"],
"canonical_name": concept.get("canonical_name") or "",
"kind": concept.get("kind") or "",
"concept_type": concept.get("concept_type") or "",
"display_name": concept.get("display_name") or "",
"description": concept.get("description") or "",
"text": concept.get("text") or "",
"source_table": concept.get("source_table") or "",
"source_pk": concept.get("source_pk") or "",
"release_name": concept.get("release_name") or "",
"ref_hash": concept.get("ref_hash") or "",
"attributes_json": concept.get("attributes_json") or "{}",
"aliases_json": json.dumps(concept.get("aliases", []), ensure_ascii=False),
"external_ids_json": json.dumps(concept.get("external_ids", {}), ensure_ascii=False),
"tags_json": json.dumps(concept.get("tags", []), ensure_ascii=False),
"fingerprint": concept["fingerprint"],
"latest_cid": concept.get("latest_cid") or "",
"summary": concept.get("summary") or "",
"created_at": created_at,
"updated_at": updated_at,
},
).all().result()
finally:
c.close()
def _infer_concept_type(row: Dict[str, Any], source_table: Optional[str]) -> str:
explicit = first_str(row, ["concept_type", "kind", "type"])
if explicit:
return explicit.lower()
lower_table = (source_table or "").lower()
if "messages" in lower_table:
return "message"
if "docs" in lower_table or "documents" in lower_table:
return "document"
if "message_id" in row:
return "message"
if "doc_id" in row or "document_id" in row:
return "document"
return "entity"
def _source_pk(row: Dict[str, Any]) -> Optional[str]:
return first_str(row, ["source_pk", "message_id", "doc_id", "document_id", "id", "uuid"])
def row_to_concept(
row: Dict[str, Any],
source_table: Optional[str],
release_name: Optional[str],
ref_hash: Optional[str],
) -> Optional[Dict[str, Any]]:
concept_type = _infer_concept_type(row, source_table)
source_pk = _source_pk(row)
display_name = first_str(
row,
[
"display_name",
"canonical_name",
"title",
"name",
"subject",
"doc_name",
"document_name",
],
)
if not display_name and source_pk:
display_name = f"{concept_type}:{source_pk}"
if not display_name:
display_name = first_str(row, ["body", "text", "content"])
if display_name:
display_name = display_name[:120]
if not display_name:
return None
external_ids = parse_json_maybe(row.get("external_ids"), dict, {})
aliases = parse_json_maybe(row.get("aliases"), list, [])
tags = parse_json_maybe(row.get("tags"), list, [])
kind = first_str(row, ["kind", "type", "doc_type", "document_type"]) or concept_type
concept_id = first_str(row, ["concept_id", "doc_id", "document_id", "id", "uuid"])
if not concept_id and source_pk:
concept_id = f"{concept_type}:{source_pk}"
if not isinstance(concept_id, str) or not concept_id.strip():
concept_id = hashlib.sha256(
f"{concept_type}|{display_name}|{json.dumps(external_ids, sort_keys=True)}".encode("utf-8")
).hexdigest()
description = first_str(row, ["description", "summary", "abstract"])
if not description:
body = first_str(row, ["content", "text", "body"])
if body:
description = body[:512]
text = first_str(row, ["text", "content", "body"])
if not text:
text = description
# Keep typed attributes stable and searchable without exploding ES mapping.
attributes_obj = row
return {
"concept_id": concept_id,
"concept_type": concept_type,
"display_name": display_name,
"description": description,
"text": text,
"source_table": source_table,
"source_pk": source_pk,
"release_name": release_name,
"ref_hash": ref_hash,
"attributes_json": json.dumps(attributes_obj, ensure_ascii=False, default=str, sort_keys=True),
"canonical_name": display_name,
"kind": kind,
"aliases": aliases,
"external_ids": external_ids,
"tags": tags,
"latest_cid": first_str(row, ["latest_cid", "cid", "ipfs_cid"]),
"summary": description,
"created_at": to_iso(row.get("created_at")) or utc_now_iso(),
"updated_at": to_iso(row.get("updated_at")) or utc_now_iso(),
"fingerprint": make_fingerprint(display_name, concept_type, external_ids),
}
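# Illustrative helper (assumption: nothing in the pipeline calls this; it just
# mirrors the sha256 fallback inside row_to_concept above): the derived
# concept_id hashes type, name, and sorted external_ids, so re-projections of
# the same logical row produce the same id.
def _example_fallback_concept_id(concept_type: str, display_name: str, external_ids: Dict[str, Any]) -> str:
    payload = f"{concept_type}|{display_name}|{json.dumps(external_ids, sort_keys=True)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()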
def project_release(
manifest_file: Optional[str],
release_name: Optional[str],
concept_table: Optional[str],
nessie_ref: Optional[str],
releases_ref: Optional[str],
dry_run: bool,
targets: str,
) -> None:
if not manifest_file and not release_name:
raise ValueError("Provide either --manifest-file or --release-name.")
manifest: Optional[Dict[str, Any]] = load_manifest(manifest_file) if manifest_file else None
# Release-name mode: lookup manifest on registry ref (usually main), then project on release tag.
if manifest is None and release_name:
registry_ref = releases_ref or os.getenv("RELEASES_REF", "main")
spark, catalog = build_spark(registry_ref)
manifest = load_manifest_from_registry(spark, catalog, release_name)
ref = nessie_ref or infer_manifest_ref(manifest) or release_name
if ref != registry_ref:
spark.stop()
spark, catalog = build_spark(ref)
else:
ref = nessie_ref or (infer_manifest_ref(manifest) if manifest else None) or release_name
if not ref:
raise ValueError("Unable to infer Nessie ref/tag; pass --nessie-ref explicitly.")
spark, catalog = build_spark(ref)
table_identifiers: List[str] = extract_table_identifiers(manifest) if manifest else []
table = concept_table or (infer_concept_table(table_identifiers) if manifest else None)
if not table:
raise ValueError("Unable to infer concept table; pass --concept-table explicitly.")
if table.count(".") == 1:
table = f"{catalog}.{table}"
print(f"[INFO] Using Nessie ref/tag: {ref}")
print(f"[INFO] Reading table: {table}")
release_name_effective = None
ref_hash = None
if manifest:
rel = manifest.get("release")
if isinstance(rel, dict):
rel_name = rel.get("name")
if isinstance(rel_name, str) and rel_name.strip():
release_name_effective = rel_name.strip()
nes = manifest.get("nessie")
if isinstance(nes, dict):
ref_obj = nes.get("ref")
if isinstance(ref_obj, dict):
h = ref_obj.get("hash")
if isinstance(h, str) and h.strip():
ref_hash = h.strip()
    if not release_name_effective and isinstance(release_name, str) and release_name.strip():
release_name_effective = release_name.strip()
    df = spark.table(table)
    # Concept tables are assumed small enough to collect to the driver in full.
    rows = [r.asDict(recursive=True) for r in df.collect()]
concepts = [c for c in (row_to_concept(r, table, release_name_effective, ref_hash) for r in rows) if c]
print(f"[INFO] Read {len(rows)} rows, {len(concepts)} valid concepts")
print("[STEP] spark_read_done")
if dry_run:
print("[INFO] Dry-run enabled. No writes performed.")
return
use_es = targets in ("both", "es")
use_gremlin = targets in ("both", "gremlin")
print(f"[INFO] Projection targets: {targets}")
gremlin_url = os.getenv("GREMLIN_URL", "ws://localhost:8182/gremlin")
es_url = os.getenv("ES_URL", "http://localhost:9200")
es_index = os.getenv("ES_INDEX", "concepts")
if use_es:
ensure_es_index(es_url, es_index)
success = 0
failures = 0
gremlin_missing = False
es_missing = False
for concept in concepts:
try:
wrote_any = False
if use_gremlin and not gremlin_missing:
try:
gremlin_upsert(gremlin_url, concept)
wrote_any = True
except ModuleNotFoundError as e:
gremlin_missing = True
                    print(f"[WARN] Gremlin dependency missing ({e}). Skipping Gremlin writes.")
except Exception as e:
print(f"[WARN] Gremlin upsert failed for {concept.get('concept_id')}: {e}")
if use_es and not es_missing:
try:
es_upsert(es_url, es_index, concept)
wrote_any = True
except ModuleNotFoundError as e:
es_missing = True
                    print(f"[WARN] ES dependency missing ({e}). Skipping ES writes.")
except Exception as e:
print(f"[WARN] ES upsert failed for {concept.get('concept_id')}: {e}")
if wrote_any:
success += 1
else:
failures += 1
print(f"[WARN] No projection target succeeded for {concept.get('concept_id')}")
except Exception as e:
failures += 1
print(f"[WARN] Failed concept {concept.get('concept_id')}: {e}")
print("[STEP] projection_done")
print(f"[DONE] Projected {success} concepts ({failures} failed)")
def parse_args() -> argparse.Namespace:
p = argparse.ArgumentParser(description="Project a lakehouse release into JanusGraph + Elasticsearch.")
p.add_argument("--manifest-file", help="Path to release manifest JSON")
p.add_argument("--release-name", help="Release name to load from releases_v2 registry")
p.add_argument("--concept-table", help="Full Iceberg table identifier holding concepts")
p.add_argument("--nessie-ref", help="Nessie branch/tag to read from (defaults to manifest tag)")
p.add_argument("--releases-ref", help="Nessie ref used to read releases_v2 (default: main)")
p.add_argument(
"--targets",
choices=["es", "gremlin", "both"],
default="both",
help="Projection targets to write (default: both)",
)
p.add_argument("--dry-run", action="store_true", help="Read and validate only")
return p.parse_args()
def main() -> None:
if load_dotenv is not None:
load_dotenv()
args = parse_args()
project_release(
manifest_file=args.manifest_file,
release_name=args.release_name,
concept_table=args.concept_table,
nessie_ref=args.nessie_ref,
releases_ref=args.releases_ref,
dry_run=args.dry_run,
targets=args.targets,
)
if __name__ == "__main__":
main()

requirements-app.txt Normal file

@@ -0,0 +1,8 @@
fastapi>=0.115,<1.0
uvicorn[standard]>=0.32,<1.0
pydantic>=2.9,<3.0
httpx>=0.28,<1.0
gremlinpython>=3.7,<4.0
python-dotenv>=1.0,<2.0
requests>=2.32,<3.0
websocket-client>=1.8,<2.0

requirements-projector.txt Normal file

@@ -0,0 +1,4 @@
pyspark==3.5.8
python-dotenv>=1.0,<2.0
httpx>=0.28,<1.0
gremlinpython>=3.7,<4.0

run-projector-standard.sh Executable file

@@ -0,0 +1,67 @@
#!/usr/bin/env bash
set -euo pipefail
# Canonical projector command for lakehouse-core.
# Usage:
# ./run-projector-standard.sh # publish (both targets)
# ./run-projector-standard.sh --dry-run # validate only
# ./run-projector-standard.sh --targets es # ES-only publish
# ./run-projector-standard.sh --release-name rel_2026-02-14_docs-v1
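# Environment overrides are honored as well (see the defaults below), e.g.:
#   TARGETS=es MANIFEST_FILE=./manifests/rel_2026-02-14_docs-v1.json ./run-projector-standard.sh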
MANIFEST_FILE="${MANIFEST_FILE:-./manifests/rel_2026-02-14_docs-v1.json}"
CONCEPT_TABLE="${CONCEPT_TABLE:-lake.db1.docs}"
TARGETS="${TARGETS:-both}"
RELEASE_NAME="${RELEASE_NAME:-}"
MODE=""
while [[ $# -gt 0 ]]; do
case "$1" in
--dry-run)
MODE="--dry-run"
shift
;;
--targets)
TARGETS="${2:-}"
if [[ -z "$TARGETS" ]]; then
echo "--targets requires one of: es|gremlin|both" >&2
exit 1
fi
shift 2
;;
--manifest-file)
MANIFEST_FILE="${2:-}"
if [[ -z "$MANIFEST_FILE" ]]; then
echo "--manifest-file requires a value" >&2
exit 1
fi
shift 2
;;
--release-name)
RELEASE_NAME="${2:-}"
if [[ -z "$RELEASE_NAME" ]]; then
echo "--release-name requires a value" >&2
exit 1
fi
shift 2
;;
--concept-table)
CONCEPT_TABLE="${2:-}"
if [[ -z "$CONCEPT_TABLE" ]]; then
echo "--concept-table requires a value" >&2
exit 1
fi
shift 2
;;
*)
echo "Unknown argument: $1" >&2
exit 1
;;
esac
done
if [[ "$TARGETS" != "es" && "$TARGETS" != "gremlin" && "$TARGETS" != "both" ]]; then
echo "Invalid --targets value: $TARGETS (expected es|gremlin|both)" >&2
exit 1
fi
./run-projector-via-spark-container.sh "$MANIFEST_FILE" "$CONCEPT_TABLE" "$MODE" "$TARGETS" "$RELEASE_NAME"

run-projector-via-spark-container.sh Executable file

@@ -0,0 +1,63 @@
#!/usr/bin/env bash
set -euo pipefail
MANIFEST_FILE="${1:-/tmp/rel_2026-02-14_docs-v1.json}"
CONCEPT_TABLE="${2:-lake.db1.docs}"
MODE="${3:-}"
TARGETS="${4:-both}"
RELEASE_NAME="${5:-${RELEASE_NAME:-}}"
CONTAINER_NAME="${SPARK_CONTAINER_NAME:-spark}"
SPARK_PROPS="${SPARK_PROPS:-/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf}"
PACKAGES="${SPARK_PACKAGES:-org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.104.5}"
SCRIPT_LOCAL="${SCRIPT_LOCAL:-./release_projector.py}"
SCRIPT_REMOTE="/tmp/release_projector.py"
MANIFEST_REMOTE="/tmp/$(basename "$MANIFEST_FILE")"
if [[ ! -f "$SCRIPT_LOCAL" ]]; then
echo "release_projector.py not found at: $SCRIPT_LOCAL" >&2
exit 1
fi
if [[ -z "$RELEASE_NAME" && ! -f "$MANIFEST_FILE" ]]; then
  echo "manifest file not found: $MANIFEST_FILE (or pass a release name as argument 5)" >&2
exit 1
fi
docker cp "$SCRIPT_LOCAL" "$CONTAINER_NAME":"$SCRIPT_REMOTE"
if [[ -f "$MANIFEST_FILE" ]]; then
docker cp "$MANIFEST_FILE" "$CONTAINER_NAME":"$MANIFEST_REMOTE"
fi
ARGS=(
"$SCRIPT_REMOTE"
"--concept-table" "$CONCEPT_TABLE"
"--targets" "$TARGETS"
)
if [[ -n "$RELEASE_NAME" ]]; then
ARGS+=("--release-name" "$RELEASE_NAME")
else
ARGS+=("--manifest-file" "$MANIFEST_REMOTE")
fi
if [[ -n "$MODE" ]]; then
ARGS+=("$MODE")
fi
docker exec -e AWS_REGION="${AWS_REGION:-us-east-1}" \
-e AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
-e NESSIE_URI="${NESSIE_URI:-http://lakehouse-core:19120/api/v2}" \
-e NESSIE_WAREHOUSE="${NESSIE_WAREHOUSE:-s3a://lakehouse/warehouse}" \
-e S3_ENDPOINT="${S3_ENDPOINT:-http://lakehouse-core:9000}" \
-e AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID:-minioadmin}" \
-e AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY:-minioadmin}" \
-e GREMLIN_URL="${GREMLIN_URL:-ws://janus.rakeroots.lan:8182/gremlin}" \
-e ES_URL="${ES_URL:-http://janus.rakeroots.lan:9200}" \
-e ES_INDEX="${ES_INDEX:-concepts}" \
"$CONTAINER_NAME" \
/opt/spark/bin/spark-submit \
--properties-file "$SPARK_PROPS" \
--packages "$PACKAGES" \
"${ARGS[@]}"

setup_local_env.sh Executable file

@@ -0,0 +1,11 @@
#!/usr/bin/env bash
set -euo pipefail
VENV_DIR="${1:-.venv}"
python3 -m venv "$VENV_DIR"
"$VENV_DIR/bin/pip" install --upgrade pip
"$VENV_DIR/bin/pip" install -r requirements-app.txt -r requirements-projector.txt
echo "Environment ready: $VENV_DIR"
echo "Activate with: source $VENV_DIR/bin/activate"

ui/assets/app.js Normal file

@@ -0,0 +1,215 @@
function getConfig() {
return {
apiKey: document.getElementById("apiKey").value.trim(),
releaseName: document.getElementById("releaseName").value.trim(),
};
}
function saveConfig() {
const cfg = getConfig();
cfg.chatSessionId = document.getElementById("chatSessionId").value.trim();
localStorage.setItem("assistant_ui_cfg", JSON.stringify(cfg));
}
function loadConfig() {
try {
const raw = localStorage.getItem("assistant_ui_cfg");
if (!raw) return;
const cfg = JSON.parse(raw);
document.getElementById("apiKey").value = cfg.apiKey || "";
document.getElementById("releaseName").value = cfg.releaseName || "";
document.getElementById("chatSessionId").value = cfg.chatSessionId || "main";
} catch (_) {}
}
async function apiGet(path, params) {
const cfg = getConfig();
const url = new URL(path, window.location.origin);
Object.entries(params || {}).forEach(([k, v]) => {
if (v !== null && v !== undefined && String(v).length > 0) url.searchParams.set(k, String(v));
});
const r = await fetch(url, {
headers: { "X-Admin-Api-Key": cfg.apiKey },
});
if (!r.ok) throw new Error(await r.text());
return r.json();
}
async function apiPost(path, payload) {
const cfg = getConfig();
const r = await fetch(path, {
method: "POST",
headers: {
"Content-Type": "application/json",
"X-Admin-Api-Key": cfg.apiKey,
},
body: JSON.stringify(payload),
});
if (!r.ok) throw new Error(await r.text());
return r.json();
}
function renderRows(target, rows, formatter) {
  // Formatter output is injected as HTML; rows come from the admin API and
  // are treated as trusted, so field values are not HTML-escaped here.
  target.innerHTML = "";
if (!rows || rows.length === 0) {
target.innerHTML = '<div class="row">No rows.</div>';
return;
}
rows.forEach((row) => {
const el = document.createElement("div");
el.className = "row";
el.innerHTML = formatter(row);
target.appendChild(el);
});
}
async function loadInbox() {
const cfg = getConfig();
const q = document.getElementById("inboxQuery").value.trim();
const out = document.getElementById("inboxResults");
out.innerHTML = '<div class="row">Loading...</div>';
try {
const data = await apiGet("/assistant/inbox", { release_name: cfg.releaseName, q, limit: 20 });
renderRows(out, data.rows || [], (r) => {
const text = (r.text || r.summary || r.description || "").slice(0, 280);
return `
<div><strong>${r.display_name || r.concept_id || "message"}</strong></div>
<div>${text || "(no text)"}</div>
<div class="meta">${r.source_pk || ""} | ${r.release_name || ""}</div>
`;
});
} catch (e) {
out.innerHTML = `<div class="row">Error: ${String(e)}</div>`;
}
}
async function loadTasks() {
const cfg = getConfig();
const onlyPending = document.getElementById("onlyPending").checked;
const out = document.getElementById("taskResults");
out.innerHTML = '<div class="row">Loading...</div>';
try {
const data = await apiGet("/assistant/tasks", {
release_name: cfg.releaseName,
only_pending: onlyPending,
limit: 30,
});
renderRows(out, data.rows || [], (r) => {
const safeTodo = (r.todo || "").replace(/"/g, "&quot;");
return `
<div><strong>${r.todo || "(empty task)"}</strong></div>
<div class="meta">status=${r.status} | due=${r.due_hint || "-"} | who=${r.who || "-"}</div>
<div class="meta">source=${r.source_pk || ""} | release=${r.release_name || ""}</div>
<div style="margin-top:6px"><button data-goal="${safeTodo}" class="use-goal">Use as goal</button></div>
`;
});
document.querySelectorAll(".use-goal").forEach((btn) => {
btn.addEventListener("click", () => {
const goal = btn.getAttribute("data-goal") || "";
document.getElementById("goalText").value = goal;
});
});
} catch (e) {
out.innerHTML = `<div class="row">Error: ${String(e)}</div>`;
}
}
async function makeDraft() {
const cfg = getConfig();
const goal = document.getElementById("goalText").value.trim();
const recipient = document.getElementById("recipient").value.trim();
const out = document.getElementById("draftOutput");
if (!goal) {
out.textContent = "Provide goal text first.";
return;
}
out.textContent = "Generating...";
try {
const data = await apiPost("/assistant/draft", {
task_type: "message",
goal,
recipient: recipient || null,
tone: "friendly-professional",
constraints: ["keep it concise"],
release_name: cfg.releaseName || null,
max_sources: 5,
});
const sourceLine = (data.sources || []).map((s) => s.concept_id).filter(Boolean).slice(0, 5).join(", ");
out.textContent = `${data.draft || ""}\n\nconfidence=${data.confidence}\nneeds_review=${data.needs_review}\nsources=${sourceLine}`;
} catch (e) {
out.textContent = `Error: ${String(e)}`;
}
}
async function saveLearn() {
const cfg = getConfig();
const title = document.getElementById("learnTitle").value.trim();
const tags = document.getElementById("learnTags").value
.split(",")
.map((x) => x.trim())
.filter(Boolean);
const text = document.getElementById("learnText").value.trim();
const out = document.getElementById("learnOutput");
if (!text) {
out.textContent = "Provide note text first.";
return;
}
out.textContent = "Saving...";
try {
const data = await apiPost("/assistant/learn", {
text,
title: title || null,
tags,
release_name: cfg.releaseName || null,
});
out.textContent = `saved=${data.stored}\nconcept_id=${data.concept_id}\ntitle=${data.title}`;
document.getElementById("learnText").value = "";
} catch (e) {
out.textContent = `Error: ${String(e)}`;
}
}
function appendChat(role, text, meta) {
const target = document.getElementById("chatTranscript");
const el = document.createElement("div");
el.className = "row";
el.innerHTML = `
<div><strong>${role}</strong></div>
<div>${(text || "").replace(/\n/g, "<br/>")}</div>
${meta ? `<div class="meta">${meta}</div>` : ""}
`;
target.prepend(el);
}
async function sendChat() {
const cfg = getConfig();
const sessionInput = document.getElementById("chatSessionId");
const session_id = (sessionInput.value || "main").trim();
sessionInput.value = session_id;
const messageEl = document.getElementById("chatMessage");
const message = messageEl.value.trim();
if (!message) return;
appendChat("user", message, `session=${session_id}`);
messageEl.value = "";
try {
const data = await apiPost("/assistant/chat", {
session_id,
message,
release_name: cfg.releaseName || null,
max_sources: 6,
});
const sourceLine = (data.sources || []).map((s) => s.concept_id).filter(Boolean).slice(0, 4).join(", ");
appendChat("assistant", data.answer || "", `confidence=${data.confidence} | sources=${sourceLine || "-"}`);
} catch (e) {
appendChat("assistant", `Error: ${String(e)}`, "");
}
}
document.getElementById("saveConfig").addEventListener("click", saveConfig);
document.getElementById("loadInbox").addEventListener("click", loadInbox);
document.getElementById("loadTasks").addEventListener("click", loadTasks);
document.getElementById("makeDraft").addEventListener("click", makeDraft);
document.getElementById("saveLearn").addEventListener("click", saveLearn);
document.getElementById("sendChat").addEventListener("click", sendChat);
loadConfig();

ui/assets/styles.css Normal file

@@ -0,0 +1,124 @@
:root {
--bg: #f2f4f5;
--panel: #ffffff;
--ink: #182126;
--muted: #5c6770;
--line: #dde4e8;
--accent: #0f766e;
}
* {
box-sizing: border-box;
}
body {
margin: 0;
font-family: "IBM Plex Sans", "Segoe UI", sans-serif;
color: var(--ink);
background: linear-gradient(165deg, #e9eff2 0%, #f8fafb 100%);
}
.layout {
max-width: 1100px;
margin: 0 auto;
padding: 18px;
display: grid;
gap: 14px;
}
.topbar {
background: var(--panel);
border: 1px solid var(--line);
border-radius: 10px;
padding: 12px;
display: flex;
justify-content: space-between;
align-items: center;
gap: 12px;
}
.topbar h1,
.panel h2 {
margin: 0;
font-size: 18px;
}
.panel {
background: var(--panel);
border: 1px solid var(--line);
border-radius: 10px;
padding: 12px;
}
.panel-header {
display: flex;
justify-content: space-between;
align-items: center;
gap: 12px;
margin-bottom: 8px;
}
.controls {
display: flex;
gap: 8px;
align-items: center;
flex-wrap: wrap;
}
input,
textarea,
button {
font: inherit;
}
input,
textarea {
border: 1px solid var(--line);
border-radius: 7px;
padding: 8px;
background: #fff;
}
button {
border: 1px solid #0d5f59;
background: var(--accent);
color: #fff;
border-radius: 7px;
padding: 8px 10px;
cursor: pointer;
}
button:hover {
filter: brightness(0.95);
}
.list {
display: grid;
gap: 8px;
}
.row {
border: 1px solid var(--line);
border-radius: 8px;
padding: 8px;
}
.row .meta {
color: var(--muted);
font-size: 12px;
margin-top: 4px;
}
.output {
white-space: pre-wrap;
border: 1px solid var(--line);
border-radius: 8px;
padding: 10px;
min-height: 96px;
background: #fbfdfe;
}
#chatTranscript {
max-height: 360px;
overflow: auto;
}

ui/index.html Normal file

@@ -0,0 +1,82 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Jecio Assistant Console</title>
<link rel="stylesheet" href="/ui/assets/styles.css" />
</head>
<body>
<main class="layout">
<header class="topbar">
<h1>Assistant Console</h1>
<div class="controls">
<input id="apiKey" type="password" placeholder="X-Admin-Api-Key" />
<input id="releaseName" type="text" placeholder="release_name (optional)" />
<button id="saveConfig">Save</button>
</div>
</header>
<section class="panel">
<div class="panel-header">
<h2>Inbox</h2>
<div class="controls">
<input id="inboxQuery" type="text" placeholder="Search text (optional)" />
<button id="loadInbox">Load Inbox</button>
</div>
</div>
<div id="inboxResults" class="list"></div>
</section>
<section class="panel">
<div class="panel-header">
<h2>Pending Tasks</h2>
<div class="controls">
<label><input id="onlyPending" type="checkbox" checked /> Only pending</label>
<button id="loadTasks">Load Tasks</button>
</div>
</div>
<div id="taskResults" class="list"></div>
</section>
<section class="panel">
<div class="panel-header">
<h2>Draft</h2>
<div class="controls">
<input id="recipient" type="text" placeholder="Recipient (optional)" />
<button id="makeDraft">Draft From Goal</button>
</div>
</div>
<textarea id="goalText" rows="3" placeholder="Goal text (or click 'Use as goal' from a task)"></textarea>
<pre id="draftOutput" class="output"></pre>
</section>
<section class="panel">
<div class="panel-header">
<h2>Learn</h2>
<div class="controls">
<input id="learnTitle" type="text" placeholder="Title (optional)" />
<input id="learnTags" type="text" placeholder="tags comma-separated (optional)" />
<button id="saveLearn">Save Note</button>
</div>
</div>
<textarea id="learnText" rows="3" placeholder="Knowledge note you want the assistant to remember"></textarea>
<pre id="learnOutput" class="output"></pre>
</section>
<section class="panel">
<div class="panel-header">
<h2>Chat</h2>
<div class="controls">
<input id="chatSessionId" type="text" placeholder="session_id (default: main)" />
<button id="sendChat">Send</button>
</div>
</div>
<textarea id="chatMessage" rows="2" placeholder="Ask the assistant..."></textarea>
<div id="chatTranscript" class="list"></div>
</section>
</main>
<script src="/ui/assets/app.js"></script>
</body>
</html>

write_assistant_action.py Normal file

@@ -0,0 +1,106 @@
import argparse
import base64
import json
from pyspark.sql import SparkSession, types as T
def d(s: str) -> str:
if not s:
return ""
return base64.b64decode(s.encode("ascii")).decode("utf-8")
def main() -> None:
p = argparse.ArgumentParser(description="Write assistant action row via Spark DataFrame")
p.add_argument("--table", required=True)
p.add_argument("--action-id", required=True)
p.add_argument("--created-at-utc", required=True)
p.add_argument("--task-type", required=True)
p.add_argument("--release-name", default="")
p.add_argument("--objective-b64", default="")
p.add_argument("--step-id", required=True)
p.add_argument("--step-title-b64", default="")
p.add_argument("--action-type", required=True)
p.add_argument("--requires-approval", default="false")
p.add_argument("--approved", default="false")
p.add_argument("--status", required=True)
p.add_argument("--output-b64", default="")
p.add_argument("--error-b64", default="")
args = p.parse_args()
requires_approval = str(args.requires_approval).lower() == "true"
approved = str(args.approved).lower() == "true"
objective = d(args.objective_b64)
step_title = d(args.step_title_b64)
output_json = d(args.output_b64)
error_text = d(args.error_b64)
    if not output_json:
        output_json = "{}"
    # Validate JSON shape but keep raw string in table.
    try:
json.loads(output_json)
except Exception:
output_json = "{}"
spark = SparkSession.builder.appName("write-assistant-action").getOrCreate()
spark.sql(
f"""
CREATE TABLE IF NOT EXISTS {args.table} (
action_id STRING,
created_at_utc STRING,
task_type STRING,
release_name STRING,
objective STRING,
step_id STRING,
step_title STRING,
action_type STRING,
requires_approval BOOLEAN,
approved BOOLEAN,
status STRING,
output_json STRING,
error_text STRING
) USING iceberg
"""
)
schema = T.StructType(
[
T.StructField("action_id", T.StringType(), False),
T.StructField("created_at_utc", T.StringType(), False),
T.StructField("task_type", T.StringType(), False),
T.StructField("release_name", T.StringType(), True),
T.StructField("objective", T.StringType(), True),
T.StructField("step_id", T.StringType(), False),
T.StructField("step_title", T.StringType(), True),
T.StructField("action_type", T.StringType(), False),
T.StructField("requires_approval", T.BooleanType(), False),
T.StructField("approved", T.BooleanType(), False),
T.StructField("status", T.StringType(), False),
T.StructField("output_json", T.StringType(), True),
T.StructField("error_text", T.StringType(), True),
]
)
row = [
(
args.action_id,
args.created_at_utc,
args.task_type,
args.release_name or "",
objective,
args.step_id,
step_title,
args.action_type,
requires_approval,
approved,
args.status,
output_json,
error_text,
)
]
df = spark.createDataFrame(row, schema=schema)
df.writeTo(args.table).append()
print(f"[DONE] Recorded assistant action {args.action_id} into {args.table}")
if __name__ == "__main__":
main()
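# Illustrative (assumption about the caller): the *-b64 flags carry base64 of
# UTF-8 text so arbitrary content survives shell quoting, e.g.:
#   python3 -c 'import base64; print(base64.b64encode(b"Draft the reply").decode())'
#   prints RHJhZnQgdGhlIHJlcGx5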

write_assistant_feedback.py Normal file

@@ -0,0 +1,103 @@
import argparse
import base64
import json
from pyspark.sql import SparkSession, types as T
def d(s: str) -> str:
if not s:
return ""
return base64.b64decode(s.encode("ascii")).decode("utf-8")
def main() -> None:
p = argparse.ArgumentParser(description="Write assistant feedback row via Spark DataFrame")
p.add_argument("--table", required=True)
p.add_argument("--feedback-id", required=True)
p.add_argument("--created-at-utc", required=True)
p.add_argument("--outcome", required=True)
p.add_argument("--task-type", required=True)
p.add_argument("--release-name", default="")
p.add_argument("--confidence", type=float, default=0.0)
p.add_argument("--needs-review", default="true")
p.add_argument("--goal-b64", default="")
p.add_argument("--draft-b64", default="")
p.add_argument("--final-b64", default="")
p.add_argument("--sources-b64", default="")
p.add_argument("--notes-b64", default="")
args = p.parse_args()
needs_review = str(args.needs_review).lower() == "true"
goal = d(args.goal_b64)
draft_text = d(args.draft_b64)
final_text = d(args.final_b64)
sources_json = d(args.sources_b64)
notes = d(args.notes_b64)
if not sources_json:
sources_json = "[]"
# Validate JSON shape but keep raw string in table.
try:
json.loads(sources_json)
except Exception:
sources_json = "[]"
spark = SparkSession.builder.appName("write-assistant-feedback").getOrCreate()
spark.sql(
f"""
CREATE TABLE IF NOT EXISTS {args.table} (
feedback_id STRING,
created_at_utc STRING,
outcome STRING,
task_type STRING,
release_name STRING,
confidence DOUBLE,
needs_review BOOLEAN,
goal STRING,
draft_text STRING,
final_text STRING,
sources_json STRING,
notes STRING
) USING iceberg
"""
)
schema = T.StructType(
[
T.StructField("feedback_id", T.StringType(), False),
T.StructField("created_at_utc", T.StringType(), False),
T.StructField("outcome", T.StringType(), False),
T.StructField("task_type", T.StringType(), False),
T.StructField("release_name", T.StringType(), True),
T.StructField("confidence", T.DoubleType(), True),
T.StructField("needs_review", T.BooleanType(), False),
T.StructField("goal", T.StringType(), True),
T.StructField("draft_text", T.StringType(), True),
T.StructField("final_text", T.StringType(), True),
T.StructField("sources_json", T.StringType(), True),
T.StructField("notes", T.StringType(), True),
]
)
row = [
(
args.feedback_id,
args.created_at_utc,
args.outcome,
args.task_type,
args.release_name or "",
float(args.confidence),
needs_review,
goal,
draft_text,
final_text,
sources_json,
notes,
)
]
df = spark.createDataFrame(row, schema=schema)
df.writeTo(args.table).append()
print(f"[DONE] Recorded assistant feedback {args.feedback_id} into {args.table}")
if __name__ == "__main__":
main()