Release Projector
release_projector.py rebuilds serving projections (JanusGraph + Elasticsearch) from a lakehouse release manifest.
What it does
- Loads a release manifest JSON (or a releases_v2 row containing manifest_json).
- Resolves Nessie tag/ref from the manifest (or --nessie-ref).
- Reads the concept Iceberg table from that ref through Spark + Iceberg + Nessie.
- Upserts each concept into JanusGraph and Elasticsearch.
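The first step above accepts two input shapes. A minimal sketch of how that could be handled, assuming a releases_v2 row stores the manifest under manifest_json as either a JSON string or an already-decoded object. The function name load_manifest and the release_name key are hypothetical and are not taken from release_projector.py:

```python
import json


def load_manifest(payload: dict) -> dict:
    """Return the manifest dict from either supported input shape."""
    # Assumption: a releases_v2 row carries the manifest under "manifest_json",
    # either as a JSON string or as an already-decoded object.
    if "manifest_json" in payload:
        inner = payload["manifest_json"]
        return json.loads(inner) if isinstance(inner, str) else dict(inner)
    # Otherwise treat the payload as the manifest itself.
    return payload


# Hypothetical sample data, for illustration only.
plain = {"release_name": "rel_2026-02-14_docs-v1"}
row = {"release_name": "rel_2026-02-14_docs-v1", "manifest_json": json.dumps(plain)}
```

Both load_manifest(plain) and load_manifest(row) return the same manifest dict, so later steps do not need to know which shape was supplied.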
release_projector.py now accepts both concept-shaped rows and document-shaped rows.
For docs tables, it auto-detects typical columns:
- name: canonical_name|title|name|subject
- id: concept_id|doc_id|document_id|id|uuid
- summary text: summary|description|abstract|content|text|body
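The column auto-detection can be pictured as a first-match lookup over the candidate names listed above. This is a sketch only: it assumes the candidates are tried in the listed order and that matching is case-insensitive, neither of which is confirmed here. The helper names pick_column and detect_columns are hypothetical:

```python
# Candidate column names, copied from the lists above.
NAME_COLUMNS = ["canonical_name", "title", "name", "subject"]
ID_COLUMNS = ["concept_id", "doc_id", "document_id", "id", "uuid"]
SUMMARY_COLUMNS = ["summary", "description", "abstract", "content", "text", "body"]


def pick_column(available, candidates):
    """Return the first candidate present in `available`, or None."""
    # Assumption: matching ignores case and earlier candidates win.
    lowered = {col.lower(): col for col in available}
    for cand in candidates:
        if cand in lowered:
            return lowered[cand]
    return None


def detect_columns(available):
    """Map each logical field to the table column that would supply it."""
    return {
        "name": pick_column(available, NAME_COLUMNS),
        "id": pick_column(available, ID_COLUMNS),
        "summary": pick_column(available, SUMMARY_COLUMNS),
    }
```

For a docs table with columns doc_id, title and body, detect_columns would map name to title, id to doc_id and summary to body. A field with no matching column comes back as None, which a caller could treat as an error or as an empty value.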
Prerequisites
- Python deps: python-dotenv, httpx, gremlinpython, pyspark
- Spark/Iceberg/Nessie jars (default package coordinates are baked into script)
- Network access to:
  - Nessie API (example: http://lakehouse-core:19120/api/v2)
  - MinIO S3 endpoint (example: http://lakehouse-core:9000)
  - JanusGraph Gremlin endpoint
  - Elasticsearch endpoint
Recommended isolated env
Do not install projector dependencies into system Python.
Preferred: existing spark container on lakehouse-core
This reuses your existing spark container and Spark properties file.
Standard command (frozen):
./run-projector-standard.sh
Run by release name (no manifest path):
./run-projector-standard.sh --release-name rel_2026-02-14_docs-v1
Standard dry-run:
./run-projector-standard.sh --dry-run
Copy files to host:
rsync -av --delete /home/niklas/projects/jecio/ lakehouse-core.rakeroots.lan:/tmp/jecio/
Run dry-run projection inside spark container:
ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-via-spark-container.sh ./manifests/rel_2026-02-14_docs-v1.json lake.db1.docs --dry-run es'
Run publish projection (writes Janus/ES):
ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-standard.sh'
run-projector-via-spark-container.sh uses:
- container: spark (override with SPARK_CONTAINER_NAME)
- properties file: /opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf (override with SPARK_PROPS)
- Spark packages: Iceberg + Nessie extensions (override with SPARK_PACKAGES)
- arg4 targets: es|gremlin|both (default both)
- arg5 release_name: optional; if set, loads manifest from releases_v2
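The targets value selects which serving stores are written. A small sketch of how es|gremlin|both could be expanded into a set of backends; expand_targets is a hypothetical helper, and the real script may parse this differently:

```python
def expand_targets(value: str = "both") -> set:
    """Expand a targets value into the set of backends to write."""
    # The three accepted values and the default ("both") come from the list above.
    mapping = {
        "es": {"es"},
        "gremlin": {"gremlin"},
        "both": {"es", "gremlin"},
    }
    if value not in mapping:
        raise ValueError(f"unknown targets value: {value!r} (expected es|gremlin|both)")
    return mapping[value]
```

Rejecting unknown values up front avoids a run that silently writes to neither store.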
Direct projector usage:
python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets both
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets both
Local setup (fallback):
./setup_local_env.sh .venv-projector
source .venv-projector/bin/activate
Remote setup (fallback, venv on lakehouse-core):
scp release_projector.py requirements-projector.txt manifests/rel_2026-02-14_docs-v1.json lakehouse-core.rakeroots.lan:/tmp/
ssh lakehouse-core.rakeroots.lan 'python3 -m venv /tmp/jecio-projector-venv && /tmp/jecio-projector-venv/bin/pip install --upgrade pip && /tmp/jecio-projector-venv/bin/pip install -r /tmp/requirements-projector.txt'
Required env vars (example)
export NESSIE_URI=http://lakehouse-core:19120/api/v2
export NESSIE_WAREHOUSE=s3a://lakehouse/warehouse
export S3_ENDPOINT=http://lakehouse-core:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export GREMLIN_URL=ws://janus.rakeroots.lan:8182/gremlin
export ES_URL=http://janus.rakeroots.lan:9200
export ES_INDEX=concepts
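Before a publish run it can help to confirm that all of these variables are set. A minimal pre-flight sketch, assuming every variable listed above is required; missing_env is a hypothetical helper and not part of release_projector.py:

```python
import os

# Variable names copied from the export lines above.
REQUIRED_VARS = [
    "NESSIE_URI",
    "NESSIE_WAREHOUSE",
    "S3_ENDPOINT",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "GREMLIN_URL",
    "ES_URL",
    "ES_INDEX",
]


def missing_env(environ=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        raise SystemExit("missing env vars: " + ", ".join(missing))
```

Passing a plain dict as environ makes the check easy to exercise without touching the real environment.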
Run
/tmp/jecio-projector-venv/bin/python /tmp/release_projector.py \
--manifest-file /tmp/rel_2026-02-14_docs-v1.json \
--concept-table lake.db1.docs \
--dry-run
Or local:
python3 release_projector.py \
--manifest-file /path/to/release.json \
--concept-table lake.db1.concepts
If the manifest has a Nessie tag in fields like nessie.tag, you can omit --nessie-ref.
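One way to picture this fallback, as a sketch: it assumes an explicit --nessie-ref takes precedence over the manifest and checks only the nessie.tag field, while the script may look at other fields as well. resolve_nessie_ref and the sample tag value are hypothetical:

```python
def resolve_nessie_ref(manifest: dict, cli_ref=None) -> str:
    """Pick the Nessie ref to read from."""
    # Assumption: an explicit --nessie-ref value wins over the manifest.
    if cli_ref:
        return cli_ref
    # Fall back to the nessie.tag field of the manifest.
    nessie = manifest.get("nessie")
    if isinstance(nessie, dict) and nessie.get("tag"):
        return nessie["tag"]
    raise ValueError("no Nessie ref: pass --nessie-ref or set nessie.tag in the manifest")


# Hypothetical manifest fragment, for illustration only.
manifest = {"nessie": {"tag": "rel_2026-02-14_docs-v1"}}
```

Failing loudly when neither source provides a ref avoids reading the table from an unintended branch.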
Dry run:
python3 release_projector.py \
--manifest-file /path/to/release.json \
--concept-table lake.db1.concepts \
--dry-run