jecio/PROJECTOR_USAGE.md
2026-02-14 21:10:26 +01:00

Release Projector

release_projector.py rebuilds serving projections (JanusGraph + Elasticsearch) from a lakehouse release manifest.

What it does

  1. Loads a release manifest JSON (or a releases_v2 row containing manifest_json).
  2. Resolves Nessie tag/ref from the manifest (or --nessie-ref).
  3. Reads the concept Iceberg table from that ref through Spark + Iceberg + Nessie.
  4. Upserts each concept into JanusGraph and Elasticsearch.
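
Step 1 accepts two input shapes. The sketch below shows one way to normalize them. It assumes a releases_v2 row is a JSON object whose manifest_json field holds the manifest either as an encoded JSON string or as a nested object. The function name load_manifest is hypothetical and is not taken from release_projector.py.

```python
import json
from typing import Any, Dict


def load_manifest(raw_text: str) -> Dict[str, Any]:
    """Return the release manifest from either accepted input shape.

    Assumption: a releases_v2 row carries the manifest under the
    "manifest_json" key, as an encoded JSON string or as an object.
    """
    data = json.loads(raw_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if "manifest_json" not in data:
        # Plain manifest file: use it as-is.
        return data
    inner = data["manifest_json"]
    if isinstance(inner, str):
        # The row stores the manifest as text; decode it a second time.
        inner = json.loads(inner)
    if not isinstance(inner, dict):
        raise ValueError("manifest_json must decode to a JSON object")
    return inner
```

With this shape, the rest of the pipeline only ever sees a plain manifest object, regardless of where it was loaded from.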

release_projector.py accepts both concept-shaped and document-shaped rows. For docs tables, it auto-detects each field from these candidate column names:

  • name: canonical_name|title|name|subject
  • id: concept_id|doc_id|document_id|id|uuid
  • summary text: summary|description|abstract|content|text|body
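
The detection above can be pictured as a first-match lookup over the candidate names. This is a minimal sketch, not the script's actual logic. Treating the left-to-right order as priority and matching case-insensitively are both assumptions, and detect_columns is a hypothetical name.

```python
from typing import Dict, Iterable, Optional, Tuple

# Candidate names copied from the list above. Using the listed order as
# priority is an assumption, not confirmed behaviour.
CANDIDATES: Dict[str, Tuple[str, ...]] = {
    "name": ("canonical_name", "title", "name", "subject"),
    "id": ("concept_id", "doc_id", "document_id", "id", "uuid"),
    "summary": ("summary", "description", "abstract", "content", "text", "body"),
}


def detect_columns(columns: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each logical field to the first candidate present in `columns`."""
    # Case-insensitive lookup that still returns the original column name.
    available = {c.lower(): c for c in columns}
    return {
        field: next((available[n] for n in names if n in available), None)
        for field, names in CANDIDATES.items()
    }
```

A field that maps to None means no candidate column was found. A caller would have to decide whether that is fatal (likely for id) or tolerable (possibly for summary).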

Prerequisites

  • Python deps: python-dotenv, httpx, gremlinpython, pyspark
  • Spark/Iceberg/Nessie jars (default package coordinates are baked into the script)
  • Network access to:
    • Nessie API (example: http://lakehouse-core:19120/api/v2)
    • MinIO S3 endpoint (example: http://lakehouse-core:9000)
    • JanusGraph Gremlin endpoint
    • Elasticsearch endpoint

Do not install the projector's dependencies into the system Python; use the spark container or a virtual environment as described below.

Preferred: existing spark container on lakehouse-core

This reuses your existing spark container and Spark properties file.

Standard command (frozen):

./run-projector-standard.sh

Run by release name (no manifest path):

./run-projector-standard.sh --release-name rel_2026-02-14_docs-v1

Standard dry-run:

./run-projector-standard.sh --dry-run

Copy files to host:

rsync -av --delete /home/niklas/projects/jecio/ lakehouse-core.rakeroots.lan:/tmp/jecio/

Run dry-run projection inside spark container:

ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-via-spark-container.sh ./manifests/rel_2026-02-14_docs-v1.json lake.db1.docs --dry-run es'

Run the publish projection (writes to JanusGraph and Elasticsearch):

ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-standard.sh'

run-projector-via-spark-container.sh uses these settings and positional arguments:

  • container: spark (override with SPARK_CONTAINER_NAME)
  • properties file: /opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf (override with SPARK_PROPS)
  • Spark packages: Iceberg + Nessie extensions (override with SPARK_PACKAGES)
  • arg4 (targets): es|gremlin|both (default: both)
  • arg5 (release_name): optional; if set, the manifest is loaded from releases_v2
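
The override behaviour above follows the usual "environment variable wins, otherwise use the default" pattern. The wrapper itself is a shell script. The Python sketch below only illustrates the resolution rules, and wrapper_settings is a hypothetical helper. The default Spark package coordinates are not listed in this document, so the sketch leaves them unset rather than guessing.

```python
from typing import Dict, Mapping, Optional

VALID_TARGETS = ("es", "gremlin", "both")


def wrapper_settings(
    env: Mapping[str, str], targets: Optional[str] = None
) -> Dict[str, Optional[str]]:
    """Resolve wrapper settings from the environment and the targets argument."""
    chosen = targets or "both"  # arg4 defaults to both
    if chosen not in VALID_TARGETS:
        raise ValueError(f"targets must be one of {VALID_TARGETS}, got {chosen!r}")
    return {
        "container": env.get("SPARK_CONTAINER_NAME", "spark"),
        "props": env.get(
            "SPARK_PROPS",
            "/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf",
        ),
        # None means "use the defaults baked into the script".
        "packages": env.get("SPARK_PACKAGES"),
        "targets": chosen,
    }
```

Calling wrapper_settings(os.environ, "es") would report the values the wrapper is expected to pick up in the current shell.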

Direct projector usage:

python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets both
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets both
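
For orientation, the flags used in these commands could be declared with argparse roughly as follows. This is a sketch of the interface as it appears in this document, not the script's real parser. Several details are assumptions: that --manifest-file and --release-name are mutually exclusive, that --targets accepts gremlin, and that it defaults to both.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="release_projector.py")

    # Assumption: exactly one manifest source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--manifest-file", help="path to a release manifest JSON")
    source.add_argument("--release-name", help="release to look up in releases_v2")

    parser.add_argument("--concept-table", required=True,
                        help="Iceberg table to read, e.g. lake.db1.docs")
    # Assumption: same choices and default as the wrapper's targets argument.
    parser.add_argument("--targets", choices=("es", "gremlin", "both"),
                        default="both")
    parser.add_argument("--nessie-ref", default=None,
                        help="override the ref found in the manifest")
    parser.add_argument("--dry-run", action="store_true",
                        help="read and transform, but skip writes")
    return parser
```

Parsing the first command above with this parser yields targets="es", dry_run=True and manifest_file=None.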

Local setup (fallback):

./setup_local_env.sh .venv-projector
source .venv-projector/bin/activate

Remote setup (fallback, venv on lakehouse-core):

scp release_projector.py requirements-projector.txt manifests/rel_2026-02-14_docs-v1.json lakehouse-core.rakeroots.lan:/tmp/
ssh lakehouse-core.rakeroots.lan 'python3 -m venv /tmp/jecio-projector-venv && /tmp/jecio-projector-venv/bin/pip install --upgrade pip && /tmp/jecio-projector-venv/bin/pip install -r /tmp/requirements-projector.txt'

Required env vars (example)

export NESSIE_URI=http://lakehouse-core:19120/api/v2
export NESSIE_WAREHOUSE=s3a://lakehouse/warehouse
export S3_ENDPOINT=http://lakehouse-core:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin

export GREMLIN_URL=ws://janus.rakeroots.lan:8182/gremlin
export ES_URL=http://janus.rakeroots.lan:9200
export ES_INDEX=concepts
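
A small preflight check can catch a missing variable before Spark starts. The sketch below simply tests the names listed above. missing_env_vars is a hypothetical helper, and treating every variable as required regardless of the chosen targets is an assumption.

```python
from typing import List, Mapping

# Names taken from the export lines above.
REQUIRED_VARS = (
    "NESSIE_URI",
    "NESSIE_WAREHOUSE",
    "S3_ENDPOINT",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "GREMLIN_URL",
    "ES_URL",
    "ES_INDEX",
)


def missing_env_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Passing os.environ and aborting when the returned list is non-empty gives a clear error message instead of a late connection failure.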

Run

/tmp/jecio-projector-venv/bin/python /tmp/release_projector.py \
  --manifest-file /tmp/rel_2026-02-14_docs-v1.json \
  --concept-table lake.db1.docs \
  --dry-run

Or locally:

python3 release_projector.py \
  --manifest-file /path/to/release.json \
  --concept-table lake.db1.concepts

If the manifest already carries a Nessie tag in a field such as nessie.tag, you can omit --nessie-ref.
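
The ref resolution described here could look like the sketch below. Only nessie.tag is named in this document. The second path nessie.ref, the rule that an explicit --nessie-ref wins over the manifest, and the function names are all assumptions made for illustration.

```python
from typing import Any, Mapping, Optional, Sequence

# "nessie.tag" comes from the text; "nessie.ref" is a hypothetical fallback.
DEFAULT_REF_PATHS = ("nessie.tag", "nessie.ref")


def lookup(manifest: Mapping[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as "nessie.tag"; return None if absent."""
    node: Any = manifest
    for key in dotted.split("."):
        if not isinstance(node, Mapping) or key not in node:
            return None
        node = node[key]
    return node


def resolve_nessie_ref(
    manifest: Mapping[str, Any],
    cli_ref: Optional[str] = None,
    paths: Sequence[str] = DEFAULT_REF_PATHS,
) -> str:
    """Pick the Nessie ref: explicit override first, then manifest fields."""
    if cli_ref:
        return cli_ref
    for path in paths:
        value = lookup(manifest, path)
        if isinstance(value, str) and value:
            return value
    raise ValueError("no Nessie ref found in manifest; pass --nessie-ref")
```

Failing loudly when no ref is found avoids silently reading from a default branch, which could project data that does not belong to the release.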

Dry run:

python3 release_projector.py \
  --manifest-file /path/to/release.json \
  --concept-table lake.db1.concepts \
  --dry-run