# Release Projector

`release_projector.py` rebuilds serving projections (JanusGraph + Elasticsearch) from a lakehouse release manifest.

## What it does

1. Loads a release manifest JSON (or a `releases_v2` row containing `manifest_json`).
2. Resolves the Nessie tag/ref from the manifest (or from `--nessie-ref`).
3. Reads the concept Iceberg table at that ref through Spark + Iceberg + Nessie.
4. Upserts each concept into JanusGraph and Elasticsearch.

`release_projector.py` now accepts both concept-shaped rows and document-shaped rows.
For docs tables, it auto-detects typical columns:

- name: `canonical_name|title|name|subject`
- id: `concept_id|doc_id|document_id|id|uuid`
- summary text: `summary|description|abstract|content|text|body`

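The column auto-detection above can be pictured with a minimal sketch. The candidate lists come straight from this README; the matching rules (first listed candidate wins, case-insensitive comparison) and the function names `pick_column` and `detect_columns` are assumptions for illustration, not the script's confirmed behavior.

```python
from typing import Dict, Iterable, Optional, Sequence

# Candidate column names, in the order this README lists them.
NAME_COLUMNS = ["canonical_name", "title", "name", "subject"]
ID_COLUMNS = ["concept_id", "doc_id", "document_id", "id", "uuid"]
SUMMARY_COLUMNS = ["summary", "description", "abstract", "content", "text", "body"]


def pick_column(available: Iterable[str], candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate found in `available`, or None.

    Assumption: comparison is case-insensitive and earlier candidates win.
    """
    by_lower = {col.lower(): col for col in available}
    for candidate in candidates:
        if candidate in by_lower:
            return by_lower[candidate]
    return None


def detect_columns(available: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map the logical fields (name, id, summary) to actual table columns."""
    cols = list(available)
    return {
        "name": pick_column(cols, NAME_COLUMNS),
        "id": pick_column(cols, ID_COLUMNS),
        "summary": pick_column(cols, SUMMARY_COLUMNS),
    }
```

With a docs table exposing `doc_id`, `Title`, and `body`, this sketch would map `id` to `doc_id`, `name` to `Title`, and `summary` to `body`; a field with no matching column comes back as `None`.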

## Prerequisites

- Python deps: `python-dotenv`, `httpx`, `gremlinpython`, `pyspark`
- Spark/Iceberg/Nessie jars (default package coordinates are baked into the script)
- Network access to:
  - Nessie API (example: `http://lakehouse-core:19120/api/v2`)
  - MinIO S3 endpoint (example: `http://lakehouse-core:9000`)
  - JanusGraph Gremlin endpoint
  - Elasticsearch endpoint

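Before a run, it can help to confirm that the endpoints listed above are reachable. The following is a rough preflight sketch, not part of the projector: `host_port` and `is_reachable` are hypothetical helpers, and the default-port table is an assumption for URLs that omit a port.

```python
import socket
from typing import Tuple
from urllib.parse import urlparse

# Assumed default ports for URLs that do not state one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443, "ws": 80, "wss": 443}


def host_port(url: str) -> Tuple[str, int]:
    """Extract (host, port) from an endpoint URL."""
    parsed = urlparse(url)
    if not parsed.hostname:
        raise ValueError(f"no host in URL: {url!r}")
    port = parsed.port or DEFAULT_PORTS.get(parsed.scheme)
    if port is None:
        raise ValueError(f"cannot infer port for URL: {url!r}")
    return parsed.hostname, port


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    host, port = host_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_reachable("http://lakehouse-core:19120/api/v2")` only checks that the TCP port accepts connections; it says nothing about whether the service behind it is healthy.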

## Recommended isolated env

Do not install projector dependencies into the system Python.

## Preferred: existing spark container on lakehouse-core

This reuses your existing `spark` container and Spark properties file.

Standard command (frozen):

```bash
./run-projector-standard.sh
```

Run by release name (no manifest path):

```bash
./run-projector-standard.sh --release-name rel_2026-02-14_docs-v1
```

Standard dry-run:

```bash
./run-projector-standard.sh --dry-run
```

Copy files to the host:

```bash
rsync -av --delete /home/niklas/projects/jecio/ lakehouse-core.rakeroots.lan:/tmp/jecio/
```

Run a dry-run projection inside the `spark` container:

```bash
ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-via-spark-container.sh ./manifests/rel_2026-02-14_docs-v1.json lake.db1.docs --dry-run es'
```

Run the publish projection (writes to JanusGraph and Elasticsearch):

```bash
ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-standard.sh'
```
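This README does not say how the publish run performs its Elasticsearch upsert. One common shape is the `_update` endpoint with `doc_as_upsert`, sketched below as an assumption; `es_upsert_request` is a hypothetical helper that only builds the request and sends nothing.

```python
import json
from typing import Any, Dict, Tuple
from urllib.parse import quote


def es_upsert_request(
    es_url: str, index: str, doc_id: str, fields: Dict[str, Any]
) -> Tuple[str, str, str]:
    """Build (method, url, body) for an Elasticsearch partial-update upsert.

    Assumption: the document is keyed by `doc_id` and missing documents
    should be created from `fields` (doc_as_upsert).
    """
    # Percent-encode the id so characters such as "/" stay inside one path segment.
    safe_id = quote(str(doc_id), safe="")
    url = f"{es_url.rstrip('/')}/{index}/_update/{safe_id}"
    body = {"doc": fields, "doc_as_upsert": True}
    return "POST", url, json.dumps(body, sort_keys=True)
```

The returned triple could then be sent with `httpx.post(url, content=body, headers={"Content-Type": "application/json"})`. In a dry run, printing the triple instead of sending it would give a preview of what a publish run writes.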

`run-projector-via-spark-container.sh` uses:

- container: `spark` (override with `SPARK_CONTAINER_NAME`)
- properties file: `/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf` (override with `SPARK_PROPS`)
- Spark packages: Iceberg + Nessie extensions (override with `SPARK_PACKAGES`)
- arg4 `targets`: `es|gremlin|both` (default `both`)
- arg5 `release_name`: optional; if set, loads manifest from `releases_v2`

Direct projector usage:

```bash
python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets both
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets both
```
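The flags used in these commands could be modeled with `argparse` roughly as follows. The flag names and the `es|gremlin|both` choices come from this README. Several details are assumptions rather than the script's documented interface: that `--manifest-file` and `--release-name` are mutually exclusive, that `--concept-table` is required, and the helper `selected_targets`.

```python
import argparse
from typing import Set


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a CLI accepting the flags shown in this README."""
    parser = argparse.ArgumentParser(prog="release_projector.py")
    # Assumption: exactly one manifest source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--manifest-file", help="path to a release manifest JSON")
    source.add_argument("--release-name", help="release to look up in releases_v2")
    # Assumption: the table has no default when the script is called directly.
    parser.add_argument("--concept-table", required=True)
    parser.add_argument("--targets", choices=["es", "gremlin", "both"], default="both")
    parser.add_argument("--nessie-ref", default=None)
    parser.add_argument("--dry-run", action="store_true")
    return parser


def selected_targets(choice: str) -> Set[str]:
    """Expand the --targets value into the set of sinks to write to."""
    return {"es", "gremlin"} if choice == "both" else {choice}
```

For example, `build_parser().parse_args(["--release-name", "rel_2026-02-14_docs-v1", "--concept-table", "lake.db1.docs", "--targets", "es", "--dry-run"])` yields `targets == "es"` and `dry_run is True`.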

Local setup (fallback):

```bash
./setup_local_env.sh .venv-projector
source .venv-projector/bin/activate
```

Remote setup (fallback, venv on `lakehouse-core`):

```bash
scp release_projector.py requirements-projector.txt manifests/rel_2026-02-14_docs-v1.json lakehouse-core.rakeroots.lan:/tmp/
ssh lakehouse-core.rakeroots.lan 'python3 -m venv /tmp/jecio-projector-venv && /tmp/jecio-projector-venv/bin/pip install --upgrade pip && /tmp/jecio-projector-venv/bin/pip install -r /tmp/requirements-projector.txt'
```

## Required env vars (example)

```bash
export NESSIE_URI=http://lakehouse-core:19120/api/v2
export NESSIE_WAREHOUSE=s3a://lakehouse/warehouse
export S3_ENDPOINT=http://lakehouse-core:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin

export GREMLIN_URL=ws://janus.rakeroots.lan:8182/gremlin
export ES_URL=http://janus.rakeroots.lan:9200
export ES_INDEX=concepts
```
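As an illustration of how these variables might feed the Spark session, the sketch below validates the environment and derives Iceberg/Nessie catalog properties. The property keys are the standard Iceberg, Nessie, and Hadoop S3A names, but whether the script sets exactly these is an assumption. So are the catalog name `lake` (inferred from table names like `lake.db1.docs`) and the helpers `load_settings` and `spark_catalog_conf`.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED_VARS = [
    "NESSIE_URI", "NESSIE_WAREHOUSE", "S3_ENDPOINT",
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
    "GREMLIN_URL", "ES_URL", "ES_INDEX",
]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read the required variables, failing early with a list of missing names."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def spark_catalog_conf(
    settings: Mapping[str, str], nessie_ref: str, catalog: str = "lake"
) -> Dict[str, str]:
    """Build Spark properties for an Iceberg catalog backed by Nessie.

    Assumption: the catalog is named `lake` and data lives on an S3A endpoint
    that needs path-style access (typical for MinIO).
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": settings["NESSIE_URI"],
        f"{prefix}.ref": nessie_ref,
        f"{prefix}.warehouse": settings["NESSIE_WAREHOUSE"],
        "spark.hadoop.fs.s3a.endpoint": settings["S3_ENDPOINT"],
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }
```

Each key/value pair could be passed to `SparkSession.builder.config(key, value)`. In the container-based path described earlier, equivalent settings would normally come from the Spark properties file instead.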

## Run

```bash
/tmp/jecio-projector-venv/bin/python /tmp/release_projector.py \
  --manifest-file /tmp/rel_2026-02-14_docs-v1.json \
  --concept-table lake.db1.docs \
  --dry-run
```

Or locally:

```bash
python3 release_projector.py \
  --manifest-file /path/to/release.json \
  --concept-table lake.db1.concepts
```

If the manifest carries a Nessie tag in a field such as `nessie.tag`, you can omit `--nessie-ref`.
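A minimal sketch of that resolution, covering both a plain manifest and a `releases_v2`-style row wrapping it in `manifest_json`. Only `nessie.tag` and `manifest_json` are named in this README. The second lookup path `nessie.ref`, the rule that an explicit `--nessie-ref` takes precedence, and all function names are assumptions for illustration.

```python
import json
from typing import Any, Dict, Optional, Union

# "nessie.tag" is documented; "nessie.ref" is a hypothetical fallback.
REF_PATHS = ["nessie.tag", "nessie.ref"]


def load_manifest(payload: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Accept a manifest dict, a JSON string, or a row carrying `manifest_json`."""
    if isinstance(payload, str):
        payload = json.loads(payload)
    if "manifest_json" in payload:
        inner = payload["manifest_json"]
        return json.loads(inner) if isinstance(inner, str) else inner
    return payload


def lookup(manifest: Dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path such as 'nessie.tag'; return None if any key is absent."""
    node: Any = manifest
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def resolve_nessie_ref(manifest: Dict[str, Any], cli_ref: Optional[str] = None) -> str:
    """Pick the ref to read from. Assumption: an explicit CLI value wins."""
    if cli_ref:
        return cli_ref
    for path in REF_PATHS:
        value = lookup(manifest, path)
        if isinstance(value, str) and value:
            return value
    raise ValueError("no Nessie ref found in manifest; pass --nessie-ref")
```

Raising an error when neither source yields a ref avoids silently reading from a default branch, which would defeat the purpose of projecting a pinned release.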

Dry run:

```bash
python3 release_projector.py \
  --manifest-file /path/to/release.json \
  --concept-table lake.db1.concepts \
  --dry-run
```