# Release Projector
`release_projector.py` rebuilds serving projections (JanusGraph + Elasticsearch) from a lakehouse release manifest.
## What it does
1. Loads a release manifest JSON (or a `releases_v2` row containing `manifest_json`).
2. Resolves Nessie tag/ref from the manifest (or `--nessie-ref`).
3. Reads the concept Iceberg table from that ref through Spark + Iceberg + Nessie.
4. Upserts each concept into JanusGraph and Elasticsearch.
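
Steps 1 and 2 above can be sketched roughly as follows. This is an illustrative sketch, not the code in `release_projector.py`: the helper names `load_manifest` and `resolve_nessie_ref` are hypothetical, and it assumes that an explicit `--nessie-ref` takes precedence over the tag stored in the manifest under `nessie.tag`.

```python
import json
from typing import Any, Dict, Optional


def load_manifest(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Accept either a bare manifest or a releases_v2-style row with manifest_json."""
    if "manifest_json" in payload:
        raw = payload["manifest_json"]
        # Assumption: the row may hold the manifest as a JSON string or a parsed object.
        return json.loads(raw) if isinstance(raw, str) else dict(raw)
    return dict(payload)


def resolve_nessie_ref(manifest: Dict[str, Any], override: Optional[str] = None) -> str:
    """Return the Nessie ref to read from (assumed precedence: override, then nessie.tag)."""
    if override:
        return override
    nessie = manifest.get("nessie")
    tag = nessie.get("tag") if isinstance(nessie, dict) else None
    if not tag:
        raise ValueError("no Nessie ref: pass --nessie-ref or set nessie.tag in the manifest")
    return tag
```

For example, `resolve_nessie_ref(load_manifest(row))` would return the tag embedded in a `releases_v2` row, while passing a second argument overrides it.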
`release_projector.py` accepts both concept-shaped and document-shaped rows.
For docs tables, it auto-detects the name, ID, and summary columns from these typical column names:
- name: `canonical_name|title|name|subject`
- id: `concept_id|doc_id|document_id|id|uuid`
- summary text: `summary|description|abstract|content|text|body`
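
The column auto-detection above could look something like the sketch below. The candidate lists come from this document; the function names, the first-match-wins ordering, and the case-insensitive comparison are assumptions made for illustration and may differ from what the script actually does.

```python
from typing import Dict, Iterable, Optional

NAME_COLUMNS = ("canonical_name", "title", "name", "subject")
ID_COLUMNS = ("concept_id", "doc_id", "document_id", "id", "uuid")
SUMMARY_COLUMNS = ("summary", "description", "abstract", "content", "text", "body")


def pick_column(available: Iterable[str], candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate found among the table columns, or None."""
    # Assumption: column names are compared case-insensitively.
    by_lower = {col.lower(): col for col in available}
    for candidate in candidates:
        if candidate in by_lower:
            return by_lower[candidate]
    return None


def detect_columns(available: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each logical field (name, id, summary) to a concrete column name."""
    cols = list(available)
    return {
        "name": pick_column(cols, NAME_COLUMNS),
        "id": pick_column(cols, ID_COLUMNS),
        "summary": pick_column(cols, SUMMARY_COLUMNS),
    }
```

With a docs table whose columns are `doc_id`, `title`, `body`, and `created_at`, this sketch maps name to `title`, id to `doc_id`, and summary to `body`.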
## Prerequisites
- Python deps: `python-dotenv`, `httpx`, `gremlinpython`, `pyspark`
- Spark/Iceberg/Nessie jars (default package coordinates are baked into the script)
- Network access to:
  - Nessie API (example: `http://lakehouse-core:19120/api/v2`)
  - MinIO S3 endpoint (example: `http://lakehouse-core:9000`)
  - JanusGraph Gremlin endpoint
  - Elasticsearch endpoint
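
A quick way to confirm that the Python dependencies listed above are importable in the active environment is a small check like this one. It is a sketch only: `missing_packages` is a hypothetical helper, and the mapping from pip package name to import name is stated here as an assumption (`python-dotenv` imports as `dotenv`, `gremlinpython` as `gremlin_python`).

```python
from importlib.util import find_spec

# Assumed mapping: pip distribution name -> importable top-level module name.
REQUIRED_MODULES = {
    "python-dotenv": "dotenv",
    "httpx": "httpx",
    "gremlinpython": "gremlin_python",
    "pyspark": "pyspark",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose module cannot be found."""
    return sorted(pkg for pkg, module in required.items() if find_spec(module) is None)


if __name__ == "__main__":
    missing = missing_packages()
    print("missing packages:", ", ".join(missing) if missing else "none")
```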
## Recommended isolated env
Do not install projector dependencies into the system Python. Use the existing `spark` container (preferred) or a dedicated virtualenv (fallback).
## Preferred: existing spark container on lakehouse-core
This reuses your existing `spark` container and Spark properties file.
Standard command (frozen):
```bash
./run-projector-standard.sh
```
Run by release name (no manifest path):
```bash
./run-projector-standard.sh --release-name rel_2026-02-14_docs-v1
```
Standard dry-run:
```bash
./run-projector-standard.sh --dry-run
```
Copy files to host:
```bash
rsync -av --delete /home/niklas/projects/jecio/ lakehouse-core.rakeroots.lan:/tmp/jecio/
```
Run dry-run projection inside `spark` container:
```bash
ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-via-spark-container.sh ./manifests/rel_2026-02-14_docs-v1.json lake.db1.docs --dry-run es'
```
Run publish projection (writes Janus/ES):
```bash
ssh lakehouse-core.rakeroots.lan 'cd /tmp/jecio && ./run-projector-standard.sh'
```
`run-projector-via-spark-container.sh` uses:
- container: `spark` (override with `SPARK_CONTAINER_NAME`)
- properties file: `/opt/lakehouse/spark-conf/lakehouse-spark-defaults.conf` (override with `SPARK_PROPS`)
- Spark packages: Iceberg + Nessie extensions (override with `SPARK_PACKAGES`)
- positional argument 4, `targets`: `es|gremlin|both` (default `both`)
- positional argument 5, `release_name`: optional; if set, the manifest is loaded from `releases_v2` instead of the manifest file
Direct projector usage:
```bash
python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --release-name rel_2026-02-14_docs-v1 --concept-table lake.db1.docs --targets both
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets es --dry-run
python3 release_projector.py --manifest-file manifests/rel_2026-02-14_docs-v1.json --concept-table lake.db1.docs --targets both
```
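
The flags used in this document suggest a command-line interface along the lines of the `argparse` sketch below. It is reconstructed from the examples here, not copied from `release_projector.py`: treating `--manifest-file` and `--release-name` as mutually exclusive, making `--concept-table` required, and defaulting `--targets` to `both` are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in this document."""
    parser = argparse.ArgumentParser(prog="release_projector.py")
    # Assumption: exactly one manifest source is given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--manifest-file", help="path to a release manifest JSON")
    source.add_argument("--release-name", help="release to look up in releases_v2")
    parser.add_argument("--concept-table", required=True, help="Iceberg table, e.g. lake.db1.docs")
    parser.add_argument("--nessie-ref", help="override the Nessie tag/ref from the manifest")
    parser.add_argument("--targets", choices=("es", "gremlin", "both"), default="both")
    parser.add_argument("--dry-run", action="store_true", help="read and map rows without writing")
    return parser
```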
Local setup (fallback):
```bash
./setup_local_env.sh .venv-projector
source .venv-projector/bin/activate
```
Remote setup (fallback, venv on `lakehouse-core`):
```bash
scp release_projector.py requirements-projector.txt manifests/rel_2026-02-14_docs-v1.json lakehouse-core.rakeroots.lan:/tmp/
ssh lakehouse-core.rakeroots.lan 'python3 -m venv /tmp/jecio-projector-venv && /tmp/jecio-projector-venv/bin/pip install --upgrade pip && /tmp/jecio-projector-venv/bin/pip install -r /tmp/requirements-projector.txt'
```
## Required env vars (example)
```bash
export NESSIE_URI=http://lakehouse-core:19120/api/v2
export NESSIE_WAREHOUSE=s3a://lakehouse/warehouse
export S3_ENDPOINT=http://lakehouse-core:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export GREMLIN_URL=ws://janus.rakeroots.lan:8182/gremlin
export ES_URL=http://janus.rakeroots.lan:9200
export ES_INDEX=concepts
```
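
Before a run, it can help to confirm that every variable from the example above is set. The following is a minimal sketch; `missing_env_vars` is a hypothetical helper, and it assumes that the projector needs all eight variables to be non-empty.

```python
import os

# Variable names taken from the example above.
REQUIRED_ENV_VARS = (
    "NESSIE_URI",
    "NESSIE_WAREHOUSE",
    "S3_ENDPOINT",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "GREMLIN_URL",
    "ES_URL",
    "ES_INDEX",
)


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty, in listed order."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    print("missing env vars:", ", ".join(missing) if missing else "none")
```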
## Run
Remote, using the venv and the files copied to `/tmp` in the remote setup:
```bash
/tmp/jecio-projector-venv/bin/python /tmp/release_projector.py \
  --manifest-file /tmp/rel_2026-02-14_docs-v1.json \
  --concept-table lake.db1.docs \
  --dry-run
```
Or local:
```bash
python3 release_projector.py \
  --manifest-file /path/to/release.json \
  --concept-table lake.db1.concepts
```
If the manifest carries a Nessie tag (for example in `nessie.tag`), you can omit `--nessie-ref`.
Dry run:
```bash
python3 release_projector.py \
  --manifest-file /path/to/release.json \
  --concept-table lake.db1.concepts \
  --dry-run
```