This runbook covers the implementation that was added for:

- src/prert/
- src/prert/extract/gdpr_parser.py
- src/prert/extract/iso_parser.py
- src/prert/extract/nist_parser.py
- src/prert/chunking/line_chunker.py
- src/prert/chroma/client.py (SDK first, OpenAPI fallback)
- src/prert/chroma/schema.py (Qwen dense + Splade sparse)
- src/prert/chroma/search.py (dense/sparse/hybrid builders)
- src/prert/cli/extract.py
- src/prert/cli/migrate.py
- scripts/extract_phase1_controls.py
- scripts/migrate_to_chroma.py
- tests/test_extractors.py
- tests/test_chunking.py

From repo root:
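```bash
python -m pip install -e .
```

Optional dev tools (the extras spec is quoted so that shells such as zsh do not treat the brackets as a glob):

```bash
python -m pip install -e ".[dev]"
```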

Extract the Phase 1 controls and chunk them:

```bash
PYTHONPATH=src python scripts/extract_phase1_controls.py \
  --chunk \
  --output-dir artifacts/phase-1
```

Expected outputs:

- artifacts/phase-1/controls_gdpr.jsonl
- artifacts/phase-1/controls_iso27001.jsonl
- artifacts/phase-1/controls_nistpf.jsonl
- artifacts/phase-1/chunks_gdpr.jsonl
- artifacts/phase-1/chunks_iso27001.jsonl
- artifacts/phase-1/chunks_nistpf.jsonl
- controls_all.jsonl and chunks_all.jsonl

Dry-run the migration:

```bash
PYTHONPATH=src python scripts/migrate_to_chroma.py \
  --input-dir artifacts/phase-1 \
  --dry-run
```

The dry run verifies collection sharding and row counts without writing to the cloud.
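
For a local cross-check of the row counts a dry run reports, the per-regulation totals can be recomputed directly from the JSONL artifacts. The sketch below is not the migration script's own logic. It assumes that each JSONL line is one row and that every record carries a `regulation` field; the helper name `count_rows_by_regulation` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_rows_by_regulation(jsonl_path):
    """Count JSONL records per value of the (assumed) `regulation` field."""
    counts = Counter()
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Records without the assumed field are grouped under "unknown".
            counts[record.get("regulation", "unknown")] += 1
    return counts


if __name__ == "__main__":
    path = Path("artifacts/phase-1/chunks_all.jsonl")
    if path.exists():
        for regulation, rows in sorted(count_rows_by_regulation(path).items()):
            print(f"{regulation}: {rows}")
```

If the totals printed here differ from what the dry run reports, the difference is worth investigating before running a real migration.
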

To perform the actual migration, run the same command without `--dry-run`:

```bash
PYTHONPATH=src python scripts/migrate_to_chroma.py \
  --input-dir artifacts/phase-1
```

Default collection shards:

- gdpr_controls
- iso27001_controls
- nist_controls

Optional prefix:
```bash
PYTHONPATH=src python scripts/migrate_to_chroma.py \
  --input-dir artifacts/phase-1 \
  --collection-prefix prert_
```
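
One way to picture the effect of `--collection-prefix` is plain string concatenation onto the default shard names. This is an assumption about the naming scheme, not confirmed behaviour of `migrate_to_chroma.py`, and `shard_names` is a hypothetical helper used only for illustration.

```python
# Default shard names as listed in this runbook.
DEFAULT_SHARDS = ("gdpr_controls", "iso27001_controls", "nist_controls")


def shard_names(prefix=""):
    """Return the shard names with an optional prefix prepended (assumed scheme)."""
    return [f"{prefix}{name}" for name in DEFAULT_SHARDS]


print(shard_names("prert_"))
# prints ['prert_gdpr_controls', 'prert_iso27001_controls', 'prert_nist_controls']
```

Under that assumption, an empty prefix yields the default names unchanged.
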
Run tests:
```bash
PYTHONPATH=src python -m pytest tests -q
```

```mermaid
pie showData
    title Phase 1 Controls by Regulation (n=237)
    "GDPR" : 103
    "ISO 27001" : 68
    "NIST PF 1.1" : 66
```

| Figure | What it shows | Result |
|---|---|---|
| Phase 1 Extraction Composition | Control distribution across source standards | GDPR contributes the largest share while ISO and NIST remain balanced. |
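
The shares behind the chart follow directly from the three counts. The snippet below recomputes the total and the percentage per regulation from the figures stated in this section; nothing else is assumed.

```python
# Control counts per regulation, as shown in the pie chart.
counts = {"GDPR": 103, "ISO 27001": 68, "NIST PF 1.1": 66}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)   # prints 237
print(shares)  # prints {'GDPR': 43.5, 'ISO 27001': 28.7, 'NIST PF 1.1': 27.8}
```

GDPR accounts for roughly 43.5% of the controls, while ISO 27001 and NIST PF 1.1 sit within about one percentage point of each other.
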
For the full multi-figure dashboard (progress + quality indicators), see 09-phase1-phase2-progress-dashboard.md.
For improved doc search while developing, see:
Use it as a companion for docs lookup; runtime ingestion/search in this repo remains implemented via Python SDK + OpenAPI fallback.

- source_document_id
- control_id
- chunk_index
- regulation
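
The four field names listed above read like per-chunk metadata keys. Assuming that is their role, the sketch below shows how a simple line-based chunker could emit records carrying them. It is an illustration only and not the logic of `line_chunker.py`: the `text` field, the `max_lines` parameter, and the example identifier values are all hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class ChunkRecord:
    source_document_id: str
    control_id: str
    chunk_index: int
    regulation: str
    text: str  # hypothetical payload field, not part of the list above


def chunk_lines(text, max_lines, source_document_id, control_id, regulation):
    """Split text into groups of at most `max_lines` non-empty lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    return [
        ChunkRecord(
            source_document_id=source_document_id,
            control_id=control_id,
            chunk_index=start // max_lines,  # 0-based position of the chunk
            regulation=regulation,
            text="\n".join(lines[start:start + max_lines]),
        )
        for start in range(0, len(lines), max_lines)
    ]


# Example identifiers are made up for illustration.
chunks = chunk_lines(
    "first line\nsecond line\nthird line",
    max_lines=2,
    source_document_id="doc-001",
    control_id="control-001",
    regulation="GDPR",
)
for chunk in chunks:
    print(asdict(chunk))
```

Keeping `chunk_index` alongside `control_id` lets the chunks of one control be reassembled in order after retrieval.
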