
Commit 79cc99c

docs: map pipeline concepts to implementation
1 parent 1bf4261 commit 79cc99c

6 files changed: +395 −12 lines


README.md

Lines changed: 17 additions & 0 deletions
@@ -80,11 +80,28 @@ Use `dependency_setup/setup_glossapi.sh` for the Docling environment, or `depend
 See the refreshed docs (`docs/index.md`) for detailed environment notes, CUDA/ORT combinations, and troubleshooting tips.

 ## Repo Landmarks
+- `docs/code_map.md`: fast map from pipeline ideas to implementing classes and files.
+- `docs/pipeline.md`: stage contracts, key parameters, and artifact outputs.
 - `samples/lightweight_pdf_corpus/`: 20 one-page PDFs with manifest + expected Markdown.
 - `src/glossapi/`: Corpus pipeline, cleaners, and orchestration logic.
 - `tests/test_pipeline_smoke.py`: Minimal regression entry point (uses the lightweight corpus).
 - `docs/`: MkDocs site with onboarding, pipeline recipes, and configuration guides.

+## Pipeline map
+
+Use this as the shortest path from a documentation concept to the public call that implements it.
+
+| Stage | Main call | Important parameters | Writes |
+| --- | --- | --- | --- |
+| Download | `Corpus.download(...)` | `input_parquet`, `links_column`, `parallelize_by`, downloader kwargs | `downloads/`, `download_results/*.parquet` |
+| Extract (Phase-1) | `Corpus.extract(...)` | `input_format`, `phase1_backend`, `force_ocr`, `use_gpus`, `export_doc_json`, `emit_formula_index` | `markdown/<stem>.md`, `json/<stem>.docling.json(.zst)`, `json/metrics/*.json` |
+| Clean | `Corpus.clean(...)` | `threshold`, `drop_bad`, `empty_char_threshold`, `empty_min_pages` | `clean_markdown/<stem>.md`, updated parquet metrics/flags |
+| OCR / math follow-up | `Corpus.ocr(...)` | `mode`, `fix_bad`, `math_enhance`, `use_gpus`, `devices` | refreshed `markdown/<stem>.md`, optional `json/<stem>.latex_map.jsonl` |
+| Section | `Corpus.section()` | uses cleaner/parquet outputs to choose inputs | `sections/sections_for_annotation.parquet` |
+| Annotate | `Corpus.annotate(...)` | `annotation_type`, `fully_annotate` | `classified_sections.parquet`, `fully_annotated_sections.parquet` |
+| Triage math density | `Corpus.triage_math()` | no required args | updated `download_results/*.parquet` routing columns |
+| JSONL export | `Corpus.jsonl(...)` | `output_path` | merged training/export JSONL |
+
 ## Contributing
 - Run `pytest tests/test_pipeline_smoke.py` for a fast end-to-end check.
 - Regenerate the lightweight corpus via `generate_pdfs.py` and commit the updated PDFs + manifest together.
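The stage order in the Pipeline map table reads naturally as a driver script. A minimal sketch (the paths and export file name are illustrative, and it assumes the `glossapi` package is installed):

```python
from pathlib import Path


def run_pipeline(input_dir: str, output_dir: str) -> None:
    """Run the documented stages in order; all paths here are illustrative."""
    from glossapi import Corpus  # deferred so this sketch imports without glossapi installed

    corpus = Corpus(input_dir=input_dir, output_dir=output_dir)
    corpus.extract(input_format="pdf")       # Phase-1: markdown + optional Docling JSON
    corpus.clean(threshold=0.10)             # quality gate; flags docs needing OCR
    corpus.ocr(mode="ocr_bad_then_math")     # retry flagged docs, then math enrichment
    corpus.section()                         # sections/sections_for_annotation.parquet
    corpus.annotate(annotation_type="text")  # classified/fully annotated parquet
    corpus.jsonl(output_path=str(Path(output_dir) / "export.jsonl"))
```

Each call maps to one row of the table above; see `docs/api/corpus.md` for the full parameter contracts.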

docs/api/corpus.md

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
# API Reference — `glossapi.Corpus`

The `Corpus` class is the high‑level entrypoint for the pipeline. Below are the most commonly used methods.

Use this page as a compact contract reference. For the stage-by-stage artifact view, see `../pipeline.md`. For the source-level ownership map, see `../code_map.md`.

## Constructor

```python
glossapi.Corpus(
    input_dir: str | Path,
    output_dir: str | Path,
    section_classifier_model_path: str | Path | None = None,
    extraction_model_path: str | Path | None = None,
    metadata_path: str | Path | None = None,
    annotation_mapping: dict[str, str] | None = None,
    downloader_config: dict[str, Any] | None = None,
    log_level: int = logging.INFO,
    verbose: bool = False,
)
```

- `input_dir`: source files (PDF/DOCX/HTML/…)
- `output_dir`: pipeline outputs (markdown, json, sections, …)
- `downloader_config`: defaults for `download()` (e.g., concurrency, cookies)
- Main side effects: creates the standard output folders and lazily initializes the extractor, sectioner, and classifier.

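A hedged construction sketch (the `downloader_config` key below is hypothetical; the constructor only documents that the dict holds `download()` defaults such as concurrency or cookies):

```python
import logging


def build_corpus(input_dir: str, output_dir: str):
    """Construct a Corpus with explicit logging; assumes glossapi is installed."""
    from glossapi import Corpus  # deferred import keeps the sketch importable

    return Corpus(
        input_dir=input_dir,    # source PDFs/DOCX/HTML live here
        output_dir=output_dir,  # markdown/, json/, sections/, ... are created here
        downloader_config={"concurrency": 8},  # hypothetical key name
        log_level=logging.INFO,
        verbose=False,
    )
```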
## extract()

```python
extract(
    input_format: str = 'all',
    num_threads: int | None = None,
    accel_type: str = 'CUDA',  # 'CPU'|'CUDA'|'MPS'|'Auto'
    *,
    force_ocr: bool = False,
    formula_enrichment: bool = False,
    code_enrichment: bool = False,
    filenames: list[str] | None = None,
    skip_existing: bool = True,
    use_gpus: str = 'single',  # 'single'|'multi'
    devices: list[int] | None = None,
    use_cls: bool = False,
    benchmark_mode: bool = False,
    export_doc_json: bool = True,
    emit_formula_index: bool = False,
) -> None
```

- Purpose: Phase‑1 extraction from source files into markdown plus optional JSON intermediates.
- Typical inputs:
  - files already present in `downloads/`
  - or an explicit `filenames` list
- Important parameters:
  - `phase1_backend='safe'|'docling'|'auto'`: PyPDFium for stability vs Docling for native layout/OCR
  - `force_ocr=True`: turn on OCR during extraction
  - `use_gpus='multi'`: use all visible GPUs through a shared work queue
  - `export_doc_json=True`: write `json/<stem>.docling.json(.zst)`
  - `emit_formula_index=True`: also write `json/<stem>.formula_index.jsonl`
- Main outputs:
  - `markdown/<stem>.md`
  - `json/<stem>.docling.json(.zst)` when enabled
  - `json/metrics/<stem>.metrics.json`
  - `json/metrics/<stem>.per_page.metrics.json`

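For example, a GPU extraction pass that also exports the Phase‑2 inputs could be wrapped like this (a sketch against the signature above; `corpus` is any constructed `Corpus`):

```python
def run_extract(corpus) -> None:
    """Phase-1 pass: multi-GPU, Docling JSON export, formula index for later triage."""
    corpus.extract(
        input_format="pdf",       # restrict to PDFs instead of the 'all' default
        accel_type="CUDA",
        use_gpus="multi",         # shared work queue across all visible GPUs
        skip_existing=True,       # safe to re-run after interruptions
        export_doc_json=True,     # json/<stem>.docling.json(.zst)
        emit_formula_index=True,  # json/<stem>.formula_index.jsonl
    )
```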
## clean()

```python
clean(
    input_dir: str | Path | None = None,
    threshold: float = 0.10,
    num_threads: int | None = None,
    drop_bad: bool = True,
) -> None
```

- Purpose: run the Rust cleaner/noise pipeline and decide which documents are safe for downstream processing.
- Typical inputs:
  - `markdown/*.md`
  - metadata parquet if present
- Important parameters:
  - `threshold`: badness threshold
  - `drop_bad`: whether to remove bad files from downstream selection
  - `empty_char_threshold`, `empty_min_pages`: heuristics for OCR rerun recommendation
- Main outputs:
  - `clean_markdown/<stem>.md`
  - cleaner report parquet
  - updated parquet columns such as `filter`, `needs_ocr`, and metrics fields
- Operational note: this stage is the quality gate that drives `section()` and `ocr()`.

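A sketch of a stricter gate than the defaults (the threshold value is illustrative):

```python
def run_quality_gate(corpus) -> None:
    """Run the Rust cleaner with a stricter badness threshold than the 0.10 default."""
    corpus.clean(
        threshold=0.05,  # illustrative: flag more documents as bad
        drop_bad=True,   # exclude flagged files from section()/ocr() selection
    )
```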
## ocr()

```python
ocr(
    *,
    fix_bad: bool = True,
    mode: str | None = None,
    device: str | None = None,
    model_dir: str | Path | None = None,
    max_pages: int | None = None,
    persist_engine: bool = True,
    limit: int | None = None,
    dpi: int | None = None,
    precision: str | None = None,
    math_enhance: bool = True,
    math_targets: dict[str, list[tuple[int, int]]] | None = None,
    math_batch_size: int = 8,
    math_dpi_base: int = 220,
    use_gpus: str = 'single',
    devices: list[int] | None = None,
    force: bool | None = None,
) -> None
```

- Purpose: selective OCR retry and optional Phase‑2 math/code enrichment.
- Mode selection:
  - `ocr_bad`: rerun OCR only for cleaner-flagged docs
  - `math_only`: run enrichment from existing Docling JSON
  - `ocr_bad_then_math`: OCR flagged docs, then enrich them
- Important parameters:
  - `mode`, `fix_bad`, `math_enhance`
  - `use_gpus`, `devices`
  - `math_targets` to restrict enrichment to specific items
- Main outputs:
  - refreshed `markdown/<stem>.md`
  - refreshed cleaner/parquet metadata after OCR reruns
  - `json/<stem>.latex_map.jsonl` when enrichment runs

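The common "fix flagged documents, then enrich them" flow can be sketched as (the GPU-id handling is an assumption layered on the documented `use_gpus`/`devices` parameters):

```python
def rerun_flagged_ocr(corpus, gpu_ids=None) -> None:
    """OCR the cleaner-flagged docs, then run math enrichment on them."""
    corpus.ocr(
        mode="ocr_bad_then_math",  # OCR flagged docs, then Phase-2 enrichment
        fix_bad=True,
        math_enhance=True,
        use_gpus="multi" if gpu_ids and len(gpu_ids) > 1 else "single",
        devices=gpu_ids,           # e.g. [0, 1]; None lets the pipeline pick
    )
```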
## formula_enrich_from_json()

```python
formula_enrich_from_json(
    files: list[str] | None = None,
    *,
    device: str = 'cuda',
    batch_size: int = 8,
    dpi_base: int = 220,
    targets_by_stem: dict[str, list[tuple[int, int]]] | None = None,
) -> None
```

- Purpose: Phase‑2 GPU enrichment from previously exported Docling JSON.
- Typical inputs:
  - `json/<stem>.docling.json(.zst)`
  - optional formula/code index data
- Important parameters:
  - `files`: restrict to specific stems
  - `device`, `batch_size`, `dpi_base`
  - `targets_by_stem`: target specific `(page_no, item_index)` tuples
- Main outputs:
  - enriched markdown back into `markdown/<stem>.md`
  - `json/<stem>.latex_map.jsonl`

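Restricting enrichment to a couple of known formula items might look like this (the stem and the `(page_no, item_index)` pairs are invented for illustration):

```python
def enrich_two_formulas(corpus) -> None:
    """Phase-2 enrichment for two specific items in one document."""
    corpus.formula_enrich_from_json(
        files=["thesis_01"],  # hypothetical stem
        device="cuda",
        batch_size=8,
        dpi_base=220,
        targets_by_stem={"thesis_01": [(3, 0), (3, 1)]},  # (page_no, item_index)
    )
```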
## section(), annotate()

```python
section() -> None
annotate(annotation_type: str = 'text', fully_annotate: bool = True) -> None
```

- `section()`:
  - purpose: convert markdown into one row per section with structural flags
  - inputs: markdown selected by cleaner/parquet metadata
  - outputs: `sections/sections_for_annotation.parquet`
- `annotate()`:
  - purpose: classify sections and optionally expand them into full document structure
  - important parameters: `annotation_type='text'|'chapter'|'auto'`, `fully_annotate`
  - outputs: `classified_sections.parquet` and `fully_annotated_sections.parquet`

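The two calls are typically run back to back; a minimal sketch:

```python
def build_sections(corpus) -> None:
    """Produce section rows, then classify and fully annotate them."""
    corpus.section()  # writes sections/sections_for_annotation.parquet
    corpus.annotate(
        annotation_type="text",  # or 'chapter' / 'auto'
        fully_annotate=True,     # also emit fully_annotated_sections.parquet
    )
```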
## download()

```python
download(
    input_parquet: str | Path,
    *,
    links_column: str | None = None,
    parallelize_by: str | None = None,
    verbose: bool | None = None,
    **kwargs,
) -> pd.DataFrame
```

- Purpose: fetch source files described in a parquet dataset.
- Typical inputs:
  - an explicit `input_parquet`
  - or the first parquet file found in `input_dir`
- Important parameters:
  - `links_column`: override URL column name
  - `parallelize_by`: choose grouping for the scheduler
  - downloader kwargs via `**kwargs` for concurrency, SSL, cookies, retries, checkpoints, etc.
- Main outputs:
  - downloaded files in `downloads/`
  - partial/final results in `download_results/`
  - returned `pd.DataFrame` with download status and metadata

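A sketch of a download call (the column name and grouping value are hypothetical; the real values come from your parquet schema and the downloader configuration):

```python
def fetch_sources(corpus, parquet_path: str):
    """Fetch files listed in a parquet and return the status DataFrame."""
    return corpus.download(
        parquet_path,
        links_column="pdf_url",   # hypothetical URL column in the parquet
        parallelize_by="domain",  # hypothetical grouping for the scheduler
        verbose=True,
    )
```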
## triage_math()

- Purpose: summarize per-page metrics and recommend Phase‑2 for math-dense docs.
- Inputs: `json/metrics/<stem>.per_page.metrics.json`
- Outputs: updated `download_results` parquet with routing fields such as formula totals and phase recommendation

## Suggested Reading Order

1. `download()` if you start from URLs.
2. `extract()` for Phase‑1 layout/markdown.
3. `clean()` to decide what needs OCR.
4. `ocr()` if you need OCR retry or Phase‑2 enrichment.
5. `section()` and `annotate()` for structured downstream outputs.

---

See also:

- [Code map](../code_map.md)
- [Pipeline overview and artifacts](../pipeline.md)
- [Configuration and environment variables](../configuration.md)
- [OCR and math enrichment details](../ocr_and_math_enhancement.md)

docs/code_map.md

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
# Code Map

This page maps the main documentation ideas to the code that implements them. It is meant to help you move from "what does GlossAPI do?" to "where do I change it?" without reading the entire repo.

## Top-Level Entry Points

| Area | Main code | Responsibility |
| --- | --- | --- |
| Public package entry | `src/glossapi/__init__.py` | Applies the RapidOCR patch on import and exports `Corpus`, `GlossSectionClassifier`, `GlossDownloader`, and related classes. |
| High-level orchestration | `src/glossapi/corpus.py` | Coordinates the end-to-end pipeline and owns the main folder/artifact conventions. |
| Phase-1 extraction engine | `src/glossapi/gloss_extract.py` | Builds/reuses Docling converters, handles safe vs Docling backend selection, batching, timeouts, resumption, and artifact export. |

## Pipeline Stages

| Stage | Main methods/classes | Notes |
| --- | --- | --- |
| Download | `Corpus.download()`, `GlossDownloader.download_files()` | Supports URL expansion, deduplication, checkpoints, per-domain scheduling, and resume. |
| Extract | `Corpus.prime_extractor()`, `Corpus.extract()`, `GlossExtract.ensure_extractor()`, `GlossExtract.extract_path()` | Handles backend choice, GPU preflight, and single- vs multi-GPU dispatch. |
| Clean / quality gate | `Corpus.clean()` | Runs the Rust cleaner and merges quality metrics back into parquet metadata. |
| OCR retry / math follow-up | `Corpus.ocr()`, `Corpus.formula_enrich_from_json()` | Re-runs OCR only for flagged documents and optionally performs Phase-2 math/code enrichment from JSON. |
| Sectioning | `Corpus.section()`, `GlossSection.to_parquet()` | Converts markdown documents into section rows for later classification. |
| Classification / annotation | `Corpus.annotate()`, `GlossSectionClassifier.classify_sections()`, `GlossSectionClassifier.fully_annotate()` | Runs the SVM classifier and post-processes section labels into final document structure. |
| Export / triage | `Corpus.jsonl()`, `Corpus.triage_math()` | Produces training/export JSONL and computes routing hints for math-dense documents. |

## Backend and Runtime Helpers

| File | Responsibility |
| --- | --- |
| `src/glossapi/_pipeline.py` | Canonical builders for layout-only and RapidOCR-backed Docling pipelines. |
| `src/glossapi/rapidocr_safe.py` | Monkey-patch/shim for Docling 2.48.x so problematic OCR crops do not crash whole documents. |
| `src/glossapi/_rapidocr_paths.py` | Resolves packaged RapidOCR ONNX models and Greek keys, with env-var override support. |
| `src/glossapi/ocr_pool.py` | Reuses RapidOCR model instances where possible. |
| `src/glossapi/json_io.py` | Writes and reads compressed Docling JSON artifacts. |
| `src/glossapi/triage.py` | Summarizes per-page formula density and updates parquet routing metadata. |
| `src/glossapi/metrics.py` | Computes per-page parse/OCR/formula metrics from Docling conversions. |

## Rust Extensions

| Crate | Path | Purpose |
| --- | --- | --- |
| Cleaner | `rust/glossapi_rs_cleaner` | Markdown cleaning, script/noise filtering, and report generation used by `Corpus.clean()`. |
| Noise metrics | `rust/glossapi_rs_noise` | Fast quality metrics used by the broader pipeline and package build configuration. |

## Tests To Read First

| Test | Why it matters |
| --- | --- |
| `tests/test_pipeline_smoke.py` | Best high-level example of the intended artifact flow through extract -> clean -> OCR -> section. |
| `tests/test_corpus_guards.py` | Shows the contract around backend selection and GPU preflight. |
| `tests/test_jsonl_export.py` | Shows how final JSONL export merges cleaned markdown, parquet metadata, and math metrics. |
| `tests/test_rapidocr_patch.py` | Covers the Docling/RapidOCR compatibility patch and fallback paths. |

## If You Need To Change...

- Download scheduling or resume behavior: start in `src/glossapi/gloss_downloader.py`.
- Phase-1 parsing, OCR selection, or artifact generation: start in `src/glossapi/corpus.py` and `src/glossapi/gloss_extract.py`.
- Docling/RapidOCR wiring or provider issues: start in `src/glossapi/_pipeline.py`, `src/glossapi/rapidocr_safe.py`, and `src/glossapi/_rapidocr_paths.py`.
- Section labels or section-annotation rules: start in `src/glossapi/gloss_section_classifier.py`.
- Output folder contracts or stage sequencing: start in `src/glossapi/corpus.py`.

docs/index.md

Lines changed: 3 additions & 1 deletion
@@ -8,6 +8,7 @@ Welcome to the refreshed docs for GlossAPI, the GFOSS pipeline for turning acade
 - [Lightweight PDF Corpus](lightweight_corpus.md) — 20 one-page PDFs for smoke testing without Docling or GPUs.

 ## Learn the pipeline
+- [Code Map](code_map.md) links the main documentation ideas to the classes and files that implement them.
 - [Pipeline Overview](pipeline.md) explains each stage and the emitted artifacts.
 - [OCR & Math Enrichment](ocr_and_math_enhancement.md) covers DeepSeek OCR remediation and Docling-based enrichment.
 - [Multi-GPU & Benchmarking](multi_gpu.md) shares scaling and scheduling tips.
@@ -18,4 +19,5 @@ Welcome to the refreshed docs for GlossAPI, the GFOSS pipeline for turning acade
 - [AWS Job Distribution](aws_job_distribution.md) describes large-scale scheduling.

 ## Reference
-- [Corpus API](api_corpus_tmp.md) details public methods and parameters.
+- [Corpus API](api/corpus.md) gives the compact contract view of the main public methods.
+- [Legacy Corpus API Notes](api_corpus_tmp.md) remains available while the docs are being consolidated.
