
Commit 79cc99c

docs: map pipeline concepts to implementation
1 parent 1bf4261 commit 79cc99c

6 files changed: +395 −12 lines


README.md

Lines changed: 17 additions & 0 deletions
@@ -80,11 +80,28 @@ Use `dependency_setup/setup_glossapi.sh` for the Docling environment, or `depend
 See the refreshed docs (`docs/index.md`) for detailed environment notes, CUDA/ORT combinations, and troubleshooting tips.

 ## Repo Landmarks
+- `docs/code_map.md`: fast map from pipeline ideas to implementing classes and files.
+- `docs/pipeline.md`: stage contracts, key parameters, and artifact outputs.
 - `samples/lightweight_pdf_corpus/`: 20 one-page PDFs with manifest + expected Markdown.
 - `src/glossapi/`: Corpus pipeline, cleaners, and orchestration logic.
 - `tests/test_pipeline_smoke.py`: Minimal regression entry point (uses the lightweight corpus).
 - `docs/`: MkDocs site with onboarding, pipeline recipes, and configuration guides.

+## Pipeline map
+
+Use this as the shortest path from a documentation concept to the public call that implements it.
+
+| Stage | Main call | Important parameters | Writes |
+| --- | --- | --- | --- |
+| Download | `Corpus.download(...)` | `input_parquet`, `links_column`, `parallelize_by`, downloader kwargs | `downloads/`, `download_results/*.parquet` |
+| Extract (Phase-1) | `Corpus.extract(...)` | `input_format`, `phase1_backend`, `force_ocr`, `use_gpus`, `export_doc_json`, `emit_formula_index` | `markdown/<stem>.md`, `json/<stem>.docling.json(.zst)`, `json/metrics/*.json` |
+| Clean | `Corpus.clean(...)` | `threshold`, `drop_bad`, `empty_char_threshold`, `empty_min_pages` | `clean_markdown/<stem>.md`, updated parquet metrics/flags |
+| OCR / math follow-up | `Corpus.ocr(...)` | `mode`, `fix_bad`, `math_enhance`, `use_gpus`, `devices` | refreshed `markdown/<stem>.md`, optional `json/<stem>.latex_map.jsonl` |
+| Section | `Corpus.section()` | uses cleaner/parquet outputs to choose inputs | `sections/sections_for_annotation.parquet` |
+| Annotate | `Corpus.annotate(...)` | `annotation_type`, `fully_annotate` | `classified_sections.parquet`, `fully_annotated_sections.parquet` |
+| Triage math density | `Corpus.triage_math()` | no required args | updated `download_results/*.parquet` routing columns |
+| JSONL export | `Corpus.jsonl(...)` | `output_path` | merged training/export JSONL |
+
 ## Contributing
 - Run `pytest tests/test_pipeline_smoke.py` for a fast end-to-end check.
 - Regenerate the lightweight corpus via `generate_pdfs.py` and commit the updated PDFs + manifest together.
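The stage order in the Pipeline map table reads naturally as a driver script. A minimal sketch (the paths and export file name are illustrative, and it assumes the `glossapi` package is installed):

```python
from pathlib import Path


def run_pipeline(input_dir: str, output_dir: str) -> None:
    """Run the documented stages in order; all paths here are illustrative."""
    from glossapi import Corpus  # deferred so this sketch imports without glossapi installed

    corpus = Corpus(input_dir=input_dir, output_dir=output_dir)
    corpus.extract(input_format="pdf")       # Phase-1: markdown + optional Docling JSON
    corpus.clean(threshold=0.10)             # quality gate; flags docs needing OCR
    corpus.ocr(mode="ocr_bad_then_math")     # retry flagged docs, then math enrichment
    corpus.section()                         # sections/sections_for_annotation.parquet
    corpus.annotate(annotation_type="text")  # classified/fully annotated parquet
    corpus.jsonl(output_path=str(Path(output_dir) / "export.jsonl"))
```

Each call maps to one row of the table above; see `docs/api/corpus.md` for the full parameter contracts.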

docs/api/corpus.md

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
# API Reference — `glossapi.Corpus`

The `Corpus` class is the high‑level entrypoint for the pipeline. Below are the most commonly used methods.

Use this page as a compact contract reference. For the stage-by-stage artifact view, see `../pipeline.md`. For the source-level ownership map, see `../code_map.md`.

## Constructor

```python
glossapi.Corpus(
    input_dir: str | Path,
    output_dir: str | Path,
    section_classifier_model_path: str | Path | None = None,
    extraction_model_path: str | Path | None = None,
    metadata_path: str | Path | None = None,
    annotation_mapping: dict[str, str] | None = None,
    downloader_config: dict[str, Any] | None = None,
    log_level: int = logging.INFO,
    verbose: bool = False,
)
```

- `input_dir`: source files (PDF/DOCX/HTML/…)
- `output_dir`: pipeline outputs (markdown, json, sections, …)
- `downloader_config`: defaults for `download()` (e.g., concurrency, cookies)
- Main side effects: creates the standard output folders and lazily initializes the extractor, sectioner, and classifier.

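A hedged construction sketch (the `downloader_config` key below is hypothetical; the constructor only documents that the dict holds `download()` defaults such as concurrency or cookies):

```python
import logging


def build_corpus(input_dir: str, output_dir: str):
    """Construct a Corpus with explicit logging; assumes glossapi is installed."""
    from glossapi import Corpus  # deferred import keeps the sketch importable

    return Corpus(
        input_dir=input_dir,    # source PDFs/DOCX/HTML live here
        output_dir=output_dir,  # markdown/, json/, sections/, ... are created here
        downloader_config={"concurrency": 8},  # hypothetical key name
        log_level=logging.INFO,
        verbose=False,
    )
```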
## extract()

```python
extract(
    input_format: str = 'all',
    num_threads: int | None = None,
    accel_type: str = 'CUDA',  # 'CPU'|'CUDA'|'MPS'|'Auto'
    *,
    force_ocr: bool = False,
    formula_enrichment: bool = False,
    code_enrichment: bool = False,
    filenames: list[str] | None = None,
    skip_existing: bool = True,
    use_gpus: str = 'single',  # 'single'|'multi'
    devices: list[int] | None = None,
    use_cls: bool = False,
    benchmark_mode: bool = False,
    export_doc_json: bool = True,
    emit_formula_index: bool = False,
) -> None
```

- Purpose: Phase‑1 extraction from source files into markdown plus optional JSON intermediates.
- Typical inputs:
  - files already present in `downloads/`
  - or an explicit `filenames` list
- Important parameters:
  - `phase1_backend='safe'|'docling'|'auto'`: PyPDFium for stability vs Docling for native layout/OCR
  - `force_ocr=True`: turn on OCR during extraction
  - `use_gpus='multi'`: use all visible GPUs through a shared work queue
  - `export_doc_json=True`: write `json/<stem>.docling.json(.zst)`
  - `emit_formula_index=True`: also write `json/<stem>.formula_index.jsonl`
- Main outputs:
  - `markdown/<stem>.md`
  - `json/<stem>.docling.json(.zst)` when enabled
  - `json/metrics/<stem>.metrics.json`
  - `json/metrics/<stem>.per_page.metrics.json`

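For example, a GPU extraction pass that also exports the Phase‑2 inputs could be wrapped like this (a sketch against the signature above; `corpus` is any constructed `Corpus`):

```python
def run_extract(corpus) -> None:
    """Phase-1 pass: multi-GPU, Docling JSON export, formula index for later triage."""
    corpus.extract(
        input_format="pdf",       # restrict to PDFs instead of the 'all' default
        accel_type="CUDA",
        use_gpus="multi",         # shared work queue across all visible GPUs
        skip_existing=True,       # safe to re-run after interruptions
        export_doc_json=True,     # json/<stem>.docling.json(.zst)
        emit_formula_index=True,  # json/<stem>.formula_index.jsonl
    )
```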
## clean()

```python
clean(
    input_dir: str | Path | None = None,
    threshold: float = 0.10,
    num_threads: int | None = None,
    drop_bad: bool = True,
) -> None
```

- Purpose: run the Rust cleaner/noise pipeline and decide which documents are safe for downstream processing.
- Typical inputs:
  - `markdown/*.md`
  - metadata parquet if present
- Important parameters:
  - `threshold`: badness threshold
  - `drop_bad`: whether to remove bad files from downstream selection
  - `empty_char_threshold`, `empty_min_pages`: heuristics for OCR rerun recommendation
- Main outputs:
  - `clean_markdown/<stem>.md`
  - cleaner report parquet
  - updated parquet columns such as `filter`, `needs_ocr`, and metrics fields
- Operational note: this stage is the quality gate that drives `section()` and `ocr()`.

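A sketch of a stricter gate than the defaults (the threshold value is illustrative):

```python
def run_quality_gate(corpus) -> None:
    """Run the Rust cleaner with a stricter badness threshold than the 0.10 default."""
    corpus.clean(
        threshold=0.05,  # illustrative: flag more documents as bad
        drop_bad=True,   # exclude flagged files from section()/ocr() selection
    )
```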
## ocr()

```python
ocr(
    *,
    fix_bad: bool = True,
    mode: str | None = None,
    device: str | None = None,
    model_dir: str | Path | None = None,
    max_pages: int | None = None,
    persist_engine: bool = True,
    limit: int | None = None,
    dpi: int | None = None,
    precision: str | None = None,
    math_enhance: bool = True,
    math_targets: dict[str, list[tuple[int, int]]] | None = None,
    math_batch_size: int = 8,
    math_dpi_base: int = 220,
    use_gpus: str = 'single',
    devices: list[int] | None = None,
    force: bool | None = None,
) -> None
```

- Purpose: selective OCR retry and optional Phase‑2 math/code enrichment.
- Mode selection:
  - `ocr_bad`: rerun OCR only for cleaner-flagged docs
  - `math_only`: run enrichment from existing Docling JSON
  - `ocr_bad_then_math`: OCR flagged docs, then enrich them
- Important parameters:
  - `mode`, `fix_bad`, `math_enhance`
  - `use_gpus`, `devices`
  - `math_targets` to restrict enrichment to specific items
- Main outputs:
  - refreshed `markdown/<stem>.md`
  - refreshed cleaner/parquet metadata after OCR reruns
  - `json/<stem>.latex_map.jsonl` when enrichment runs

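The common "fix flagged documents, then enrich them" flow can be sketched as (the GPU-id handling is an assumption layered on the documented `use_gpus`/`devices` parameters):

```python
def rerun_flagged_ocr(corpus, gpu_ids=None) -> None:
    """OCR the cleaner-flagged docs, then run math enrichment on them."""
    corpus.ocr(
        mode="ocr_bad_then_math",  # OCR flagged docs, then Phase-2 enrichment
        fix_bad=True,
        math_enhance=True,
        use_gpus="multi" if gpu_ids and len(gpu_ids) > 1 else "single",
        devices=gpu_ids,           # e.g. [0, 1]; None lets the pipeline pick
    )
```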
## formula_enrich_from_json()

```python
formula_enrich_from_json(
    files: list[str] | None = None,
    *,
    device: str = 'cuda',
    batch_size: int = 8,
    dpi_base: int = 220,
    targets_by_stem: dict[str, list[tuple[int, int]]] | None = None,
) -> None
```

- Purpose: Phase‑2 GPU enrichment from previously exported Docling JSON.
- Typical inputs:
  - `json/<stem>.docling.json(.zst)`
  - optional formula/code index data
- Important parameters:
  - `files`: restrict to specific stems
  - `device`, `batch_size`, `dpi_base`
  - `targets_by_stem`: target specific `(page_no, item_index)` tuples
- Main outputs:
  - enriched markdown back into `markdown/<stem>.md`
  - `json/<stem>.latex_map.jsonl`

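Restricting enrichment to a couple of known formula items might look like this (the stem and the `(page_no, item_index)` pairs are invented for illustration):

```python
def enrich_two_formulas(corpus) -> None:
    """Phase-2 enrichment for two specific items in one document."""
    corpus.formula_enrich_from_json(
        files=["thesis_01"],  # hypothetical stem
        device="cuda",
        batch_size=8,
        dpi_base=220,
        targets_by_stem={"thesis_01": [(3, 0), (3, 1)]},  # (page_no, item_index)
    )
```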
## section(), annotate()

```python
section() -> None
annotate(annotation_type: str = 'text', fully_annotate: bool = True) -> None
```

- `section()`:
  - purpose: convert markdown into one row per section with structural flags
  - inputs: markdown selected by cleaner/parquet metadata
  - outputs: `sections/sections_for_annotation.parquet`
- `annotate()`:
  - purpose: classify sections and optionally expand them into full document structure
  - important parameters: `annotation_type='text'|'chapter'|'auto'`, `fully_annotate`
  - outputs: `classified_sections.parquet` and `fully_annotated_sections.parquet`

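The two calls are typically run back to back; a minimal sketch:

```python
def build_sections(corpus) -> None:
    """Produce section rows, then classify and fully annotate them."""
    corpus.section()  # writes sections/sections_for_annotation.parquet
    corpus.annotate(
        annotation_type="text",  # or 'chapter' / 'auto'
        fully_annotate=True,     # also emit fully_annotated_sections.parquet
    )
```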
## download()

```python
download(
    input_parquet: str | Path,
    *,
    links_column: str | None = None,
    parallelize_by: str | None = None,
    verbose: bool | None = None,
    **kwargs,
) -> pd.DataFrame
```

- Purpose: fetch source files described in a parquet dataset.
- Typical inputs:
  - an explicit `input_parquet`
  - or the first parquet file found in `input_dir`
- Important parameters:
  - `links_column`: override URL column name
  - `parallelize_by`: choose grouping for the scheduler
  - downloader kwargs via `**kwargs` for concurrency, SSL, cookies, retries, checkpoints, etc.
- Main outputs:
  - downloaded files in `downloads/`
  - partial/final results in `download_results/`
  - returned `pd.DataFrame` with download status and metadata

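A sketch of a download call (the column name and grouping value are hypothetical; the real values come from your parquet schema and the downloader configuration):

```python
def fetch_sources(corpus, parquet_path: str):
    """Fetch files listed in a parquet and return the status DataFrame."""
    return corpus.download(
        parquet_path,
        links_column="pdf_url",   # hypothetical URL column in the parquet
        parallelize_by="domain",  # hypothetical grouping for the scheduler
        verbose=True,
    )
```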
## triage_math()

- Purpose: summarize per-page metrics and recommend Phase‑2 for math-dense docs.
- Inputs: `json/metrics/<stem>.per_page.metrics.json`
- Outputs: updated `download_results` parquet with routing fields such as formula totals and phase recommendation

## Suggested Reading Order

1. `download()` if you start from URLs.
2. `extract()` for Phase‑1 layout/markdown.
3. `clean()` to decide what needs OCR.
4. `ocr()` if you need OCR retry or Phase‑2 enrichment.
5. `section()` and `annotate()` for structured downstream outputs.

---

See also:

- [Code map](../code_map.md)
- [Pipeline overview and artifacts](../pipeline.md)
- [Configuration and environment variables](../configuration.md)
- [OCR and math enrichment details](../ocr_and_math_enhancement.md)

docs/code_map.md

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
# Code Map

This page maps the main documentation ideas to the code that implements them. It is meant to help you move from "what does GlossAPI do?" to "where do I change it?" without reading the entire repo.

## Top-Level Entry Points

| Area | Main code | Responsibility |
| --- | --- | --- |
| Public package entry | `src/glossapi/__init__.py` | Applies the RapidOCR patch on import and exports `Corpus`, `GlossSectionClassifier`, `GlossDownloader`, and related classes. |
| High-level orchestration | `src/glossapi/corpus.py` | Coordinates the end-to-end pipeline and owns the main folder/artifact conventions. |
| Phase-1 extraction engine | `src/glossapi/gloss_extract.py` | Builds/reuses Docling converters, handles safe vs Docling backend selection, batching, timeouts, resumption, and artifact export. |

## Pipeline Stages

| Stage | Main methods/classes | Notes |
| --- | --- | --- |
| Download | `Corpus.download()`, `GlossDownloader.download_files()` | Supports URL expansion, deduplication, checkpoints, per-domain scheduling, and resume. |
| Extract | `Corpus.prime_extractor()`, `Corpus.extract()`, `GlossExtract.ensure_extractor()`, `GlossExtract.extract_path()` | Handles backend choice, GPU preflight, and single- vs multi-GPU dispatch. |
| Clean / quality gate | `Corpus.clean()` | Runs the Rust cleaner and merges quality metrics back into parquet metadata. |
| OCR retry / math follow-up | `Corpus.ocr()`, `Corpus.formula_enrich_from_json()` | Re-runs OCR only for flagged documents and optionally performs Phase-2 math/code enrichment from JSON. |
| Sectioning | `Corpus.section()`, `GlossSection.to_parquet()` | Converts markdown documents into section rows for later classification. |
| Classification / annotation | `Corpus.annotate()`, `GlossSectionClassifier.classify_sections()`, `GlossSectionClassifier.fully_annotate()` | Runs the SVM classifier and post-processes section labels into final document structure. |
| Export / triage | `Corpus.jsonl()`, `Corpus.triage_math()` | Produces training/export JSONL and computes routing hints for math-dense documents. |

## Backend and Runtime Helpers

| File | Responsibility |
| --- | --- |
| `src/glossapi/_pipeline.py` | Canonical builders for layout-only and RapidOCR-backed Docling pipelines. |
| `src/glossapi/rapidocr_safe.py` | Monkey-patch/shim for Docling 2.48.x so problematic OCR crops do not crash whole documents. |
| `src/glossapi/_rapidocr_paths.py` | Resolves packaged RapidOCR ONNX models and Greek keys, with env-var override support. |
| `src/glossapi/ocr_pool.py` | Reuses RapidOCR model instances where possible. |
| `src/glossapi/json_io.py` | Writes and reads compressed Docling JSON artifacts. |
| `src/glossapi/triage.py` | Summarizes per-page formula density and updates parquet routing metadata. |
| `src/glossapi/metrics.py` | Computes per-page parse/OCR/formula metrics from Docling conversions. |

## Rust Extensions

| Crate | Path | Purpose |
| --- | --- | --- |
| Cleaner | `rust/glossapi_rs_cleaner` | Markdown cleaning, script/noise filtering, and report generation used by `Corpus.clean()`. |
| Noise metrics | `rust/glossapi_rs_noise` | Fast quality metrics used by the broader pipeline and package build configuration. |

## Tests To Read First

| Test | Why it matters |
| --- | --- |
| `tests/test_pipeline_smoke.py` | Best high-level example of the intended artifact flow through extract -> clean -> OCR -> section. |
| `tests/test_corpus_guards.py` | Shows the contract around backend selection and GPU preflight. |
| `tests/test_jsonl_export.py` | Shows how final JSONL export merges cleaned markdown, parquet metadata, and math metrics. |
| `tests/test_rapidocr_patch.py` | Covers the Docling/RapidOCR compatibility patch and fallback paths. |

## If You Need To Change...

- Download scheduling or resume behavior: start in `src/glossapi/gloss_downloader.py`.
- Phase-1 parsing, OCR selection, or artifact generation: start in `src/glossapi/corpus.py` and `src/glossapi/gloss_extract.py`.
- Docling/RapidOCR wiring or provider issues: start in `src/glossapi/_pipeline.py`, `src/glossapi/rapidocr_safe.py`, and `src/glossapi/_rapidocr_paths.py`.
- Section labels or section-annotation rules: start in `src/glossapi/gloss_section_classifier.py`.
- Output folder contracts or stage sequencing: start in `src/glossapi/corpus.py`.

docs/index.md

Lines changed: 3 additions & 1 deletion
@@ -8,6 +8,7 @@ Welcome to the refreshed docs for GlossAPI, the GFOSS pipeline for turning acade
 - [Lightweight PDF Corpus](lightweight_corpus.md) — 20 one-page PDFs for smoke testing without Docling or GPUs.

 ## Learn the pipeline
+- [Code Map](code_map.md) links the main documentation ideas to the classes and files that implement them.
 - [Pipeline Overview](pipeline.md) explains each stage and the emitted artifacts.
 - [OCR & Math Enrichment](ocr_and_math_enhancement.md) covers DeepSeek OCR remediation and Docling-based enrichment.
 - [Multi-GPU & Benchmarking](multi_gpu.md) shares scaling and scheduling tips.
@@ -18,4 +19,5 @@ Welcome to the refreshed docs for GlossAPI, the GFOSS pipeline for turning acade
 - [AWS Job Distribution](aws_job_distribution.md) describes large-scale scheduling.

 ## Reference
-- [Corpus API](api_corpus_tmp.md) details public methods and parameters.
+- [Corpus API](api/corpus.md) gives the compact contract view of the main public methods.
+- [Legacy Corpus API Notes](api_corpus_tmp.md) remains available while the docs are being consolidated.
