# API Reference — `glossapi.Corpus`

The `Corpus` class is the high‑level entrypoint for the pipeline. Below are the most commonly used methods.

Use this page as a compact contract reference. For the stage-by-stage artifact view, see `../pipeline.md`. For the source-level ownership map, see `../code_map.md`.

## Constructor

```python
glossapi.Corpus(
    input_dir: str | Path,
    output_dir: str | Path,
    section_classifier_model_path: str | Path | None = None,
    extraction_model_path: str | Path | None = None,
    metadata_path: str | Path | None = None,
    annotation_mapping: dict[str, str] | None = None,
    downloader_config: dict[str, Any] | None = None,
    log_level: int = logging.INFO,
    verbose: bool = False,
)
```

- `input_dir`: source files (PDF/DOCX/HTML/…)
- `output_dir`: pipeline outputs (markdown, json, sections, …)
- `downloader_config`: defaults for `download()` (e.g., concurrency, cookies)
- Main side effects: creates the standard output folders and lazy-initializes the extractor, sectioner, and classifier.

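A minimal instantiation can be sketched with plain keyword arguments; only `input_dir` and `output_dir` are required, and the `downloader_config` keys shown here (`concurrency`, `cookies`) are illustrative assumptions, not a confirmed schema:

```python
from pathlib import Path

# Hypothetical Corpus(...) keyword arguments; everything beyond the two
# directories falls back to the defaults listed in the signature above.
corpus_kwargs = {
    "input_dir": Path("data/raw"),
    "output_dir": Path("data/out"),
    # Illustrative section-label mapping; real label names depend on your model.
    "annotation_mapping": {"introduction": "text", "bibliography": "other"},
    "downloader_config": {"concurrency": 8, "cookies": None},  # assumed keys
}

# corpus = glossapi.Corpus(**corpus_kwargs)  # requires glossapi to be installed
```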
## extract()

```python
extract(
    input_format: str = 'all',
    num_threads: int | None = None,
    accel_type: str = 'CUDA',  # 'CPU'|'CUDA'|'MPS'|'Auto'
    *,
    force_ocr: bool = False,
    formula_enrichment: bool = False,
    code_enrichment: bool = False,
    filenames: list[str] | None = None,
    skip_existing: bool = True,
    use_gpus: str = 'single',  # 'single'|'multi'
    devices: list[int] | None = None,
    use_cls: bool = False,
    benchmark_mode: bool = False,
    export_doc_json: bool = True,
    emit_formula_index: bool = False,
) -> None
```

- Purpose: Phase‑1 extraction from source files into markdown plus optional JSON intermediates.
- Typical inputs:
  - files already present in `downloads/`
  - or an explicit `filenames` list
- Important parameters:
  - `phase1_backend='safe'|'docling'|'auto'`: PyPDFium for stability vs Docling for native layout/OCR
  - `force_ocr=True`: turn on OCR during extraction
  - `use_gpus='multi'`: use all visible GPUs through a shared work queue
  - `export_doc_json=True`: write `json/<stem>.docling.json(.zst)`
  - `emit_formula_index=True`: also write `json/<stem>.formula_index.jsonl`
- Main outputs:
  - `markdown/<stem>.md`
  - `json/<stem>.docling.json(.zst)` when enabled
  - `json/metrics/<stem>.metrics.json`
  - `json/metrics/<stem>.per_page.metrics.json`

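The `skip_existing` behavior amounts to a set difference between source stems and already-extracted markdown; a simplified sketch (the real implementation may differ in details):

```python
from pathlib import Path

def pending_stems(download_dir: Path, markdown_dir: Path) -> list[str]:
    """Return source stems that have no markdown/<stem>.md yet."""
    done = {p.stem for p in markdown_dir.glob("*.md")}
    return sorted(p.stem for p in download_dir.iterdir() if p.stem not in done)

# Example with a throwaway on-disk layout:
import tempfile
tmp = Path(tempfile.mkdtemp())
(tmp / "downloads").mkdir(); (tmp / "markdown").mkdir()
(tmp / "downloads" / "a.pdf").touch(); (tmp / "downloads" / "b.pdf").touch()
(tmp / "markdown" / "a.md").touch()
print(pending_stems(tmp / "downloads", tmp / "markdown"))  # ['b']
```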
## clean()

```python
clean(
    input_dir: str | Path | None = None,
    threshold: float = 0.10,
    num_threads: int | None = None,
    drop_bad: bool = True,
) -> None
```

- Purpose: run the Rust cleaner/noise pipeline and decide which documents are safe for downstream processing.
- Typical inputs:
  - `markdown/*.md`
  - metadata parquet if present
- Important parameters:
  - `threshold`: badness score above which a document is flagged bad
  - `drop_bad`: whether to remove bad files from downstream selection
  - `empty_char_threshold`, `empty_min_pages`: heuristics for OCR rerun recommendation
- Main outputs:
  - `clean_markdown/<stem>.md`
  - cleaner report parquet
  - updated parquet columns such as `filter`, `needs_ocr`, and metrics fields
- Operational note: this stage is the quality gate that drives `section()` and `ocr()`.

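The quality gate reduces to a threshold comparison. This sketch assumes the badness score is a 0–1 ratio (the actual metric name and scale live on the Rust side) and mirrors the effect of `drop_bad` on the parquet `filter` column:

```python
def gate(badness: float, threshold: float = 0.10, drop_bad: bool = True) -> str:
    """Illustrative filter decision: 'good' flows downstream; 'bad' is
    withheld only when drop_bad is set."""
    if badness <= threshold:
        return "good"
    return "bad" if drop_bad else "good"

print(gate(0.03))                    # good
print(gate(0.25))                    # bad
print(gate(0.25, drop_bad=False))    # good
```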
## ocr()

```python
ocr(
    *,
    fix_bad: bool = True,
    mode: str | None = None,
    device: str | None = None,
    model_dir: str | Path | None = None,
    max_pages: int | None = None,
    persist_engine: bool = True,
    limit: int | None = None,
    dpi: int | None = None,
    precision: str | None = None,
    math_enhance: bool = True,
    math_targets: dict[str, list[tuple[int, int]]] | None = None,
    math_batch_size: int = 8,
    math_dpi_base: int = 220,
    use_gpus: str = 'single',
    devices: list[int] | None = None,
    force: bool | None = None,
) -> None
```

- Purpose: selective OCR retry and optional Phase‑2 math/code enrichment.
- Mode selection:
  - `ocr_bad`: rerun OCR only for cleaner-flagged docs
  - `math_only`: run enrichment from existing Docling JSON
  - `ocr_bad_then_math`: OCR flagged docs, then enrich them
- Important parameters:
  - `mode`, `fix_bad`, `math_enhance`
  - `use_gpus`, `devices`
  - `math_targets` to restrict enrichment to specific items
- Main outputs:
  - refreshed `markdown/<stem>.md`
  - refreshed cleaner/parquet metadata after OCR reruns
  - `json/<stem>.latex_map.jsonl` when enrichment runs

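`math_targets` maps document stems to `(page_no, item_index)` pairs, so building one by hand is straightforward; the stems and indices below are made up for illustration:

```python
# Restrict enrichment to two formulas on page 3 of "paper1" and one item
# on page 7 of "paper2"; all (page_no, item_index) pairs are illustrative.
math_targets: dict[str, list[tuple[int, int]]] = {
    "paper1": [(3, 0), (3, 1)],
    "paper2": [(7, 2)],
}

total = sum(len(pairs) for pairs in math_targets.values())
print(total)  # 3

# corpus.ocr(mode="math_only", math_targets=math_targets)  # hypothetical call
```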
## formula_enrich_from_json()

```python
formula_enrich_from_json(
    files: list[str] | None = None,
    *,
    device: str = 'cuda',
    batch_size: int = 8,
    dpi_base: int = 220,
    targets_by_stem: dict[str, list[tuple[int, int]]] | None = None,
) -> None
```

- Purpose: Phase‑2 GPU enrichment from previously exported Docling JSON.
- Typical inputs:
  - `json/<stem>.docling.json(.zst)`
  - optional formula/code index data
- Important parameters:
  - `files`: restrict to specific stems
  - `device`, `batch_size`, `dpi_base`
  - `targets_by_stem`: target specific `(page_no, item_index)` tuples
- Main outputs:
  - enriched markdown back into `markdown/<stem>.md`
  - `json/<stem>.latex_map.jsonl`

## section(), annotate()

```python
section() -> None
annotate(annotation_type: str = 'text', fully_annotate: bool = True) -> None
```

- `section()`:
  - purpose: convert markdown into one row per section with structural flags
  - inputs: markdown selected by cleaner/parquet metadata
  - outputs: `sections/sections_for_annotation.parquet`
- `annotate()`:
  - purpose: classify sections and optionally expand them into full document structure
  - important parameters: `annotation_type='text'|'chapter'|'auto'`, `fully_annotate`
  - outputs: `classified_sections.parquet` and `fully_annotated_sections.parquet`

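The one-row-per-section idea can be sketched as splitting markdown on headings; this is a simplification of what `section()` actually does, and the `has_table` flag is just an example of the kind of structural metadata kept per row:

```python
import re

def split_sections(markdown: str) -> list[dict]:
    """Illustrative: one dict per heading-delimited section."""
    rows = []
    for chunk in re.split(r"(?m)^(?=#)", markdown):
        if not chunk.strip():
            continue
        header, _, body = chunk.partition("\n")
        rows.append({
            "header": header.lstrip("#").strip(),
            "text": body.strip(),
            "has_table": "|" in body,  # crude structural flag, for illustration
        })
    return rows

doc = "# Intro\nHello.\n# Data\n| a | b |\n"
print([r["header"] for r in split_sections(doc)])  # ['Intro', 'Data']
```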
## download()

```python
download(
    input_parquet: str | Path,
    *,
    links_column: str | None = None,
    parallelize_by: str | None = None,
    verbose: bool | None = None,
    **kwargs,
) -> pd.DataFrame
```

- Purpose: fetch source files described in a parquet dataset.
- Typical inputs:
  - an explicit `input_parquet`
  - or the first parquet file found in `input_dir`
- Important parameters:
  - `links_column`: override URL column name
  - `parallelize_by`: choose grouping for the scheduler
  - downloader kwargs via `**kwargs` for concurrency, SSL, cookies, retries, checkpoints, etc.
- Main outputs:
  - downloaded files in `downloads/`
  - partial/final results in `download_results/`
  - returned `pd.DataFrame` with download status and metadata

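One plausible reading of `parallelize_by` is grouping URLs so each worker stays on a single host; the grouping itself can be sketched independently of the downloader (the scheduler's actual grouping keys are not confirmed here):

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_host(urls: list[str]) -> dict[str, list[str]]:
    """Bucket URLs by netloc so per-host rate limits are easy to apply."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return dict(groups)

urls = [
    "https://a.example/x.pdf",
    "https://b.example/y.pdf",
    "https://a.example/z.pdf",
]
print(sorted(group_by_host(urls)))  # ['a.example', 'b.example']
```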
## triage_math()

- Purpose: summarize per-page metrics and recommend Phase‑2 for math-dense docs.
- Inputs: `json/metrics/<stem>.per_page.metrics.json`
- Outputs: updated `download_results` parquet with routing fields such as formula totals and phase recommendation

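The triage amounts to aggregating per-page formula counts and comparing against a density cutoff; both the metrics field name (`formula_count`) and the cutoff used here are assumptions for illustration:

```python
def recommend_phase2(per_page: list[dict], min_total: int = 10) -> dict:
    """Illustrative routing decision from per-page metrics records."""
    total = sum(page.get("formula_count", 0) for page in per_page)
    return {"formula_total": total, "phase2_recommended": total >= min_total}

pages = [{"formula_count": 4}, {"formula_count": 7}, {}]  # made-up metrics
print(recommend_phase2(pages))  # {'formula_total': 11, 'phase2_recommended': True}
```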
## Suggested Reading Order

1. `download()` if you start from URLs.
2. `extract()` for Phase‑1 layout/markdown.
3. `clean()` to decide what needs OCR.
4. `ocr()` if you need OCR retry or Phase‑2 enrichment.
5. `section()` and `annotate()` for structured downstream outputs.

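Put together, the reading order above corresponds to a call sequence like this sketch; the paths and links-parquet name are illustrative, and actually running it requires `glossapi` to be installed:

```python
def run_pipeline(input_dir: str, output_dir: str) -> None:
    """End-to-end sketch following the suggested reading order."""
    from glossapi import Corpus  # assumed import path

    corpus = Corpus(input_dir, output_dir)
    corpus.download(f"{input_dir}/links.parquet")  # 1. fetch sources
    corpus.extract()                               # 2. Phase-1 markdown
    corpus.clean()                                 # 3. quality gate
    corpus.ocr(mode="ocr_bad_then_math")           # 4. OCR retry + enrichment
    corpus.section()                               # 5a. one row per section
    corpus.annotate()                              # 5b. section classification
```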
---

See also:
- Code map: `../code_map.md`
- Pipeline overview and artifacts: `../pipeline.md`
- Configuration and environment variables: `../configuration.md`
- OCR and math enrichment details: `../ocr_and_math_enhancement.md`