Commit f8d53ee

feat: add perf tools (#165)
Signed-off-by: Peter Staar <[email protected]>
1 parent c17dd3a commit f8d53ee

16 files changed (+793, -46 lines)

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ build
 dist
 extlib_*/
 scratch_*
+perf/results/**
 
 # Created by https://www.toptal.com/developers/gitignore/api/python,macos,emacs,cmake,virtualenv
 # Edit at https://www.toptal.com/developers/gitignore?templates=python,macos,emacs,cmake,virtualenv

README.md

Lines changed: 2 additions & 1 deletion
@@ -191,7 +191,8 @@ uv sync
 The latter will only work after a clean `git clone`. If you are developing and updating C++ code, please use,
 
 ```sh
-uv pip install --force-reinstall --no-deps -e .
+# uv pip install --force-reinstall --no-deps -e .
+rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"
 ```
 
 To test the package, run:

docling_parse/pdf_parser.py

Lines changed: 4 additions & 4 deletions
@@ -446,8 +446,8 @@ def _to_segmented_page(
                 "`words` will be created for segmented_page in an inefficient way!"
             )
             self._create_word_cells(segmented_page, enforce_same_font=enforce_same_font)
-        else:
-            logging.warning("No `words` will be created for segmented_page")
+        # else:
+        #     logging.warning("No `words` will be created for segmented_page")
 
         if create_textlines and ("line_cells" in page):
             segmented_page.textline_cells = self._to_cells(page["line_cells"])
@@ -459,8 +459,8 @@ def _to_segmented_page(
             self._create_textline_cells(
                 segmented_page, enforce_same_font=enforce_same_font
             )
-        else:
-            logging.warning("No `text_lines` will be created for segmented_page")
+        # else:
+        #     logging.warning("No `text_lines` will be created for segmented_page")
 
         return segmented_page
 

perf/README.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+Perf tools for page-level parsing benchmarking.
+
+Usage
+- Install extras for optional parsers (not part of main package):
+  - pip: `pip install .[perf-tools]`
+  - uv (already configured): `uv sync --group perf-test`
+- Run on a file or directory:
+  - `python perf/run_perf.py ./docs/sample.pdf`
+  - `python perf/run_perf.py ./dataset --recursive -p pdfplumber`
+
+CLI
+- `input`: PDF file or directory of PDFs.
+- `--parser|-p`: one of `docling` (default), `pdfplumber`, `pypdfium2` (alias: `pypdfium`), `pymupdf`.
+- `--recursive|-r`: recurse when input is a directory.
+- `--output|-o`: output CSV path (default under `perf/results`).
+
+CSV columns
+- `filename,page_number,elapsed_sec,success,error`
+
+Statistics
+- Prints totals, avg sec/page, min/max, and percentiles (p50/p90/p95/p99) after the run.
+
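
The statistics listed in perf/README.md can be recomputed from a results CSV with a few lines of Python. The following is a minimal sketch, not the code in `perf/run_perf.py` (which is not shown in this view). It assumes the column names from the README, that `success` is serialized as `True`/`False`, and that failed pages are excluded from the timing statistics. The helper names `percentile` and `summarize` and the linear-interpolation percentile are illustrative choices, not taken from the commit.

```python
import csv
import io


def percentile(sorted_vals, q):
    """Return the q-th percentile (0-100) of a pre-sorted list, linearly interpolated."""
    if not sorted_vals:
        raise ValueError("percentile() needs at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summarize(csv_text):
    """Summarize per-page timings from CSV text with the README's columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Assumption: `success` is written as "True"/"False" (or "1"/"0").
    # Assumption: only successful pages contribute to the timing statistics.
    timings = sorted(
        float(row["elapsed_sec"])
        for row in rows
        if row["success"].strip().lower() in ("true", "1")
    )
    summary = {
        "pages": len(rows),
        "succeeded": len(timings),
        "failed": len(rows) - len(timings),
    }
    if timings:
        summary["total_sec"] = sum(timings)
        summary["avg_sec_per_page"] = sum(timings) / len(timings)
        summary["min_sec"] = timings[0]
        summary["max_sec"] = timings[-1]
        for q in (50, 90, 95, 99):
            summary[f"p{q}"] = percentile(timings, q)
    return summary
```

To use it, read a CSV produced by the tool into a string and pass it to `summarize`. The returned dictionary can then be compared across parsers.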
