[MRG] Switch from {pdfminer.six, pypdf} to PAVÉS #589
base: master
Conversation
This does not quite work yet
This reverts commit 89e4fcf.
Wow... this is impressive. Potentially, this library could give a performance boost. I noticed some benchmark files in the repo. I quickly approved the tests, but this needs some serious digging through... haha, it is a huge diff. @vinayak-mehta What is your opinion on this?
Cheers, and thanks for running the CI... I'll go take a look at the failures and try to fix them today (probably some things that just need typing-extensions).

So, while PLAYA itself is considerably faster than pdfminer.six (see https://github.com/dhdaines/benchmarks), the reimplementation of layout analysis in PAVÉS is roughly the same speed, due to the overhead of translating from one API to another, but also because of some "bug-compatibility" adjustments. I think that the other changes in here (not extracting individual pages, applying rotation lazily) could improve performance; I'm going to try this out on some large documents to see.

If you ever want to test on a really big document with a lot of tables in it, planning bylaws are always a good choice, for instance: https://www.laval.ca/reglements-permis/index-reglements/code-urbanisme/
To give you an idea of the raw speed improvement of PLAYA-PDF versus pdfminer.six, on this 3864-page PDF from docling-project/docling#2077, on a rather slow old computer (Core i7-860 @ 2.8GHz), I get:

```
$ time pdf2txt.py ~/pg4500.pdf > /dev/null  # pdfminer.six
real    5m30.733s
user    5m30.487s
sys     0m0.172s
$ time playa --text ~/pg4500.pdf > /dev/null  # PLAYA-PDF, 1 CPU
real    2m8.617s
user    2m8.496s
sys     0m0.104s
$ time playa --text -w 4 ~/pg4500.pdf > /dev/null  # PLAYA-PDF, 4 CPUs
real    0m41.454s
user    2m41.617s
sys     0m0.596s
```

Of course, pypdfium2 is even faster, but it is difficult to get access to PDF internals with it, and you lose a lot of information by just using its text extraction, as good as it is.

For the specific case of layout analysis as used by Camelot, despite the overhead mentioned above, PAVÉS (on a single CPU) is about 10-25% faster than pdfminer.six on large documents, depending on the input document. For instance, on the 1146-page planning bylaw mentioned above:

```
$ python benchmarks/miner.py -n 1 cdu-1-reglement.pdf
PAVÉS (1 CPUs) took 197.10s
pdfminer.six (single) took 263.80s
```

Note that due to various types of overhead (mostly serializing/deserializing all those layout objects), parallel processing does not scale linearly:

```
$ python benchmarks/miner.py -n 2 --no-miner cdu-1-reglement.pdf
PAVÉS (2 CPUs) took 116.53s
$ python benchmarks/miner.py -n 4 --no-miner cdu-1-reglement.pdf
PAVÉS (4 CPUs) took 73.97s
$ python benchmarks/miner.py -n 8 --no-miner cdu-1-reglement.pdf
PAVÉS (8 CPUs) took 66.57s
```

This is using the very simple benchmark code in PAVÉS, which basically just runs layout analysis: https://github.com/dhdaines/paves/blob/main/benchmarks/miner.py

I'll benchmark this MR and post the results below soon :-)
On the planning document mentioned above, with the master branch (using pdfminer.six), I get:

```
Extracted 928 tables, 2194.21 sec / run
```

With this MR using PLAYA/PAVÉS, I get:

```
Extracted 928 tables, 1397.03 sec / run
```

So, a nice 36.3% speedup!
The document above is much too large to use parallel processing (I notice that Camelot's memory consumption is a bit excessive in general, in fact...), so I tested with a different one: https://ville.sainte-adele.qc.ca/upload/documents/Rgl-1314-2021-PIIA-en-vigueur-20240516.pdf

With the master branch (using pdfminer.six) and 8 CPUs:

```
Extracted 37 tables, 34.66 sec / run
```

This branch with 8 CPUs:

```
Extracted 37 tables, 29.07 sec / run
```

So just a 16% speedup on this document :)
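For reference, the speedup percentages quoted in these comments are just the relative reduction in wall-clock time; a one-line helper recomputes them from the reported timings:

```python
def speedup_pct(before_s: float, after_s: float) -> float:
    """Relative wall-clock time reduction, in percent."""
    return 100.0 * (before_s - after_s) / before_s


# 1146-page planning bylaw: pdfminer.six vs. PLAYA/PAVÉS
print(round(speedup_pct(2194.21, 1397.03), 1))  # → 36.3
# Smaller document, 8 CPUs
print(round(speedup_pct(34.66, 29.07), 1))      # → 16.1
```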
Will also fix #620 now, which is the reason I couldn't parallel-parse the 1146-page document above. Now that I can, I get nice speed and constant memory usage (around 200MB resident size per worker) with 8 CPUs:

```
Extracted 928 tables, 343.00 sec / run
```

This is definitely a lot faster than Docling (i.e. RT-DETR for detection + TableFormer for structure prediction) on CPU, and it also (now) doesn't suffer from the same memory leak problems (when you lie down with C++, you wake up with memory leaks).
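A simple way to check a per-worker resident-memory figure like the one quoted above is the standard library's `resource` module (Unix only). This is a generic sketch, not part of Camelot or PAVÉS:

```python
import resource
import sys


def max_rss_mb() -> float:
    """Peak resident set size of the current process, in MiB.

    On Linux ru_maxrss is reported in KiB; on macOS it is in bytes.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return rss / (1024 * 1024)
    return rss / 1024


# Each worker process could log this after finishing its share of
# pages, e.g. to verify roughly constant memory use per worker.
print(f"peak RSS: {max_rss_mb():.1f} MiB")
```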
Keeping Python 3.8 is a nice-to-have; I was thinking about dropping support for it. @vinayak-mehta Can you please have a look at this important PR? I'm about to give this a 👍 @dhdaines Thanks for all your work on this.
Yeah, I think there are a lot of installations of it still out there (Ubuntu 20 and friends), in the same way that Python 3.6 lives on eternally in a billion CentOS 7 installations :-( The remaining issue with 3.8 appears to be something to do with the…
I knew about PyPy but not mypyc; I will definitely take a look at that! The core parser in PLAYA-PDF just uses Python's regex engine, so it probably can't be optimized much more, but there are a bunch of other parts that are strongly typed and would benefit immensely from being written in a lower-level language (or even in JavaScript...)
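To illustrate the kind of strongly-typed code that mypyc compiles well (this is a made-up example, not actual PLAYA code): a fully annotated function with plain Python semantics, where static types let mypyc emit C-level arithmetic instead of dynamic dispatch, while the same file still runs as ordinary Python. `typing.Tuple` is used for Python 3.8 compatibility.

```python
from typing import Tuple

# A 6-tuple (a, b, c, d, e, f) in the PDF affine-matrix convention.
Matrix = Tuple[float, float, float, float, float, float]


def apply_matrix(matrix: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Apply an affine transformation matrix to the point (x, y)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# Translation by (10, 20): (5, 5) maps to (15, 25).
print(apply_matrix((1.0, 0.0, 0.0, 1.0, 10.0, 20.0), 5.0, 5.0))
```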
The Python 3.8 failure happens with 3.8.18 but not with 3.8.20; it appears to be a bug in setuptools. I've added a base dependency on…
This PR does quite a few things! If you think it's too intrusive, I have a ~~bridge to sell you~~ different one which simply replaces `pdfminer.six` with PAVÉS. But first, read on:

- Removes `pypdf`, since it seems a bit excessive to have three different PDF parsers in the same ~~trenchcoat~~ library (pypdfium2, pdfminer.six, and pypdf), but this means that:
- `pdftopng` doesn't actually allow you to extract individual pages, and in fact always seems to extract either the first or the last page, so replace it with `pdftocairo` from the poppler-tools (you may wish to not do this)
- Removes `StrByteType`, since it's not possible for the paths constructed in `download_url` to ever be `bytes` (paths can be `bytes`... but these ones never will be, as they're being constructed from a `str`)

Why would you want to do all this?
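To make the `pdftopng` → `pdftocairo` point concrete: `pdftocairo` can render exactly one page via its `-f`/`-l` (first/last page) options, which is the capability `pdftopng` lacks. A hedged sketch of building such an invocation (the exact command used in this PR may differ; `pdftocairo_cmd` is a made-up helper name):

```python
from typing import List


def pdftocairo_cmd(pdf_path: str, page: int, out_prefix: str) -> List[str]:
    """Build a pdftocairo command that renders a single page to PNG.

    -f/-l restrict rendering to one page; pass the result to
    subprocess.run() to execute (requires poppler to be installed).
    """
    return [
        "pdftocairo", "-png",
        "-f", str(page), "-l", str(page),
        pdf_path, out_prefix,
    ]


cmd = pdftocairo_cmd("input.pdf", 3, "page3")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment with poppler installed
```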
This should actually be ready to merge, but I'm not super sure about that [MRG] tag, since it touches a lot of code.