Conversation

@dhdaines commented Aug 2, 2025

This PR does quite a few things! If you think it's too intrusive, I have a different one which simply replaces pdfminer.six with PAVÉS. But first, read on:

  • Remove the dependency on pypdf, since it seems a bit excessive to have three different PDF parsers in the same library (pypdfium2, pdfminer.six, and pypdf). This means that:
  • Instead of extracting every single page to a temporary file, the image conversion backends have been updated to convert one page at a time (this is, however, still somewhat inefficient)
  • Because pdftopng doesn't actually allow you to extract individual pages, and in fact always seems to extract either the first or the last page, replace it with pdftocairo from poppler-utils (you may wish not to do this)
  • Because we no longer get to rotate the pages when we extract them, update all of the code to apply rotation explicitly: first to the CTM when doing layout extraction, then to the image when doing lattice parsing and plotting
  • Remove the use of StrByteType, since it's not possible for the paths constructed in download_url ever to be bytes (paths can be bytes in general, but these ones never will be, as they're constructed from a str)
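For reference, single-page rasterization with pdftocairo can be driven from Python roughly as follows. This is only an illustrative sketch (the helper names are made up, not code from this PR); `-f`/`-l` select the page and `-singlefile` suppresses the page-number suffix on the output file:

```python
import subprocess
from typing import List


def pdftocairo_cmd(pdf_path: str, page: int, out_prefix: str, dpi: int = 300) -> List[str]:
    """Build a pdftocairo command line that renders exactly one page to PNG."""
    return [
        "pdftocairo", "-png",
        "-r", str(dpi),
        "-f", str(page), "-l", str(page),  # first and last page: just this one
        "-singlefile",  # write out_prefix.png instead of out_prefix-N.png
        pdf_path, out_prefix,
    ]


def page_to_png(pdf_path: str, page: int, out_prefix: str, dpi: int = 300) -> str:
    """Render one page and return the path of the PNG that pdftocairo wrote."""
    subprocess.run(pdftocairo_cmd(pdf_path, page, out_prefix, dpi), check=True)
    return out_prefix + ".png"
```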

Why would you want to do all this?

  • pdfminer.six is great, but it isn't very robust
  • pypdf is really nice, but as mentioned above there's no reason to have three different PDF parsers
  • it reduces code complexity

This should actually be ready to merge, but I'm not super sure about that [MRG] tag since it touches a lot of code.

@bosd (Collaborator) commented Aug 10, 2025

Wow, this is impressive!
Recently I became aware that pdfminer.six is the Achilles' heel of this library.
I was looking into whether this library and pdfminer.six could be compiled with mypyc.

Potentially, that could give this library a performance boost. I noticed some benchmark files in the repo.
Is there some data available?

I quickly approved the tests, but this needs some serious digging through... haha, it is a huge diff.

@vinayak-mehta What is your opinion on this?

@bosd requested a review from vinayak-mehta Aug 10, 2025
@bosd added the enhancement, good first issue, dependencies, performance, and python labels Aug 10, 2025
@bosd added the refactoring label Aug 10, 2025
@dhdaines (Author) commented

Cheers, and thanks for running the CI... I'll go take a look at the failures and try to fix them today (probably some things that just need typing-extensions).

So, while PLAYA itself is considerably faster than pdfminer.six (see https://github.com/dhdaines/benchmarks), the reimplementation of layout analysis in PAVÉS is roughly the same speed due to the overhead of translating from one API to another but also because of some "bug-compatibility" adjustments.

I think that the other changes in here (not extracting individual pages, applying rotation lazily) could improve performance; I'm going to try them out on some large documents to see.

If you ever want to test on a really big document with a lot of tables in it, planning bylaws are always a good choice, for instance: https://www.laval.ca/reglements-permis/index-reglements/code-urbanisme/

@dhdaines (Author) commented

To give you an idea of the raw speed improvement of PLAYA-PDF versus pdfminer.six, on this 3864-page PDF from docling-project/docling#2077 on a rather slow old computer (Core i7-860 @2.8GHz), I get:

$ time pdf2txt.py ~/pg4500.pdf  > /dev/null  # pdfminer.six
real    5m30.733s
user    5m30.487s
sys     0m0.172s

$ time playa --text ~/pg4500.pdf  > /dev/null  # PLAYA-PDF, 1 CPU
real    2m8.617s
user    2m8.496s
sys     0m0.104s

$ time playa --text -w 4 ~/pg4500.pdf  > /dev/null  # PLAYA-PDF, 4 CPUs
real    0m41.454s
user    2m41.617s
sys     0m0.596s
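The `real` times above work out to the following speedups (simple ratios, computed here only for convenience):

```python
# Wall-clock ("real") times from the transcript above, in seconds
timings = {
    "pdfminer.six": 5 * 60 + 30.733,  # 330.733 s
    "playa-1cpu": 2 * 60 + 8.617,     # 128.617 s
    "playa-4cpu": 41.454,
}

print(round(timings["pdfminer.six"] / timings["playa-1cpu"], 2))  # → 2.57x, single CPU
print(round(timings["pdfminer.six"] / timings["playa-4cpu"], 2))  # → 7.98x, 4 CPUs
print(round(timings["playa-1cpu"] / timings["playa-4cpu"], 2))    # → 3.1x from parallelism alone
```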

Of course, pypdfium2 is even faster, but it is difficult to get access to PDF internals with it, and you lose a lot of information by just using its text extraction, as good as it is.

For the specific case of layout analysis as used by Camelot, despite the overhead mentioned above, for large documents, PAVÉS (on a single CPU) is about 10-25% faster than pdfminer.six, depending on the input document. For instance on the 1146-page planning bylaw mentioned above:

$ python benchmarks/miner.py -n 1 cdu-1-reglement.pdf 
PAVÉS (1 CPUs) took 197.10s
pdfminer.six (single) took 263.80s

Note that due to various kinds of overhead (mostly serializing/deserializing all those LTComponent objects), adding more CPUs doesn't scale linearly (the computer in question has 4 cores / 8 threads):

$ python benchmarks/miner.py -n 2 --no-miner cdu-1-reglement.pdf 
PAVÉS (2 CPUs) took 116.53s
$ python benchmarks/miner.py -n 4 --no-miner cdu-1-reglement.pdf 
PAVÉS (4 CPUs) took 73.97s
$ python benchmarks/miner.py -n 8 --no-miner cdu-1-reglement.pdf 
PAVÉS (8 CPUs) took 66.57s

This is using the very simple benchmark code in PAVÉS which basically just runs layout analysis: https://github.com/dhdaines/paves/blob/main/benchmarks/miner.py
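The sub-linear scaling is easy to quantify from the timings above; speedup and parallel efficiency relative to the 1-CPU run work out to:

```python
base = 197.10  # 1-CPU PAVÉS time on the 1146-page document, in seconds
runs = {2: 116.53, 4: 73.97, 8: 66.57}  # n_workers -> wall-clock time

for n, t in runs.items():
    speedup = base / t
    # e.g. 8 CPUs: 2.96x speedup, 37% parallel efficiency
    print(f"{n} CPUs: {speedup:.2f}x speedup, {speedup / n:.0%} parallel efficiency")
```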

I'll benchmark this MR and post the results below soon :-)

@dhdaines (Author) commented Aug 16, 2025

On the planning document mentioned above, with the lattice parser, I get this with pdfminer.six and the current master branch:

Extracted 928 tables, 2194.21 sec / run

With this MR using PLAYA/PAVÉS, I get:

Extracted 928 tables, 1397.03 sec / run

So, a nice 36.3% speedup!
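(The percentage is just the reduction in wall-clock time per run:)

```python
before, after = 2194.21, 1397.03  # seconds per run: master vs. this branch
speedup_pct = (before - after) / before * 100
print(f"{speedup_pct:.1f}% faster")  # → 36.3% faster
```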

@dhdaines (Author) commented

The document above is much too large to use parallel processing on (I notice that Camelot's memory consumption is a bit excessive in general, in fact...), so I tested with a different one: https://ville.sainte-adele.qc.ca/upload/documents/Rgl-1314-2021-PIIA-en-vigueur-20240516.pdf

With the master branch (using pdfminer.six) and 8 CPUs: Extracted 37 tables, 34.66 sec / run

This branch with 8 CPUs: Extracted 37 tables, 29.07 sec / run

So just a 16% speedup on this document :)

@dhdaines (Author) commented Aug 17, 2025

This will also fix #620, which is the reason I couldn't parse the 1146-page document above in parallel. Now that I can, I get nice speed and constant memory usage (around 200MB resident size per worker) with 8 CPUs:

Extracted 928 tables, 343.00 sec / run

This is definitely a lot faster than Docling (i.e. RT-DETR for detection + TableFormer for structure prediction) on CPU, and it also (now) doesn't suffer from the same memory-leak problems (when you lie down with C++, you wake up with memory leaks).
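For illustration only (this is not the code from this MR), page-parallel extraction with per-worker memory isolation can be sketched from the outside using Camelot's public `camelot.read_pdf(path, pages=..., flavor="lattice")` entry point; the chunking helper and function names here are hypothetical:

```python
from concurrent.futures import ProcessPoolExecutor
from typing import List, Tuple


def page_ranges(n_pages: int, n_workers: int) -> List[str]:
    """Split pages 1..n_pages into contiguous "first-last" strings, one chunk per worker."""
    step = -(-n_pages // n_workers)  # ceiling division
    return [
        f"{start}-{min(start + step - 1, n_pages)}"
        for start in range(1, n_pages + 1, step)
    ]


def extract_chunk(args: Tuple[str, str]) -> int:
    """Worker: run the lattice parser on one page range and return the table count."""
    path, pages = args
    import camelot  # imported in the worker so each process keeps its own footprint

    return len(camelot.read_pdf(path, pages=pages, flavor="lattice"))


def extract_parallel(path: str, n_pages: int, n_workers: int = 8) -> int:
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        chunks = [(path, r) for r in page_ranges(n_pages, n_workers)]
        return sum(pool.map(extract_chunk, chunks))
```

Since each chunk lives in its own process, its memory is returned to the OS when the worker finishes, which keeps the resident size per worker roughly constant.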

@bosd (Collaborator) commented Aug 19, 2025

Keeping Python 3.8 support is a nice-to-have. I was thinking about dropping it.

@vinayak-mehta Can you please have a look at this important PR?

I'm about to give this a 👍.
Some administrative commits could be squashed first, though.

@dhdaines Thanks for all your work on this.
Earlier I was investigating speeding up pdfminer.six a bit, but compiling it seemed like a huge task.
Have you experimented with / looked at mypyc?
For a lot of projects it could really speed up the Python part.

@dhdaines (Author) commented Aug 19, 2025

> Keeping Python 3.8 support is a nice-to-have. I was thinking about dropping it.

Yeah, I think there are a lot of installations of it still out there (Ubuntu 20 and friends) in the same way that Python 3.6 lives on eternally in a billion CentOS 7 installations :-(

The remaining issue with 3.8 appears to be something to do with the pyproject.toml file? Not quite sure what's going on...

> @dhdaines Thanks for all your work on this. Earlier I was investigating speeding up pdfminer.six a bit, but compiling it seemed like a huge task. Have you experimented with / looked at mypyc? For a lot of projects it could really speed up the Python part.

I knew about PyPy but not mypyc; I will definitely take a look at that! The core parser in PLAYA-PDF just uses Python's regex engine, so it probably can't be optimized much more, but there are a bunch of other parts that are strongly typed and would benefit immensely from being written in a lower-level language (or even in JavaScript...).

@dhdaines (Author) commented

The Python 3.8 failure happens with 3.8.18 but not with 3.8.20; it appears to be a bug in setuptools. I've added a base dependency on setuptools>=75 (the last version that works with 3.8) to the noxfile; hopefully that will fix it...
