[MRG] Switch from {pdfminer.six, pypdf} to PAVÉS #589
base: master
Conversation
This does not quite work yet
This reverts commit 89e4fcf.
Wow... this is impressive. Potentially, this library could give a performance boost. I noticed some benchmark files in the repo. I quickly approved the tests, but this needs some serious digging through... haha, it is a huge diff. @vinayak-mehta What is your opinion on this?
Cheers, and thanks for running the CI... I'll go take a look at the failures and try to fix them today (probably some things that just need typing-extensions).

So, while PLAYA itself is considerably faster than pdfminer.six (see https://github.com/dhdaines/benchmarks), the reimplementation of layout analysis in PAVÉS is roughly the same speed, due to the overhead of translating from one API to another, but also because of some "bug-compatibility" adjustments. I think that the other changes in here (not extracting individual pages, applying rotation lazily) could improve performance; I'm going to try this out on some large documents to see.

If you ever want to test on a really big document with a lot of tables in it, planning bylaws are always a good choice, for instance: https://www.laval.ca/reglements-permis/index-reglements/code-urbanisme/
To give you an idea of the raw speed improvement of PLAYA-PDF versus pdfminer.six, on this 3864-page PDF from docling-project/docling#2077, on a rather slow old computer (Core i7-860 @ 2.8GHz), I get:

```
$ time pdf2txt.py ~/pg4500.pdf > /dev/null  # pdfminer.six
real    5m30.733s
user    5m30.487s
sys     0m0.172s
$ time playa --text ~/pg4500.pdf > /dev/null  # PLAYA-PDF, 1 CPU
real    2m8.617s
user    2m8.496s
sys     0m0.104s
$ time playa --text -w 4 ~/pg4500.pdf > /dev/null  # PLAYA-PDF, 4 CPUs
real    0m41.454s
user    2m41.617s
sys     0m0.596s
```

Of course, pypdfium2 is even faster, but it is difficult to get access to PDF internals with it, and you lose a lot of information by just using its text extraction, as good as it is.

For the specific case of layout analysis as used by Camelot, despite the overhead mentioned above, PAVÉS (on a single CPU) is about 10-25% faster than pdfminer.six on large documents, depending on the input document. For instance, on the 1146-page planning bylaw mentioned above:

```
$ python benchmarks/miner.py -n 1 cdu-1-reglement.pdf
PAVÉS (1 CPUs) took 197.10s
pdfminer.six (single) took 263.80s
```

Note that due to various types of overhead (mostly serializing/deserializing all those layout objects), parallel processing does not scale linearly:

```
$ python benchmarks/miner.py -n 2 --no-miner cdu-1-reglement.pdf
PAVÉS (2 CPUs) took 116.53s
$ python benchmarks/miner.py -n 4 --no-miner cdu-1-reglement.pdf
PAVÉS (4 CPUs) took 73.97s
$ python benchmarks/miner.py -n 8 --no-miner cdu-1-reglement.pdf
PAVÉS (8 CPUs) took 66.57s
```

This is using the very simple benchmark code in PAVÉS, which basically just runs layout analysis: https://github.com/dhdaines/paves/blob/main/benchmarks/miner.py

I'll benchmark this MR and post the results below soon :-)
On the planning document mentioned above, with the master branch (using pdfminer.six), I get:

```
Extracted 928 tables, 2194.21 sec / run
```

With this MR using PLAYA/PAVÉS, I get:

```
Extracted 928 tables, 1397.03 sec / run
```

So, a nice 36.3% speedup!
The document above is much too large to use parallel processing (I notice that Camelot's memory consumption is a bit excessive in general, in fact...), so I tested with a different one: https://ville.sainte-adele.qc.ca/upload/documents/Rgl-1314-2021-PIIA-en-vigueur-20240516.pdf

With the master branch (using pdfminer.six) and 8 CPUs:

```
Extracted 37 tables, 34.66 sec / run
```

This branch with 8 CPUs:

```
Extracted 37 tables, 29.07 sec / run
```

So just a 16% speedup on this document :)
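For reference, the speedup percentages quoted in these comments are just the relative reduction in wall-clock time; a one-line helper recomputes them from the reported timings:

```python
def speedup_pct(before_s: float, after_s: float) -> float:
    """Relative wall-clock time reduction, in percent."""
    return 100.0 * (before_s - after_s) / before_s


# 1146-page planning bylaw: pdfminer.six vs. PLAYA/PAVÉS
print(round(speedup_pct(2194.21, 1397.03), 1))  # → 36.3
# Smaller document, 8 CPUs
print(round(speedup_pct(34.66, 29.07), 1))      # → 16.1
```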
Will also fix #620 now, which is the reason I couldn't parallel-parse the 1146-page document above. Now that I can, I get nice speed and constant memory usage (around 200MB resident size per worker) with 8 CPUs:

```
Extracted 928 tables, 343.00 sec / run
```

This is definitely a lot faster than Docling (i.e. RT-DETR for detection + TableFormer for structure prediction) on CPU, and it also (now) doesn't suffer from the same memory leak problems (when you lie down with C++, you wake up with memory leaks).
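A simple way to check a per-worker resident-memory figure like the one quoted above is the standard library's `resource` module (Unix only). This is a generic sketch, not part of Camelot or PAVÉS:

```python
import resource
import sys


def max_rss_mb() -> float:
    """Peak resident set size of the current process, in MiB.

    On Linux ru_maxrss is reported in KiB; on macOS it is in bytes.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return rss / (1024 * 1024)
    return rss / 1024


# Each worker process could log this after finishing its share of
# pages, e.g. to verify roughly constant memory use per worker.
print(f"peak RSS: {max_rss_mb():.1f} MiB")
```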
Keeping Python 3.8 is a nice-to-have; I was thinking about dropping support for it. @vinayak-mehta Can you please have a look at this important PR? I'm about to give this a 👍 @dhdaines Thanks for all your work on this.
Yeah, I think there are a lot of installations of it still out there (Ubuntu 20 and friends), in the same way that Python 3.6 lives on eternally in a billion CentOS 7 installations :-( The remaining issue with 3.8 appears to be something to do with the…
I knew about PyPy but not mypyc; I will definitely take a look at that! The core parser in PLAYA-PDF just uses Python's regex engine, so it probably can't be optimized much more, but there are a bunch of other parts that are strongly typed and would benefit immensely from being written in a lower-level language (or even in JavaScript...)
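To illustrate the kind of strongly-typed code that mypyc compiles well (this is a made-up example, not actual PLAYA code): a fully annotated function with plain Python semantics, where static types let mypyc emit C-level arithmetic instead of dynamic dispatch, while the same file still runs as ordinary Python. `typing.Tuple` is used for Python 3.8 compatibility.

```python
from typing import Tuple

# A 6-tuple (a, b, c, d, e, f) in the PDF affine-matrix convention.
Matrix = Tuple[float, float, float, float, float, float]


def apply_matrix(matrix: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Apply an affine transformation matrix to the point (x, y)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# Translation by (10, 20): (5, 5) maps to (15, 25).
print(apply_matrix((1.0, 0.0, 0.0, 1.0, 10.0, 20.0), 5.0, 5.0))
```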
The Python 3.8 failure happens with 3.8.18 but not with 3.8.20; it appears to be a bug in setuptools. I've added a base dependency on…
This PR does quite a few things! If you think it's too intrusive, I have a ~~bridge to sell you~~ different one which simply replaces `pdfminer.six` with PAVÉS. But first, read on:

- Removes `pypdf`, since it seems a bit excessive to have three different PDF parsers in the same ~~trenchcoat~~ library (pypdfium2, pdfminer.six, and pypdf), but this means that:
- `pdftopng` doesn't actually allow you to extract individual pages, and in fact always seems to extract either the first or the last page, so replace it with `pdftocairo` from the poppler-tools (you may wish to not do this)
- Removes `StrByteType`, since it's not possible for the paths constructed in `download_url` to ever be `bytes` (paths can be `bytes`... but these ones never will be, as they're being constructed from a `str`)

Why would you want to do all this?
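To make the `pdftopng` → `pdftocairo` point concrete: `pdftocairo` can render exactly one page via its `-f`/`-l` (first/last page) options, which is the capability `pdftopng` lacks. A hedged sketch of building such an invocation (the exact command used in this PR may differ; `pdftocairo_cmd` is a made-up helper name):

```python
from typing import List


def pdftocairo_cmd(pdf_path: str, page: int, out_prefix: str) -> List[str]:
    """Build a pdftocairo command that renders a single page to PNG.

    -f/-l restrict rendering to one page; pass the result to
    subprocess.run() to execute (requires poppler to be installed).
    """
    return [
        "pdftocairo", "-png",
        "-f", str(page), "-l", str(page),
        pdf_path, out_prefix,
    ]


cmd = pdftocairo_cmd("input.pdf", 3, "page3")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment with poppler installed
```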
This should actually be ready to merge, but I'm not super sure about that [MRG] tag, since it touches a lot of code.