Description
See #28 - on a large PDF with a lot of images, the fact that Camelot is doing this:

    # for plotting
    table._image = self.pdf_image  # Reuse the image used for calc

leads to ever-increasing memory consumption, which is usually fatal in the case of parallel processing. For instance, on this document of 1100+ pages (and 928 tables detected by Camelot) it ends up using some 20GB of memory: https://www.laval.ca/wp-content/uploads/2025/02/cdu-1-reglement.pdf - if I remove that line, memory usage stays constant at around 250MB per worker process.
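To make the retention pattern concrete, here is a minimal sketch. It is not Camelot's code: `Table`, `process_pages`, `retained_bytes`, and the 1 MB `bytearray` standing in for a rendered page image are all hypothetical names invented for the illustration. It only shows that attaching the page image to every result object keeps every image alive for as long as the results are.

```python
import tracemalloc

PAGE_IMAGE_BYTES = 1_000_000  # assumed size of one "rendered page" (stand-in)


class Table:
    """Hypothetical stand-in for a detected table; not Camelot's class."""

    def __init__(self, page):
        self.page = page
        self._image = None


def process_pages(n_pages, keep_image):
    """Pretend to parse n_pages pages, finding one table on each."""
    tables = []
    for page in range(1, n_pages + 1):
        image = bytearray(PAGE_IMAGE_BYTES)  # "render" the page
        table = Table(page)
        if keep_image:
            # The pattern in question: the result holds on to the page image.
            table._image = image
        tables.append(table)
    return tables


def retained_bytes(keep_image, n_pages=20):
    """Return the traced memory still allocated while the results are alive."""
    tracemalloc.start()
    tables = process_pages(n_pages, keep_image)
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del tables
    return current


if __name__ == "__main__":
    print("keeping images:", retained_bytes(True))
    print("dropping images:", retained_bytes(False))
```

With `keep_image=True` the retained memory grows with the number of pages; with `keep_image=False` each image can be freed as soon as its page is done, which matches the constant per-worker usage described above.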
But also, reusing the image like this is just unnecessary, because when the user does want to do some plotting, the page image appears to get regenerated anyway if it doesn't already exist.
EXCEPT! The one-page-at-a-time assumption pervasive in Camelot strikes again, as the regeneration code mentioned above won't render the correct page if the table isn't on page 1 (and thus plotting is actually currently broken for pages other than 1 if you don't use the lattice parser)... So in fact the fixes in #589 to allow processing specific pages in the backend are also necessary to solve this problem. I've taken the liberty of fixing this in that pull request ;-)
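For what it's worth, the general shape of "regenerate on demand, for the right page" could look something like the sketch below. This is an assumption-laden illustration, not the change made in #589: `LazyTable`, `image_for_plotting`, the `Renderer` callable, and `fake_render` are hypothetical names, and the renderer here just returns a labelled byte string instead of a real image.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical renderer signature: (pdf_path, 1-based page number) -> image bytes.
Renderer = Callable[[str, int], bytes]


@dataclass
class LazyTable:
    """Hypothetical table that remembers where it came from, not the image."""

    pdf_path: str
    page: int  # 1-based page the table was found on
    render: Renderer
    _image: Optional[bytes] = None

    def image_for_plotting(self) -> bytes:
        # Render only when plotting is actually requested, and pass the
        # table's own page number rather than assuming page 1.
        if self._image is None:
            self._image = self.render(self.pdf_path, self.page)
        return self._image


def fake_render(pdf_path: str, page: int) -> bytes:
    """Stand-in renderer so the sketch runs without a PDF backend."""
    return f"{pdf_path}:page-{page}".encode()


if __name__ == "__main__":
    table = LazyTable("doc.pdf", 7, fake_render)
    print(table.image_for_plotting())
```

Only tables that are actually plotted end up holding an image this way, so a batch run that never plots keeps nothing. Dropping the cache entirely (rendering on every call) would trade plotting speed for even less retained memory.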