Memory usage grows out of control on large PDFs due to saving images in lattice parser #620

@dhdaines

See #28 - on a large PDF with a lot of images, the fact that Camelot is doing this:

        # for plotting
        table._image = self.pdf_image  # Reuse the image used for calc

leads to ever-increasing memory consumption, which is usually fatal when processing in parallel. For instance, on this document of 1100+ pages (with 928 tables detected by Camelot) it ends up using some 20GB of memory: https://www.laval.ca/wp-content/uploads/2025/02/cdu-1-reglement.pdf. If I remove that line, memory usage stays constant at around 250MB per worker process.
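To illustrate the mechanism (not Camelot's actual code), here is a minimal sketch of why attaching the page image to every table retains memory. `SketchTable`, `process_pages`, and `retained_bytes` are hypothetical names, and a `bytearray` stands in for a rendered page image:

```python
import tracemalloc

PAGE_IMAGE_BYTES = 1_000_000  # stand-in size for one rendered page image


class SketchTable:
    """Hypothetical stand-in for a detected table; not Camelot's Table class."""

    def __init__(self, page, image=None):
        self.page = page
        self._image = image  # a non-None value keeps the page image alive


def process_pages(n_pages, keep_image):
    """Return one fake table per fake page, optionally holding the page image."""
    tables = []
    for page in range(1, n_pages + 1):
        image = bytearray(PAGE_IMAGE_BYTES)  # pretend this is the rendered page
        tables.append(SketchTable(page, image if keep_image else None))
    return tables


def retained_bytes(keep_image, n_pages=20):
    """Measure memory still allocated while the returned tables are alive."""
    tracemalloc.start()
    tables = process_pages(n_pages, keep_image)
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert len(tables) == n_pages
    return current


if __name__ == "__main__":
    print("holding images:", retained_bytes(True))
    print("dropping images:", retained_bytes(False))
```

When the tables hold the image, retained memory scales with the number of pages that have tables. When they do not, each image can be freed as soon as its page is done.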

But also, reusing the image like this is just unnecessary, because in the case where the user wants to do some plotting, the page image would seem to get regenerated anyway if it didn't already exist.

EXCEPT! The one-page-at-a-time assumption pervasive in Camelot strikes again, as the code mentioned above won't render the correct page if the table isn't on page 1 (and thus plotting is actually currently broken for pages other than 1 if you don't use the lattice parser)... So in fact the fixes in #589 to allow processing specific pages in the backend are also necessary to solve this problem. I've taken the liberty of fixing this in that pull request ;-)
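For the sake of discussion, a rough sketch of the general idea: the table remembers which file and page it came from, and the image is rendered only when plotting asks for it. This is not the code in #589; `LazyImageTable`, `get_image`, `release_image`, and the injected `render` callable are all hypothetical names, not Camelot's API:

```python
class LazyImageTable:
    """Hypothetical table that records where it came from, not the pixels."""

    def __init__(self, filename, page, render):
        self.filename = filename
        self.page = page
        # Assumed callable (filename, page) -> image, supplied by the backend.
        self._render = render
        self._image = None

    def get_image(self):
        """Render the page this table is on, only when something asks for it."""
        if self._image is None:
            self._image = self._render(self.filename, self.page)
        return self._image

    def release_image(self):
        """Drop the cached image so its memory can be reclaimed."""
        self._image = None


# Fake renderer for demonstration; it records which pages were requested.
render_calls = []


def fake_render(filename, page):
    render_calls.append((filename, page))
    return f"image of {filename} page {page}"


table = LazyImageTable("doc.pdf", page=7, render=fake_render)
first = table.get_image()   # renders page 7, not page 1
second = table.get_image()  # served from the cached copy
```

Passing the page number through to the renderer is the part that depends on the backend being able to render a specific page.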

Labels: bug