multiprocessing - issue to load page - TypeError: cannot pickle 'SwigPyObject' object #2337
-
I'm struggling to use multiprocessing to process multiple pages in parallel. I found some useful documentation in other GitHub issues and in the official recipe (https://pymupdf.readthedocs.io/en/latest/recipes-multiprocessing.html), which helped me understand that the fitz.Document must be reloaded in each process. I tried adapting the code to my needs but failed. Here is a small reproducible example:

```python
import os
from multiprocessing import Pool
from functools import partial

import fitz

filename = r"C:\Users\onedey\AMPLEXOR\AI team - Documents\OCR\Terminology\ROLLS ROYCE.pdf"

def load_page(page_i, filename):
    print("Loading file")
    fitz_doc = fitz.Document(filename)
    print("Loading page")
    page = fitz_doc[page_i]
    print(f"Page {page_i} loaded")
    return page

def dummy_pipeline(filename, page_indexes, nb_cpu=os.cpu_count() - 1):
    with Pool(processes=nb_cpu) as pool:
        partial_func = partial(load_page, filename=filename)
        pages = pool.map(partial_func, page_indexes)
        pool.close()
        pool.join()
    return pages

if __name__ == "__main__":
    pages = dummy_pipeline(filename, [0])
    print(len(pages))
```

The stdout shows that the page is loaded, but some errors are raised and then the script keeps running indefinitely.
Can anybody see what I'm doing wrong and why it isn't working? I'm sorry if this isn't closely enough related to the fitz module.
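For context on the error in the title: `pool.map` has to pickle each worker's return value to send it back to the parent process, and `fitz.Page` wraps a raw C pointer (the `SwigPyObject` in the traceback), which pickle cannot serialize. Any object holding a C- or OS-level handle fails the same way; an open file is a minimal stand-in:

```python
import pickle
import tempfile

# pool.map must pickle worker return values to move them between
# processes. Objects wrapping a C-level handle cannot be pickled;
# an open file object is a simple stand-in for the SwigPyObject
# inside fitz.Page.
with tempfile.TemporaryFile() as handle:
    try:
        pickle.dumps(handle)
    except TypeError as exc:
        print(exc)  # e.g. "cannot pickle '_io.BufferedRandom' object"
```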
-
This is a "Discussions" item, so let's first transfer it.
-
I think I now understand the issue: I was returning the Page object when I should return the result of my processing instead (for instance, the page's extracted text).
Thank you very much for the clarification!