multiprocessing - issue to load page - TypeError: cannot pickle 'SwigPyObject' object #2337
-
I'm struggling to use multiprocessing to process multiple pages in parallel. I found some useful documentation in other GitHub issues and in the official recipe (https://pymupdf.readthedocs.io/en/latest/recipes-multiprocessing.html), which helped me understand that the fitz.Document must be reloaded in each process. I tried adapting the code to my needs but failed. Here is a small reproducible example:

```python
import os
from multiprocessing import Pool
from functools import partial

import fitz

filename = r"C:\Users\onedey\AMPLEXOR\AI team - Documents\OCR\Terminology\ROLLS ROYCE.pdf"

def load_page(page_i, filename):
    print("Loading file")
    fitz_doc = fitz.Document(filename)
    print("Loading page")
    page = fitz_doc[page_i]
    print(f"Page {page_i} loaded")
    return page

def dummy_pipeline(filename, page_indexes, nb_cpu=os.cpu_count() - 1):
    with Pool(processes=nb_cpu) as pool:
        partial_func = partial(load_page, filename=filename)
        pages = pool.map(partial_func, page_indexes)
        pool.close()
        pool.join()
    return pages

if __name__ == "__main__":
    pages = dummy_pipeline(filename, [0])
    print(len(pages))
```

The stdout shows that the page is loaded, but some errors are raised and then the script keeps running indefinitely.
Can anybody see what I'm doing wrong and why it isn't working? I'm sorry if this isn't closely enough related to the fitz module.
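For context on the error in the title: `pool.map` has to pickle each worker's return value to send it back to the parent process, and `fitz.Page` wraps a raw C pointer (the `SwigPyObject` in the traceback), which pickle cannot serialize. Any object holding a C- or OS-level handle fails the same way; an open file is a minimal stand-in:

```python
import pickle
import tempfile

# pool.map must pickle worker return values to move them between
# processes. Objects wrapping a C-level handle cannot be pickled;
# an open file object is a simple stand-in for the SwigPyObject
# inside fitz.Page.
with tempfile.TemporaryFile() as handle:
    try:
        pickle.dumps(handle)
    except TypeError as exc:
        print(exc)  # e.g. "cannot pickle '_io.BufferedRandom' object"
```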
-
This is a "Discussions" item, so let's first transfer it.
-
I think I now understand the issue: I was returning the Page object when I should return the result of my processing instead (for instance, the page's extracted text).
Thank you very much for the clarification!