Most forward way to copy a whole Document ? #2569

4lexRed · 2023-07-31T09:38:28Z

4lexRed
Jul 31, 2023

My script loads a long PDF, splits it into chapters and analyzes each chapter.
For debugging I add some shapes to the document and save the result as an image.
As I do not want to alter the original doc when drawing the shapes, I need to create an independent copy of the doc.

Somehow I noticed: it's not that "straightforward" to copy a whole in-memory PyMuPDF document to a new in-memory object.
(Googling it, I did not find any topic on this issue; only single pages are copied.)

Using copy(org_doc) from the copy module does not create an independent copy => the drawn rectangle also shows up in the original doc.
Using deepcopy(org_doc) throws the error "TypeError: cannot pickle 'SwigPyObject' object". [Probably this is not expected by the developers?]

I ended up with this workaround to copy the doc in memory:

import copy
from contextlib import redirect_stderr
from io import BytesIO, StringIO

import fitz

doc = fitz.Document("test.pdf")

# not working solution -> does not create an independent copy of the doc (see pixmap at the end)
# pdf_copy = copy.copy(doc)

# workaround to copy a pdf in memory - with error message suppression
with StringIO() as buf, redirect_stderr(buf):
    pdfbytes = doc.convert_to_pdf()
    pdf_copy = fitz.open("pdf", pdfbytes)
    buffer_content = buf.getvalue()  # read the buffer content

edit_page = pdf_copy[0]   # define the page to draw the shape to

# draw one rectangle
rect = fitz.Rect(63, 203, 270, 213)
shape = edit_page.new_shape()
shape.draw_rect(rect)
shape.finish(color=(1, 0, 0), fill=(1, 1, 0), closePath=False, fill_opacity=0.3)  # yellow fill red outline
shape.commit()

# save images of both PDFs to disk
pix = edit_page.get_pixmap(matrix=fitz.Matrix(2, 2))  # render the page at double resolution
pix.save("test_pdf_copy_with_shapes.png")

pix = doc[0].get_pixmap(matrix=fitz.Matrix(2, 2))
pix.save("test_pdf_org.png")

Note 1:
The code above is just an example. Reading the PDF from disk once more is not an option.
I also want to avoid writing the PDFs to disk and re-reading them.

Note 2:
doc.convert_to_pdf() also throws a (negligible) error: cannot create ToUnicode mapping for LNMIVH+CourierNewPS-BoldMT
To suppress the error I redirect stderr.

Final questions:

Is there a more straightforward way to copy the whole doc?
Why would deepcopy not work? (Well, the error message explains it, kind of.)
How can I avoid the error mentioned in Note 2?

JorjMcKie · 2023-07-31T12:00:33Z

JorjMcKie
Jul 31, 2023
Maintainer

I think I don't completely understand what you are actually trying to do.

Just analyzing the PDF's content - no update intent?

If this is true, you can do whatever you want with the document: inserting notes, images, ... whatever.
The file on disk will not be changed. All your changes will be discarded when you close. You do not need to make a copy.

Otherwise, you can open the same file two times, which creates different Document objects, initially having identical content: doc1 = fitz.open("input.pdf") and doc2 = fitz.open("input.pdf"). They can be updated independently from each other. And because all this happens in memory (and will stay there as long as you do not save), nothing bad can happen between the two.

1 reply

4lexRed Jul 31, 2023
Author

Sorry if my explanation was a bit too complicated.
Here is what I do:

1. Read a PDF into a Doc.
2. Split the Doc into multiple separate (in-memory) Docs by copying pages in a for loop.
3. Analyze each in-memory Doc
   (and for debugging add some shapes and text; this is where I need the independent copy of the Doc).
4. Generate a filename for each Doc based on the information from step 3.
5. Save each in-memory Doc to disk as .pdf.

As I also save the debugging PDF, I need an independent copy in which the shapes are not drawn.
There seems to be no "straightforward way" to copy an in-memory Doc to another in-memory blob.

The ways I can come up with to create an independent copy are:
a) Create a new in-memory Doc and copy a range of pages from the original Doc to the new Doc (with a loop).
Works, but I might lose some metadata of the Doc.

b) Save the in-memory Doc to disk and read the file on disk into a new Doc.
Works, but why use the slow hard disk if we have fast RAM?

c) Use doc.convert_to_pdf(), save the result to a BytesIO stream and then read it back from the same stream.
[See first post; this method issues an error message, which b) does not issue.]

The straightforward way I would think of would be more like:

doc_new = deepcopy(doc)
doc_new = doc.get_bitwise_copy()

In the end I am just wondering if there is a better option than a) b) or c)
Or if it is intended that deepcopy crashes (see first post)

JorjMcKie · 2023-07-31T16:56:45Z

JorjMcKie
Jul 31, 2023
Maintainer

The functions copy()/deepcopy() of Python module copy are for pure Python objects only. Most PyMuPDF objects are far more than that: they have companion or "shadow" objects within the base library MuPDF.
Therefore these functions cannot be used for what you want.
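
The behaviour can be reproduced with a pure-Python analogy. This is only a sketch: NativeBackedDoc is a hypothetical stand-in class, not PyMuPDF's actual internals. Its list plays the role of the shared "shadow" object, and a threading.Lock plays the role of a native handle that cannot be pickled.

```python
import copy
import threading


class NativeBackedDoc:
    """Hypothetical stand-in for a Python wrapper around native state."""

    def __init__(self):
        self.shapes = []                # plays the role of the "shadow" object
        self.handle = threading.Lock()  # a resource that cannot be pickled


original = NativeBackedDoc()

# copy.copy() creates a new wrapper, but both wrappers share the same state.
shallow = copy.copy(original)
shallow.shapes.append("rect")
print(original.shapes)  # the change made through the copy is visible here too

# copy.deepcopy() tries to duplicate every attribute and fails on the handle.
try:
    copy.deepcopy(original)
except TypeError as exc:
    print("deepcopy failed:", exc)
```

This mirrors what was reported in the first post: the shallow copy is not independent, and the deep copy raises a TypeError.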

But there is an easy way to make memory copies of the current state of the document like this:

pdfdata = doc.tobytes(<save options>)
temp = fitz.open("pdf", pdfdata)

The document temp is a full copy of doc. "Save options" means parameters you would also use when saving the document, like garbage, deflate etc.

That temporary document can be saved to disk like normal whenever deemed required.

You can create several temp objects from the same bytes object pdfdata. That object can be deleted at any time, because a copy is stored inside the created Document.

1 reply

4lexRed Jul 31, 2023
Author

Thx Jorj!
Somehow I have missed that method during the research on google and the pymupdf manual.
