When I call `convert_to_pdf` it causes the resulting pdf font to be ghosted. #3356

izerui · 2024-04-07T08:25:09Z

izerui
Apr 7, 2024

When I call `convert_to_pdf`, the fonts in the resulting PDF come out ghosted.

This code reproduces the result:

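    import fitz  # PyMuPDF
    import httpx

    def test_ghost(self):
        # "data" instead of "bytes", so the built-in type is not shadowed
        data = httpx.get(
            'https://tfile.yj2025.com/pdf-processor/source/2024-04-07/mt_04_24024_0_812--1.pdf').content
        doc = fitz.open('pdf', data)
        for page in doc:
            page.clean_contents()
        # round-trip through convert_to_pdf(); the output shows the ghosted fonts
        doc = fitz.open('pdf', doc.convert_to_pdf())
        doc.save('xxx.pdf')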

Is this a bug, or is there a parameter I need to set for the conversion? It looks as if PyMuPDF performs some kind of OCR during the conversion.
Thanks!

JorjMcKie · 2024-04-07T10:57:07Z

JorjMcKie
Apr 7, 2024
Maintainer

As documented for the method .convert_to_pdf(), the conversion is not guaranteed to always work. Font errors are a frequent cause of failure, and that is what happens in this case:

In [1]: import fitz
In [2]: doc = fitz.open("test.pdf")
In [3]: pdfdata = doc.convert_to_pdf()
cannot create ToUnicode mapping for NQAVHI+AdobeSongStd-Light
cannot create ToUnicode mapping for KJRDRO+FangSong_GB2312
cannot create ToUnicode mapping for LNBJFM+Symbol_ASME
cannot create ToUnicode mapping for PPUUMW+SimHei
In [4]: new = fitz.open("pdf", pdfdata)
In [5]: new.ez_save("converted.pdf")

The messages come from the MuPDF converter.
The result of the method should not be trusted in cases like this, and weird things that happen during the conversion are not worth pursuing.

What are you trying to achieve anyway?

0 replies

izerui · 2024-04-07T12:07:49Z

izerui
Apr 7, 2024
Author

@JorjMcKie
I want to copy everything in a PDF, including its annotations, into an area of another PDF. Because show_pdf_page() does not copy annotations, the destination PDF may lose some necessary information. With convert_to_pdf I get a new PDF in which the annotations are converted into a form that show_pdf_page can handle, but the text then shows duplicated lines, as if it had gone through OCR. Do you know of another way to meet this need? Thank you very much!

3 replies

JorjMcKie Apr 7, 2024
Maintainer

Just as I suspected! I have good news for you:

The base library has a solution for this, and it is immediately available in PyMuPDF: a function that "bakes" annotations and fields (!!!) into the PDF, which means it converts these items into normal page content.
You have to bake the source PDF before using it in .show_pdf_page(). After baking, no annotations and no fields will exist anymore, but otherwise every page will look exactly the same.

This is how it works:

src = fitz.open("source.pdf")  # this actually is a "fz_document" with an underlying PDF document
src_pdf = fitz.mupdf.pdf_document_from_fz_document(src)  # access underlying PDF (pass src, the document opened above)
fitz.mupdf.pdf_bake_document(src_pdf, 1, 1)  # bake annots and fields into the pages

# now the pages of src can be used in `show_pdf_page`.

The previous annotations and fields are visible in the target page as normal text (and drawings where applicable), and can thus be extracted like normal.

Answer selected by izerui

izerui Apr 7, 2024
Author

I went out for a meal and came back to such great news, thank you so much, I'm going to try it right away, thanks again!!!!

izerui Apr 7, 2024
Author

As you say, it works and is perfect.
