Document.tobytes() maybe not convert it to binary data??? #2236

jscodecode · 2023-02-14T12:17:52Z

jscodecode
Feb 14, 2023

This is my code:
doc = fitz.open('test.pdf')
doc = doc.tobytes()
doc = doc.decode(encoding='utf-8')
There is an error:
'utf-8' codec can't decode byte 0x9c in position 521: invalid start byte.

I looked at the interface:
tobytes(garbage=0, clean=False, deflate=False, deflate_images=False, deflate_fonts=False, ascii=False, expand=0, linear=False, pretty=False, no_new_id=False, encryption=PDF_ENCRYPT_NONE, permissions=-1, owner_pw=None, user_pw=None)
ascii (bool) – convert binary data to ASCII.

Maybe Document.tobytes() does not convert it to binary data?
How can I solve this problem?

JorjMcKie · 2023-02-14T12:56:40Z

JorjMcKie
Feb 14, 2023
Maintainer

Here is a major misconception: Document.tobytes() returns a PDF in memory, not text. Therefore it must be binary.
What do you want to achieve? Text maybe?
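One likely source of the 0x9c byte in the error above (an assumption, since the thread does not show what sits at position 521) is a Flate-compressed stream inside the PDF: zlib output with default settings begins with the two bytes 0x78 0x9c, and 0x9c cannot start a UTF-8 character. A minimal standard-library sketch, using a made-up content stream rather than a real PDF:

```python
import zlib

# Hypothetical stand-in for a PDF content stream; real PDFs usually store
# such streams Flate (zlib) compressed.
plain = b"BT /F1 12 Tf (Hello) Tj ET"
compressed = zlib.compress(plain)


def try_decode(data: bytes) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"cannot decode byte 0x{data[exc.start]:02x} at position {exc.start}"


print(compressed[:2].hex())                       # the zlib header bytes
print(try_decode(compressed))                     # compressed data is not text
print(try_decode(zlib.decompress(compressed)))    # the uncompressed stream is
```

The compressed bytes are perfectly valid binary data; they just are not UTF-8 text, which is why decode() on a whole PDF fails.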

1 reply

jscodecode Feb 14, 2023
Author

Yes. I need to convert it to UTF-8 text, but my code fails at doc.decode(encoding="utf-8").
I think the cause is the bytes of the document.
Some of the bytes, like the 0x9C in the error message, cannot be decoded.
How can I solve this problem? Thank you very much!

JorjMcKie · 2023-02-14T13:34:27Z

JorjMcKie
Feb 14, 2023
Maintainer

Well you must extract the text. This will be in UTF-8 encoding, so should be a no-brainer:

doc = fitz.open("tst.pdf")
text = chr(12).join([page.get_text(sort=True) for page in doc])

The result, text, is a UTF-8 string with a formfeed character (chr(12)) between pages.
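A small note on "UTF-8 string": in Python 3 the extracted text is a str (Unicode), and UTF-8 only comes into play once it is encoded to bytes. The sketch below illustrates that round trip; the pages list is a hypothetical stand-in for what page.get_text(sort=True) would return, so it runs without a PDF.

```python
# Hypothetical stand-ins for the strings returned by page.get_text(sort=True).
pages = ["First page\n", "Second page: café, 中文\n"]

# Same join as in the snippet above: a form feed (chr(12)) between pages.
text = chr(12).join(pages)

# str -> bytes: this is the step where UTF-8 actually comes in.
data = text.encode("utf-8")

# bytes -> str works here because these bytes really are UTF-8 text.
restored = data.decode("utf-8")

# Splitting on the form feed recovers the per-page strings.
print(restored.split(chr(12)) == pages)
```

This is the reverse of the failing call in the question: decode() is fine on bytes that came from encoding text, but not on the bytes of a PDF file.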

2 replies

JorjMcKie Feb 14, 2023
Maintainer

The tobytes() method is something entirely different: it does the same thing as doc.save(), except that it writes to memory instead of to disk.

jscodecode Feb 14, 2023
Author

Yes. I know that from reading the PyMuPDF reference on your website.
