Document.tobytes() maybe not convert it to binary data??? #2236
Unanswered
jscodecode asked this question in Looking for help
Replies: 2 comments 3 replies
-
Here is a major misconception: |
1 reply
-
Well, you must extract the text. This will be in UTF-8 encoding, so it should be a no-brainer:
doc = fitz.open("tst.pdf")
text = chr(12).join([page.get_text(sort=True) for page in doc])
The result, |
2 replies
-
This is my code:
doc = fitz.open('test.pdf')
doc = doc.tobytes()
doc = doc.decode(encoding='utf-8')
There is an error:
'utf-8' codec can't decode byte 0x9c in position 521: invalid start byte.
I looked at the interface:
tobytes(garbage=0, clean=False, deflate=False, deflate_images=False, deflate_fonts=False, ascii=False, expand=0, linear=False, pretty=False, no_new_id=False, encryption=PDF_ENCRYPT_NONE, permissions=-1, owner_pw=None, user_pw=None)
ascii (bool) – convert binary data to ASCII.
Maybe Document.tobytes() does not convert it to binary data?
How can I solve this problem?
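For readers hitting the same error: Document.tobytes() returns the PDF file itself as a bytes object, not the document's text, so decoding it as UTF-8 will generally fail once the decoder reaches binary stream data. The sketch below reproduces the same kind of failure with the standard library only. It assumes, as an illustration, that the offending bytes come from a zlib-compressed stream (whose header at the default level is typically 0x78 0x9c); the `pdf_like` data and the `try_decode` helper are hypothetical and are not taken from test.pdf or from PyMuPDF.

```python
import zlib

# Stand-in for a compressed PDF content stream (illustrative data only).
payload = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET")

# Hypothetical "PDF-like" bytes: an ASCII header followed by binary stream data.
pdf_like = b"%PDF-1.7\n" + payload


def try_decode(data: bytes) -> str:
    """Return 'ok' if data is valid UTF-8, else describe the first bad byte."""
    try:
        data.decode("utf-8")
        return "ok"
    except UnicodeDecodeError as exc:
        return f"byte {data[exc.start]:#x} at position {exc.start}: {exc.reason}"


print(payload[:2].hex())           # first two bytes of the zlib stream
print(try_decode(b"%PDF-1.7\n"))   # the pure-ASCII header decodes fine
print(try_decode(pdf_like))        # the compressed part does not
```

To get the text, extract it page by page with page.get_text(), as in the reply above, instead of decoding the bytes returned by tobytes().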