Decoding html Text #2713
Answered by JorjMcKie
mahdyshabeeb asked this question in Looking for help
Dear fitz community, I have a question about interpreting the HTML output of text extraction. I want to get all the text and the images of a certain page, which is why I'm using this output format. However, when I pass the resulting images to the Image.open() method of the Python PIL library, it does not work. I am applying the following steps:
Is there any special preprocessing needed, or a correct way to convert these HTML image representations to bytes?
Answered by
JorjMcKie
Oct 3, 2023
Answer selected by
mahdyshabeeb
Images are embedded in the generated html source like this:
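This means that the string (!) following "base64," is encoded in base64 format. You must parse that string completely and decode it (e.g. with binascii.a2b_base64 or base64.b64decode) into a bytes object, say obj. Then define fp = io.BytesIO(obj) and do img = PIL.Image.open(fp). Note that obj is already a bytes object, so it is passed to io.BytesIO directly; getvalue() exists only on the BytesIO object, not on bytes.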
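The steps above could be sketched roughly as follows. This is a minimal sketch, not an official recipe: it assumes the images appear in the HTML as src="data:image/<format>;base64,<payload>" attributes, and the helper names extract_inline_images and open_with_pil are hypothetical, introduced only for illustration. The PIL step requires Pillow to be installed.

```python
import base64
import io
import re

# Assumption: inline images look like src="data:image/png;base64,<payload>".
# The payload may contain line breaks, so everything up to the closing quote is captured.
DATA_URI_RE = re.compile(r'src="data:image/([A-Za-z0-9.+-]+);base64,([^"]+)"')


def extract_inline_images(html):
    """Return a list of (format, image_bytes) tuples for base64 images found in html."""
    images = []
    for fmt, payload in DATA_URI_RE.findall(html):
        # Remove whitespace and line breaks inside the payload before decoding.
        payload = "".join(payload.split())
        images.append((fmt, base64.b64decode(payload)))
    return images


def open_with_pil(image_bytes):
    """Wrap decoded bytes in a file-like object and open it with PIL."""
    from PIL import Image  # imported lazily; requires Pillow

    return Image.open(io.BytesIO(image_bytes))


# Hypothetical usage, assuming `page` is a PyMuPDF page object:
#   html = page.get_text("html")
#   for fmt, data in extract_inline_images(html):
#       img = open_with_pil(data)
```

A regular expression is enough here because only the src attribute is needed; if the HTML is processed further anyway, an HTML parser would be the more robust choice.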