Decoding html Text #2713

mahdyshabeeb · 2023-10-03T20:23:00Z


Dear fitz community,

I have a question about interpreting the output of page.get_text() when using the "html" option.

I want to get all the text and images of a given page, which is why I'm using this method. However, when I pass the resulting images to the Image.open() method of the Python PIL library, it does not work. These are the steps I am applying:

1. Use page.get_text("html") to get the html representation of the page.
2. Parse the html output with BeautifulSoup.
3. Get all tags of the html content whose name is "img" (i.e. the images).
4. Convert each of these images to bytes.
5. Pass the resulting bytes to PIL.Image.open().

Is there any special preprocessing needed or a correct way to convert these html image representations to bytes?

Answered by JorjMcKie

Oct 3, 2023


JorjMcKie · 2023-10-03T21:07:42Z

JorjMcKie
Oct 3, 2023
Maintainer

Images are embedded in the generated html source as data URIs in the src attribute of each img tag, i.e. src="data:image/...;base64,...".

This means that the string (!) following "base64," is encoded in base64 format. You must extract that string completely and decode it (e.g. with binascii.a2b_base64 or base64.b64decode) into a bytes object, say obj.
Then define fp = io.BytesIO(obj) and do img = PIL.Image.open(fp).
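
For anyone landing here later, a minimal sketch of the decoding step described above. It assumes the img tags carry a src of the form "data:...;base64,..." and, to stay dependency-free, uses the standard library's html.parser in place of BeautifulSoup. The names ImgSrcCollector, decode_data_uri and images_from_html are made up for illustration and are not part of PyMuPDF.

```python
import base64
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def decode_data_uri(src):
    """Return the raw bytes behind a 'data:...;base64,...' string."""
    _header, sep, payload = src.partition("base64,")
    if not sep:
        raise ValueError("src is not a base64 data URI")
    # Drop whitespace and line breaks inside the payload before decoding.
    payload = "".join(payload.split())
    return base64.b64decode(payload)


def images_from_html(html):
    """Return one bytes object per <img> tag found in the html string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return [decode_data_uri(src) for src in parser.sources]
```

Here html would be the string returned by page.get_text("html"). With Pillow installed, each returned bytes object can then be opened via PIL.Image.open(io.BytesIO(data)).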

1 reply

mahdyshabeeb Oct 4, 2023
Author

Great! This works. Thank you! :)
