How to reduce the file size of the extracted html? #1554

DJay921 · 2022-01-23T07:09:34Z

DJay921
Jan 23, 2022

The quality of the extracted HTML output from PyMuPDF is far better than what I was getting with some of the other libraries, like the PDFBox wrapper for Python. However, one concern I have is the output file size, which is considerably larger (1.5 MB) than with the other option (400 KB). I am already skipping images by not setting the fitz.TEXT_PRESERVE_IMAGES flag. Apart from this, how can I further reduce the size of the output HTML file? I'm looking for a minified version of the HTML code. I want to preserve whitespace if possible, since the PDF contains a few tables as well. Thanks.

JorjMcKie · 2022-01-23T09:14:26Z

JorjMcKie
Jan 23, 2022
Maintainer

This is a thin wrapper around an original MuPDF function, so there is no way for me to influence the output, sorry.
Maybe there are postprocessors on the market that offer syntax optimizations (tidy?), but I really don't know much about this area.
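
For readers looking for a starting point, here is a minimal post-processing sketch using only the Python standard library. It is not part of PyMuPDF or MuPDF. The function names (`shrink_html`, `hoist_inline_styles`) are hypothetical, and it rests on assumptions about the extracted markup that you should verify against your own output: that most of the size comes from repeated inline `style="..."` attributes, that elements carrying a `style` attribute have no `class` attribute, and that whitespace directly after `</p>`, `</div>` or an opening `<div ...>` is not rendered. Whitespace inside paragraphs and between spans is left untouched, so table-like spacing should survive.

```python
import re
from collections import Counter

# Patterns are deliberately conservative; they do not parse HTML properly.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Whitespace that directly follows a block boundary and precedes another tag.
BLOCK_GAP = re.compile(r"(</p>|</div>|<div\b[^>]*>)\s+(?=<)")
STYLE_ATTR = re.compile(r'\sstyle="([^"]*)"')


def hoist_inline_styles(html: str, min_count: int = 2) -> str:
    """Replace inline styles seen at least `min_count` times with short classes."""
    counts = Counter(STYLE_ATTR.findall(html))
    classes = {
        style: f"s{i}"
        for i, (style, n) in enumerate(counts.most_common())
        if n >= min_count
    }
    if not classes:
        return html

    def swap(match: re.Match) -> str:
        name = classes.get(match.group(1))
        # Styles that occur too rarely stay inline.
        return f' class="{name}"' if name else match.group(0)

    body = STYLE_ATTR.sub(swap, html)
    css = "".join(f".{name}{{{style}}}" for style, name in classes.items())
    # Assumption: a <style> element in front of the fragment is acceptable.
    return f"<style>{css}</style>{body}"


def shrink_html(html: str) -> str:
    """Strip comments and block-level gaps, then deduplicate inline styles."""
    html = COMMENT.sub("", html)
    html = BLOCK_GAP.sub(r"\1", html)
    return hoist_inline_styles(html)
```

The input would be the string returned by `page.get_text("html")`, and the result of `shrink_html` is what you write to disk. How much this saves depends entirely on how repetitive the inline styles are, so compare the sizes before and after on a real page. A dedicated tool such as tidy or a full HTML minifier may do better.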

1 reply

DJay921 Jan 23, 2022
Author

Sure, thanks @JorjMcKie
