Can TextWriter embed a subset of a font automatically? #1910

cbm755 · 2022-09-09T00:12:03Z

cbm755
Sep 9, 2022

Background: this code puts two Chinese characters (plus some Latin text) on the page. The resulting file size is 4 KiB.

import fitz  # PyMuPDF
from fitz import Rect

doc = fitz.open()  # new, empty PDF
pg = doc.new_page()

excess = pg.insert_textbox(Rect(0, 50, 200, 150), "我爱PyMuPDF!", fontname="china-ss")
print(excess)
assert excess > 0  # a negative value would mean the text did not fit

print(pg.get_fonts())
doc.save("foo.pdf")

This one does something very similar, but the file size is 3.5 MiB:

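import fitz  # PyMuPDF
from fitz import Point, Rect, TextWriter

doc = fitz.open()  # new, empty PDF
pg = doc.new_page()

r = Rect(0, 100, 300, 200)
tw = TextWriter(pg.rect)
tw.append(Point(100, 40), "我爱PyMuPDF!", fontsize=10)
pg.write_text(rect=r, writers=tw)  # place the writer's text into rectangle r
pg.draw_rect(r)  # outline the target rectangle

print(pg.get_fonts())
doc.save("foo.pdf")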

I know the comparison is not fair because the first one does not embed fonts and (I'm supposing) does not work reliably on printers, for example. Adobe Acrobat users have to download extra stuff. There are various advantages to the second option, but file size is certainly not one of them.

Is it doable for PyMuPDF to embed a subset? I realize this gets really tricky when one uses multiple TextWriters over multiple pages.

But to keep it simple, suppose I had just one TextWriter. In my use case it's to "stamp" folks' names on the front of existing files.

Answered by JorjMcKie

cbm755 · 2022-09-09T00:25:02Z

cbm755
Sep 9, 2022
Author

The docs say:

"Actually, you would rarely ever need another sans-serif font than “Droid Sans Fallback Regular”. Except that this font file is relatively large and adds about 1.65 MB (compressed) to your PDF file size."

Hmmm, so why is my example 3.5MiB?

1 reply

JorjMcKie Sep 9, 2022
Maintainer

"Hmmm, so why is my example 3.5MiB?"

Because you refuse to compress your PDF properly when saving 😉. Simply use doc.ez_save() ("easy save"), which applies different default save options: garbage=3, deflate=True.
This brings your PDF size down to 1.5 MB.
If you also call doc.subset_fonts() before saving, then only the smallest possible part of the CJK font will be left over, giving this printout:

Built subset of font 'Droid Sans Fallback Regular'.
[(6,
  'ttf',
  'Type0',
  'BTZPSZ+Droid Sans Fallback Regular',
  'F0',
  'Identity-H',
  18),
 (12, 'cid', 'Type0', 'Helvetica', 'F1', 'Identity-H', 18)]

and a total file size of 36 KB (!).
The compressed size of the CJK font subset is 26933 bytes. The prefix BTZPSZ+ identifies this font as a subset.
fontTools (https://pypi.org/project/fonttools/) must be installed for this to work.

Unfortunately, the Base-14 font replacements used by TextWriter are CID fonts, which cannot be subsetted - at least not by fontTools, which I am using.

JorjMcKie · 2022-09-09T07:09:47Z

JorjMcKie
Sep 9, 2022
Maintainer

The Document method subset_fonts() is independent of TextWriter and always works. It walks through all the PDF pages and collects the characters they use, grouped by font, but only for fonts that are not already subsets.
It then passes each font, together with all of its characters used in the file, to fontTools and lets it compute a subset font.
If successful (this should work for OTF, TTF and WOFF fonts), the subset font file replaces the original. In addition, the font (base) name is prefixed with the PDF-specific 6-character subset prefix.

0 replies

cbm755 · 2022-09-09T16:42:53Z

cbm755
Sep 9, 2022
Author

Very nice, and very complete answer. Sorry for not RTFM on subset_fonts.

One thing that scares me a bit about garbage=3 and subset_fonts(), is that both will presumably muck around with my existing PDF (in a real example, I load a PDF, write some stuff on it, then save again). Ideally that would involve minimal "touching" of the upstream PDF file, but maybe this is not realistic!

In particular, if the upstream PDF did not subset its fonts, it seems "rude"/dangerous for me to do so. Maybe this concern is unfounded. Is it realistic to request an allow-list for subset_fonts to muck around with, something like doc.subset_fonts(eligible_fonts=['Droid Sans Fallback Regular'])?

1 reply

JorjMcKie Sep 9, 2022
Maintainer

Maybe this concern is unfounded.

I am fairly sure this is so. It is a lot more probable that fonts from upstream are already subsetted. You may also watch a few example runs to confirm that no unexpected fonts are announced in the message.
As far as garbage=3 is concerned: this is not debatable anyway, because removal of unused stuff only ever happens with a garbage option > 2. So even if a font selection option existed, the old font file would remain in the file without garbage collection.
And because garbage collection is a democratic function, all objects benefiting from it would be changed. The same is true for compression: you may restrict it to deflating only images or only fonts, but you cannot be more granular than that.
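
One way to do that check before calling subset_fonts() is to look for fonts whose base name lacks the six-letter subset tag. A small sketch, assuming the input has the tuple layout of the font list printed earlier in this thread (base name at index 3). fonts_not_yet_subset is a hypothetical helper, and whether PyMuPDF applies exactly this test internally is an assumption.

```python
import re

# A subset font's base name starts with six uppercase letters and a "+".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def fonts_not_yet_subset(font_entries):
    """Return the base names of fonts that carry no subset prefix.

    `font_entries` is a list of tuples shaped like the items of
    page.get_fonts(), where index 3 holds the font's base name.
    """
    return [entry[3] for entry in font_entries
            if not SUBSET_PREFIX.match(entry[3])]


# The two entries from the printout earlier in this thread.
entries = [
    (6, "ttf", "Type0", "BTZPSZ+Droid Sans Fallback Regular", "F0", "Identity-H", 18),
    (12, "cid", "Type0", "Helvetica", "F1", "Identity-H", 18),
]
print(fonts_not_yet_subset(entries))  # → ['Helvetica']
```

Run against page.get_fonts() of an upstream PDF, an empty result would suggest that every font is already a subset.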

Anyway, restricting font subsetting to only certain fonts is nothing that you can expect to be implemented any time soon. Not in the next version for sure.

If "politeness" is a real concern, you should probably not save the PDF to a new file at all, but use doc.saveIncr(). Your changes would thus be appended and not change any part of the original file. Of course you would want to keep the data volume of your changes small in that case: no big new fonts, etc. Probably not even using TextWriter ...
