Tibetan rendering amiss despite attempts with multiple fonts #1029

julkamny · 2021-04-21T15:58:38Z

julkamny
Apr 21, 2021

Hello,

First of all, thanks for this amazing library; it has helped me achieve things I believed far beyond my ken. I am attempting to output a PDF that contains Tibetan words. For detailed information about the script used to write this language, please take a look at: https://docs.microsoft.com/en-us/typography/script-development/tibetan#examples-of-tibetan.

The issue

doc[0].insertTextbox(rect, "རྒྱསྒྲ", fontsize = fontsize, fontname="jomolhari", fontfile = jomolhari_font_path) produces the attached picture. It should instead produce what you hopefully see on your screen: a neatly arranged stack of ར + ག + ཡ (the form of this last consonant changes when it is placed at the bottom of a stack) followed by a neatly arranged stack of ས + ག + ར. Even གྲ is not displayed properly.

My attempts

I tried at least half a dozen fonts, including NotoSansTibetan, Yagpo and TibetanMachineUnicodeFont. The one that seems to work best is Jomolhari.
I run into the very same issue with ReportLab. This may hint at a problem on my side? Or at a general problem with Tibetan that is not tied to MuPDF specifically?

My configuration

MacBook Air, High Sierra 10.13.6
PyMuPDF 1.18.12: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-04-10 04:00:00.
Built for Python 3.8 on darwin (64-bit).

JorjMcKie · 2021-04-21T16:05:42Z

JorjMcKie
Apr 21, 2021
Maintainer

This is not a bug - at best a feature you are missing.
Let me convert this to an item in the topic "Discussions".

1 reply

julkamny Apr 21, 2021
Author

I see, thanks for your prompt reply. Glad to hear it's not a bug. I hope someone can provide a solution then!

JorjMcKie · 2021-04-21T16:12:51Z

JorjMcKie
Apr 21, 2021
Maintainer

This seems to be the same problem as #398,
which has also been open and unresolved for a long time.
I am afraid I cannot help you here: every single character's glyph is inserted one by one, with no look-ahead algorithm that takes 2 or more subsequent characters into account to potentially generate a different glyph.
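
To see what "one by one" means for the string in the original post, here is a small sketch that lists its code points with Python's standard unicodedata module. It assumes the string consists of the six code points written out as escapes below; the helper name describe is made up for this illustration.

```python
import unicodedata

# Assumption: the string from the original post, written as explicit escapes
# so that the example does not depend on how this page is encoded.
TEXT = "\u0f62\u0f92\u0fb1\u0f66\u0f92\u0fb2"


def describe(text):
    """Return (code point, general category, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    for code, category, name in describe(TEXT):
        print(code, category, name)
```

Only two of the six code points are base letters (category Lo); the other four are subjoined marks (category Mn) that a font is expected to stack beneath them. Drawing each of the six with its standalone glyph is a plausible explanation for the broken output described above.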

0 replies

JorjMcKie · 2021-04-21T16:15:51Z

JorjMcKie
Apr 21, 2021
Maintainer

It is also irritating that office software like Word or LibreOffice produces the correct output.
I think you should submit an enhancement request to MuPDF via https://bugs.ghostscript.com/enter_bug.cgi

0 replies

JorjMcKie · 2021-04-21T16:24:32Z

JorjMcKie
Apr 21, 2021
Maintainer

The general problem seems to be this:
Every Unicode character has some glyph (visual representation). In certain scripts like yours or Devanagari, special short sequences (maybe 2 or 3) of Unicode characters should be combined into one common glyph instead of each being written as its normal separate glyph.
So the text output logic should know those Unicode sequences (per font) and look out for them, instead of just outputting each character upon encountering it.
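
As a rough illustration of "looking out for sequences", the sketch below groups each base character with the combining marks that follow it, using only the Unicode general category. This is a deliberate simplification and not what MuPDF or a real shaper does (real shaping is driven by the font's own tables); the function name split_stacks is invented for the example.

```python
import unicodedata


def split_stacks(text):
    """Group each base character with the combining marks that follow it."""
    stacks = []
    for ch in text:
        # General categories Mn, Mc and Me all denote combining marks.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and stacks:
            stacks[-1] += ch   # attach the mark to the current stack
        else:
            stacks.append(ch)  # start a new stack
    return stacks


if __name__ == "__main__":
    # Assumption: the six code points of the string from the original post.
    for stack in split_stacks("\u0f62\u0f92\u0fb1\u0f66\u0f92\u0fb2"):
        print([f"U+{ord(ch):04X}" for ch in stack])
```

For the string above this yields two groups of three code points each, matching the two stacks the original post expects to see.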

0 replies

robinwatts · 2022-06-16T14:18:53Z

robinwatts
Jun 16, 2022
Maintainer

This discussion is about "font shaping". You can google to get more information on this.

But to precis the process:

Software such as Word or an HTML browser will keep a notional sequence of unicode chars. When these chars are to be displayed on the screen, this unicode sequence and the font are fed into a 'shaper', which maps those unicode values to a sequence of glyphs to be displayed.

For many fonts (in particular for western languages) that's pretty much a 1-1 mapping, and each input unicode char will produce 1 displayed glyph.

Sometimes, however, we might map several unicode chars to a single glyph (consider the case of 'f' and 'i' being displayed as 'fi' (a ligature in many fonts)).

Likewise, you can even get cases where a single unicode char can produce multiple glyphs (I can't give an example of this off the top of my head, but it happens).

In the most general case, for any sequence of n input characters, you might get m output glyphs. These mappings will be different not only for different languages/scripts, but will also vary between fonts. (One font might choose a completely different way of decomposing complex shapes into glyphs than another one).
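
The n-to-m idea can be sketched with a toy substitution table and a greedy longest-match loop. Everything here is invented for illustration: the rule table, the glyph names and the function toy_shape are hypothetical, and real shapers read such rules from the font rather than from a Python dict.

```python
# Hypothetical substitution rules: a sequence of characters -> a list of glyph names.
# Real fonts store rules like these in their OpenType tables; these are made up.
TOY_RULES = {
    ("f", "i"): ["f_i"],               # two characters -> one ligature glyph
    ("\u00e9",): ["e", "acutecomb"],   # one character (e acute) -> two glyphs
}


def toy_shape(text, rules=TOY_RULES):
    """Map characters to glyph names, preferring the longest matching rule."""
    longest = max((len(key) for key in rules), default=1)
    glyphs = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            key = tuple(text[i:i + size])
            if key in rules:
                glyphs.extend(rules[key])
                i += size
                break
        else:
            # No rule matched: fall back to the 1-1 mapping.
            glyphs.append(text[i])
            i += 1
    return glyphs


if __name__ == "__main__":
    print(toy_shape("fin"))
    print(toy_shape("caf\u00e9"))
```

A different font would ship a different table, which is why the same input text can come out as a different glyph sequence from one font to the next.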

Consider also that some languages are written left-to-right and others right-to-left. Word (and similar) cope with mapping source text to the positioned display of text to allow for such 'bidirectional' text.

So, how does this work with PDF? The simple answer is that PDF is not like Word (or similar). PDF is designed to be a display format, first and foremost. As such the information within a PDF file is at the glyph level, NOT the unicode char level. All the PDF contains is a list of glyphs, and where to put them. The shaping (and bidirectional handling) must have been done by the PDF producer.

In order to allow searching, PDF files (generally) contain extra information that maps back from glyph to unicode sequences, so you can recover the original sequence of text, but this should be thought of as being an afterthought. No handling of such information is required to get correct display of a PDF file.

So, to cut a long story short (too late!), the PDF producer needs to take care of this. As of MuPDF (and PyMuPDF) 1.20.0, there is no way to get this done automatically for you.

So, how can you do it? Well, internally to MuPDF, we use a library called HarfBuzz to do this shaping for us. We use it for our handling of epub files, which are basically HTML. Those need to be laid out and shaped onto the page.

There may well be a Python wrapper for HarfBuzz, so you might be able to process your input data and then feed the shaped output to PyMuPDF. I appreciate that that's a lot of work. HarfBuzz doesn't itself cope with bidirectional layout - that's an extra layer of complexity, but there may be Python libraries out there to help you.
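
One such wrapper is the third-party uharfbuzz package. The following is a minimal sketch, assuming uharfbuzz is installed and that font_path points to a Tibetan font file on your machine; shape_text and layout_glyphs are illustrative names, not part of any library. How to hand the resulting glyph IDs and positions to PyMuPDF is left open here.

```python
def shape_text(font_path, text):
    """Shape text with HarfBuzz.

    Returns (glyph_id, cluster, x_advance, x_offset, y_offset) tuples in font units.
    """
    import uharfbuzz as hb  # third-party; assumed installed (pip install uharfbuzz)

    face = hb.Face(hb.Blob.from_file_path(font_path))
    font = hb.Font(face)
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()  # infer script, language and direction
    hb.shape(font, buf)
    return [
        (info.codepoint, info.cluster, pos.x_advance, pos.x_offset, pos.y_offset)
        for info, pos in zip(buf.glyph_infos, buf.glyph_positions)
    ]


def layout_glyphs(shaped, units_per_em, fontsize, origin=(0.0, 0.0)):
    """Convert shaped glyphs into (glyph_id, x, y) positions for horizontal text.

    Offsets keep HarfBuzz's convention (positive y points up); flip the sign
    if your target coordinate system has y pointing down.
    """
    scale = fontsize / units_per_em
    pen_x, pen_y = origin
    placed = []
    for glyph_id, _cluster, x_advance, x_offset, y_offset in shaped:
        placed.append((glyph_id, pen_x + x_offset * scale, pen_y + y_offset * scale))
        pen_x += x_advance * scale  # marks typically have zero advance
    return placed
```

After shaping, the first tuple element is a glyph index inside the font and no longer a Unicode value, which is exactly the glyph-level information a PDF stores. The cluster value points back to the position in the input text that each glyph came from.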

The good news though, is that the development branch of MuPDF itself has gained a new feature whereby 'text stories' can be fed into it, and can be laid out into specific areas on the page. This will enable you to generate PDF files with the shaping/bidirectionality all being taken care of for you. This will be available in MuPDF 1.21.0 later this year, and (almost certainly) will be exposed within PyMuPDF 1.21.0 too.

0 replies
