Tibetan rendering amiss despite attempts with multiple fonts #1029
Replies: 5 comments 1 reply
-
This is not a bug - at best a feature you are missing.
-
This seems to be the same problem as #398.
-
It is also an irritation that office software such as Word or LibreOffice produces the correct output.
-
The general problem seems to be this:
-
This discussion is about "font shaping". You can search the web for more information on this, but here is a precis of the process.

Software such as Word or an HTML browser keeps a notional sequence of Unicode characters. When these characters are to be displayed on the screen, the Unicode sequence and the font are fed into a 'shaper', which maps the Unicode values to a sequence of glyphs to be displayed. For many fonts (in particular for western languages) that is pretty much a 1-1 mapping: each input Unicode character produces one displayed glyph. Sometimes, however, several Unicode characters map to a single glyph (consider 'f' and 'i' being displayed as 'fi', a ligature in many fonts). Likewise, a single Unicode character can produce multiple glyphs (I can't give an example of this off the top of my head, but it happens). In the most general case, any sequence of n input characters might give m output glyphs.

These mappings differ not only between languages and scripts, but also between fonts. (One font might decompose complex shapes into glyphs in a completely different way from another.)

Consider also that some languages are written left-to-right and others right-to-left. Word (and similar software) maps the source text to positioned display text in a way that allows for such 'bidirectional' text.

So, how does this work with PDF? The simple answer is that PDF is not like Word (or similar). PDF is designed to be a display format, first and foremost. As such, the information within a PDF file is at the glyph level, NOT the Unicode character level. All the PDF contains is a list of glyphs and where to put them. The shaping (and bidirectional handling) must already have been done by the PDF producer.
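To make the n-to-m mapping concrete, here is a toy sketch in Python. It is not how HarfBuzz or any real shaper works: the glyph names, the ligature table and the decomposition table are all invented for illustration, whereas a real shaper reads such rules from the font itself.

```python
# Toy "shaper": maps a Unicode string to a list of glyph names.
# All glyph names and both tables are hypothetical; real fonts carry
# their own substitution rules, and these differ from font to font.
LIGATURES = {("f", "i"): "f_i", ("f", "l"): "f_l"}  # 2 chars -> 1 glyph
DECOMPOSE = {"\u00e9": ["e", "acutecomb"]}          # 1 char  -> 2 glyphs


def shape(text):
    """Return the glyph names for text, applying the toy rules above."""
    glyphs = []
    i = 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in LIGATURES:
            # Two input characters collapse into one ligature glyph.
            glyphs.append(LIGATURES[pair])
            i += 2
        elif text[i] in DECOMPOSE:
            # One input character expands into several glyphs.
            glyphs.extend(DECOMPOSE[text[i]])
            i += 1
        else:
            # The common 1-1 case: one character, one glyph.
            glyphs.append(text[i])
            i += 1
    return glyphs


print(shape("fin"))        # 3 characters in, 2 glyphs out
print(shape("caf\u00e9"))  # 4 characters in, 5 glyphs out
```

With a different font the two tables would look different, which is why the shaper always needs both the text and the font.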
In order to allow searching, PDF files (generally) contain extra information that maps glyphs back to Unicode sequences, so that the original text can be recovered, but this should be thought of as an afterthought. None of that information needs to be handled to display a PDF file correctly.

So, to cut a long story short (too late!), the PDF producer needs to take care of this. As of MuPDF (and PyMuPDF) 1.20.0, there is no way to get this done automatically for you.

So, how can you do it? Internally, MuPDF uses a library called HarfBuzz to do this shaping. We use it for our handling of EPUB files, which are basically HTML and need to be laid out and shaped onto the page. There may well be a Python wrapper for HarfBuzz, so you might be able to process your input data and then feed the shaped output to PyMuPDF. I appreciate that that is a lot of work. HarfBuzz does not itself cope with bidirectional layout - that is an extra layer of complexity, but there may be Python libraries out there to help you.

The good news, though, is that the development branch of MuPDF has gained a new feature whereby 'text stories' can be fed into it and laid out into specific areas on the page. This will enable you to generate PDF files with the shaping and bidirectionality all taken care of for you. This will be available in MuPDF 1.21.0 later this year, and (almost certainly) will be exposed within PyMuPDF 1.21.0 too.
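The split between "glyphs for display" and "a mapping back to Unicode for searching" can be pictured with another toy sketch. The glyph IDs, the positions and the mapping below are made up, and this is not the actual PDF data structure, only the idea behind it.

```python
# A page, reduced to what a viewer needs for display:
# a list of (glyph_id, x, y). The IDs and positions are hypothetical.
PAGE_GLYPHS = [(42, 72.0, 700.0), (8, 78.5, 700.0)]

# Optional extra table mapping glyph IDs back to Unicode text.
# Drawing the page never consults it; only text extraction does.
TO_UNICODE = {42: "fi", 8: "n"}


def extract_text(page_glyphs, to_unicode):
    """Recover text from glyph IDs; unmapped glyphs become U+FFFD."""
    return "".join(to_unicode.get(gid, "\ufffd") for gid, _x, _y in page_glyphs)


print(extract_text(PAGE_GLYPHS, TO_UNICODE))  # the ligature glyph expands back to two characters
print(extract_text(PAGE_GLYPHS, {}))          # no mapping: the page would still draw, but the text is lost
```

This is also why a PDF can look perfect on screen and still give garbage when you copy text out of it: the two kinds of information are independent.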
-
Hello,
First of all, thanks for this amazing library; it has helped me achieve things I believed far beyond my ken. I am attempting to output a PDF which contains Tibetan words. For detailed information about the script used to write this language, please take a look at: https://docs.microsoft.com/en-us/typography/script-development/tibetan#examples-of-tibetan.
The issue
doc[0].insertTextbox(rect, "རྒྱསྒྲ", fontsize=fontsize, fontname="jomolhari", fontfile=jomolhari_font_path) produces the attached picture, whereas it should produce what you can hopefully see on your screen: a neatly arranged stack of ར + ག + ཡ (the form of this last consonant changes when it is placed at the bottom of a stack) followed by a neatly arranged stack of ས + ག + ར. Even གྲ is not displayed properly.
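For anyone following along, here is a small standard-library sketch of what that string looks like at the code point level. It assumes the text is encoded as a base letter followed by subjoined letters (written out as escapes below), and it uses a deliberately simple rule for grouping stacks (a new stack starts at every character that is not a nonspacing mark), which is not a full Tibetan syllable parser.

```python
import unicodedata

# Assumed encoding of the sample string: RA + subjoined GA + subjoined YA,
# then SA + subjoined GA + subjoined RA.
text = "\u0f62\u0f92\u0fb1\u0f66\u0f92\u0fb2"


def split_stacks(s):
    """Group each base letter with the nonspacing marks (category Mn) after it."""
    stacks = []
    for ch in s:
        if unicodedata.category(ch) == "Mn" and stacks:
            stacks[-1] += ch   # subjoined letter: belongs to the current stack
        else:
            stacks.append(ch)  # base letter: starts a new stack
    return stacks


for stack in split_stacks(text):
    print([f"U+{ord(c):04X} {unicodedata.name(c)}" for c in stack])
```

Six code points have to end up as two vertical stacks, so placing one glyph per code point side by side cannot give the right picture; the stacking has to be worked out from the font's rules before the glyphs are written to the page.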
My attempts
My configuration
Version date: 2021-04-10 04:00:00.
Built for Python 3.8 on darwin (64-bit).