A document has every space encoded as � #2689

nikitar · 2023-08-22T03:37:36Z

When using page.get_text_blocks on a specific document (attached), every single space becomes the Unicode replacement character � (U+FFFD, decimal 65533), e.g. "The�Count�of�Monte�Cristo\n". I'm aware that this is how MuPDF/PyMuPDF denotes glyphs it cannot map to a character, but it's odd that the same document reads fine in Apple's Preview and Google's Chrome/PDFium.

797The-Count-of-Monte-Cristo.pdf

To Reproduce (mandatory)

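    import fitz  # PyMuPDF

    PDF_PATH = "797The-Count-of-Monte-Cristo.pdf"  # the attached document

    # Dehyphenate words split across lines; clip text to the page's mediabox.
    flags = fitz.TEXT_DEHYPHENATE | fitz.TEXT_MEDIABOX_CLIP
    with fitz.open(PDF_PATH) as doc:
        page = doc[0]
        blocks = page.get_text_blocks(flags=flags)
        for i, block in enumerate(blocks):
            print(f'{i} - {block}')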

Your configuration (mandatory)

macOS and Ubuntu
Python 3.11

3.11.3 (v3.11.3:f3909b8bc8, Apr  4 2023, 20:12:10) [Clang 13.0.0 (clang-1300.0.29.30)] 
 darwin 
 
PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
Version date: 2023-06-21 00:00:01.
Built for Python 3.11 on darwin (64-bit).

Additional question

Does (Py)MuPDF have any sort of 'drop invalid characters' option? Ideally we'd drop both � characters and others (e.g. split surrogates from #2608, or Private Use code points such as U+10FC31).

Of course, I can do string.replace('\uFFFD', ' '), but that messes with the page.get_text_words result, plus I'd need to compile a list of all 'bad characters', which seems wrong.
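One way to avoid maintaining a hand-written list of "bad characters" is to filter by Unicode general category. The following is only a sketch of that idea, not a PyMuPDF feature: the function name `clean_text` and its `replacement` parameter are made up for illustration, and it assumes that U+FFFD should become a space (as in this document) while surrogates (category `Cs`) and Private Use code points (category `Co`) are simply dropped.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"  # what MuPDF emits for glyphs it cannot map

def clean_text(text, replacement=" "):
    """Replace U+FFFD and drop surrogates / Private Use code points.

    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    out = []
    for ch in text:
        if ch == REPLACEMENT_CHAR:
            # Assumption: in this document U+FFFD stands in for a space.
            out.append(replacement)
        elif unicodedata.category(ch) in ("Cs", "Co"):
            # Cs = lone surrogate, Co = Private Use (e.g. U+10FC31): drop.
            continue
        else:
            out.append(ch)
    return "".join(out)
```

For word-level output, the same function could be applied to the text item of each word tuple, but the word boundaries themselves would still be those computed before the cleanup, so this does not replace a fix at the extraction level.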

JorjMcKie (Maintainer) · 2023-08-22T13:36:38Z

Confirming the issue. Also happens with the base library. I am forwarding this to the base library's issue system, ok?

As per your question:
Interesting idea. I thought about a related option specifically for the "words" extraction: accepting additional characters as word delimiters, such as string.whitespace, string.punctuation, non-breaking spaces, etc.
Adding an option for characters that automatically get converted to spaces goes in a similar direction. However, in your example you might get away with translating 0xFFFD to 0x20, but that is a special case, for which we will hopefully find a better solution.
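The "additional word delimiters" idea can be illustrated on plain strings. This is a minimal sketch only: `split_words` and its default delimiter set are hypothetical and not an existing PyMuPDF option, and it works on already extracted text, so it knows nothing about bounding boxes.

```python
import string

# Assumed default: treat U+FFFD, non-breaking space and ASCII punctuation
# as extra word delimiters, in addition to ordinary whitespace.
DEFAULT_DELIMITERS = "\ufffd\u00a0" + string.punctuation

def split_words(text, extra_delimiters=DEFAULT_DELIMITERS):
    """Split text into words, treating extra_delimiters like spaces."""
    table = str.maketrans({ch: " " for ch in extra_delimiters})
    # str.split() without arguments splits on any run of whitespace.
    return text.translate(table).split()
```

Passing `extra_delimiters="\ufffd"` restricts the behaviour to the special case discussed here, where only the replacement character is converted to a space.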

JorjMcKie (Maintainer) · 2023-08-22T14:18:08Z

For your information: this is the issue entered on MuPDF's bug tracker.

JorjMcKie (Maintainer) · 2023-08-25T11:19:39Z

I looked a bit deeper into the problem.
Something is odd with the only two fonts used, "LiberationSerif" and "LiberationSerif-Bold". So I made a little script that replaces these fonts with fresh, full versions, just to see what happens.
The result is a book whose text can be extracted in the normal way.
Just a few special characters, like long hyphens and apostrophes, are not displayed correctly.
But maybe it helps you in the meantime.
Here is the script together with the required fonts:
change-fonts.zip

JorjMcKie (Maintainer) · 2023-08-25T23:52:26Z

I did some more analysis of your case.
Throughout all pages, only two fonts are used: "Liberation Serif" (Regular) and "Liberation Serif Bold". Both come with a so-called CMAP (Character Map) that translates glyph numbers (which select the visual shapes to draw) back to Unicode code points.
This mechanism is crucial for being able to extract text at all.
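To make the back-translation concrete, here is a small sketch that parses the `beginbfchar` entries of a CMAP into a dictionary and uses it to decode a sequence of glyph numbers. It is an illustration under simplifying assumptions, not MuPDF's actual lookup: it ignores `beginbfrange` sections, assumes well-formed hex strings, and the names `parse_bfchar` and `decode` as well as the sample data are invented for this example.

```python
import re

# One "<code> <unicode>" pair inside a bfchar section.
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text):
    """Return {glyph_number: text} for all bfchar entries in cmap_text."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.DOTALL):
        for code, uni in BFCHAR_ENTRY.findall(block):
            # The target is stored as UTF-16BE hex digits.
            mapping[int(code, 16)] = bytes.fromhex(uni).decode("utf-16-be")
    return mapping

def decode(glyph_numbers, mapping):
    """Translate glyph numbers to text; unknown glyphs become U+FFFD."""
    return "".join(mapping.get(g, "\ufffd") for g in glyph_numbers)

# Illustrative sample data (hypothetical, not taken from the attached PDF).
SAMPLE = """
2 beginbfchar
<0004> <0021>
<0005> <0022>
endbfchar
"""
```

With this sample, glyph numbers 4 and 5 decode to `!` and `"`, while any glyph number without an entry comes back as the replacement character.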

Now comes the point:
In both fonts, glyph number 0003 is used wherever a space should be written. But the CMAP maps this glyph number to Unicode 9 (\t), which is not contained in the font, and MuPDF consequently returns it as 0xFFFD.
I have made another script that changes the CMAP entry for 0003 directly to 0020, the Unicode number of the space character.
So instead of a CMAP looking like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName LiberationSerif-cmap def
/CMapType 2 def
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
100 beginbfchar
<0003> <0009>    % <---- points to invalid unicode / character 
<0004> <0021>
<0005> <0022>
...

... the following CMAP is being used:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName LiberationSerif-cmap def
/CMapType 2 def
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
100 beginbfchar
<0003> <0020>    % <---- glyph number 3 points to space now.
<0004> <0021>
<0005> <0022>

This change (made for both fonts used) removes your problem: all text is extracted as expected.
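The script itself is not shown in this thread, so the following is only a sketch of how such a CMAP patch might look, not the maintainer's actual code. `patch_bfchar` is a pure string function that rewrites a single bfchar entry; `patch_pdf` is a hypothetical wrapper that applies it to every stream containing a bfchar section, using PyMuPDF's `xref_length`, `xref_is_stream`, `xref_stream` and `update_stream`. It has not been run against the attached file.

```python
import re

def patch_bfchar(cmap_text, glyph="0003", new_unicode="0020"):
    """Rewrite the bfchar entry for `glyph` so it maps to `new_unicode`.

    Only lines consisting of exactly "<glyph> <hex>" are touched, so
    three-column bfrange lines starting with the same code are left alone.
    """
    pattern = re.compile(
        r"^([ \t]*)<" + re.escape(glyph) + r">[ \t]*<[0-9A-Fa-f]+>(?=[ \t]*\r?$)",
        re.MULTILINE | re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: f"{m.group(1)}<{glyph}> <{new_unicode}>", cmap_text
    )

def patch_pdf(src, dst):
    """Hypothetical wrapper: patch all bfchar streams of `src`, save as `dst`."""
    import fitz  # PyMuPDF, imported lazily so patch_bfchar works without it

    doc = fitz.open(src)
    for xref in range(1, doc.xref_length()):
        if not doc.xref_is_stream(xref):
            continue
        data = doc.xref_stream(xref)  # decompressed stream content
        if data is None or b"beginbfchar" not in data:
            continue
        # latin-1 round-trips every byte value unchanged.
        patched = patch_bfchar(data.decode("latin-1")).encode("latin-1")
        if patched != data:
            doc.update_stream(xref, patched)
    doc.save(dst)
    doc.close()
```

Note that this patches glyph 0003 in every CMAP it finds, which matches this document (two fonts, both affected) but would be too broad as a general-purpose fix.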

So why does other text extraction software produce correct results in this situation?

It seems that this is simply the result of guesswork. The original, full Liberation Serif font files do map glyph number 3 to space.

I haven't received a note from MuPDF's issue system, so I guess we will have to wait for a definite answer from them.
But it looks like this is not a bug in (Py)MuPDF.

nikitar (Author) · 2023-08-28T01:43:43Z

Thanks for the details @JorjMcKie!

The document itself is not important. I just grabbed a bunch of random documents and reported the two most noticeable issues (in case it might help make (Py)MuPDF better).

That said, as someone not intimately familiar with unicode and fonts, your analysis was very educational!

JorjMcKie (Maintainer) · 2023-09-25T15:16:46Z

I am going to move this to "Discussions": as we have clarified, there is no bug (except in the PDF itself).
