Text Extraction from PDF Results in Garbled Characters #3801

hrhktkbzyy · 2024-08-22T08:13:02Z

hrhktkbzyy
Aug 22, 2024

Description of the bug

When extracting text from the attached PDF, the output contains garbled characters. Additionally, when I tried copying & pasting the content from other PDF viewers, similar issues occurred.

I'm unsure if this is related to encoding settings or if there is a way to correct this behavior. Any guidance or potential fixes would be appreciated.

How to reproduce the bug

PDF is as attached

king arthur.pdf
The Phantom of the Opera.pdf

Code Sample:

import fitz

def get_text_from_pdf_by_pymupdf(file_path):
    try:
        text = ''
        pages = fitz.open(file_path.absolute())
        number_of_pages = len(pages)
        for page_obj in pages:
            text_add = page_obj.get_text()
            if text_add:
                text += text_add
        return text, number_of_pages

    except Exception as e:
        print(e)
        return None, None

Output:

The extracted content includes cid values such as:

7KH�GDQFHUV

4XLFN�� 4XLFN�� &ORVH� WKH� GRRU�� ,W
V� KLP�
� $QQLH� 6RUHOOL� UDQ� LQWR� WKH
GUHVVLQJ�URRP��KHU�IDFH�ZKLWH�
2QH� RI� WKH� JLUOV� UDQ� DQG� FORVHG� WKH� GRRU�� DQG� WKHQ� WKH\� DOO� WXUQHG� WR
$QQLH�6RUHOOL�

PyMuPDF version

1.24.9

Operating system

MacOS

Python version

3.12

JorjMcKie · 2024-08-22T13:04:15Z

JorjMcKie
Aug 22, 2024
Maintainer

This is not a bug; it goes back to properties / deficiencies of the font(s) used.
If a glyph contains no back-reference to the Unicode that originated it, then there is no way to determine the Unicode.
This is what is happening in every case where a � appears.

In addition, PyMuPDF's default extraction flags use the glyph number instead of the Unicode when the Unicode's value is 0xFFFD (which delivers that �). So you can try the extraction using flags=0 and see what happens instead.

But as you report: when other extractors also deliver garbage, then we are simply out of luck!
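
To make the flags=0 suggestion concrete, here is a minimal sketch that extracts each page with the default flags or with an explicit flags value, and measures how much of the result is U+FFFD. The helper names (extract_page_texts, replacement_ratio) are made up for this illustration and are not part of PyMuPDF; only fitz.open() and page.get_text(..., flags=...) are library calls.

```python
REPLACEMENT = "\ufffd"  # the character shown as a question-mark box


def extract_page_texts(path, flags=None):
    """Return one text string per page; flags=None keeps PyMuPDF's defaults."""
    # Imported here so replacement_ratio() below can be used without PyMuPDF.
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        if flags is None:
            return [page.get_text("text") for page in doc]
        return [page.get_text("text", flags=flags) for page in doc]


def replacement_ratio(text):
    """Fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT for ch in visible) / len(visible)
```

Calling extract_page_texts(path) and extract_page_texts(path, flags=0) on the same file and comparing replacement_ratio() of the two results shows how many characters have no usable Unicode. If the ratio is high with flags=0, the font most likely carries no usable mapping, and changing flags will not recover the text.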

1 reply

JorjMcKie Aug 22, 2024
Maintainer

Converted this to a Discussions item, as we clearly have no bug.

JorjMcKie · 2024-08-22T13:08:36Z

JorjMcKie
Aug 22, 2024
Maintainer

The only way out I see is using OCR ...

1 reply

ASTimch Aug 22, 2024

The only way out I see is using OCR ...

One question: suppose I know that the font's internal glyphs carry a wrong Unicode mapping, and that fixing it requires adding some shift to certain glyph codes. Can I somehow change/fix the font's glyph mapping using PyMuPDF?
Thank you.
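
As an illustration of the kind of shift meant here: in the sample output posted at the top of this thread, the legible characters appear to sit 29 code points below the intended ones ("7KH" reads as "The" once 29 is added to each character). The sketch below applies such a shift to already extracted text as a post-processing step. The offset of 29 is inferred from that one sample only, and the function name is hypothetical; it does not touch the font or the PDF, and characters already delivered as U+FFFD cannot be recovered this way.

```python
REPLACEMENT = "\ufffd"


def shift_decode(text, offset=29):
    """Add a constant offset to each character code of extracted text.

    Whitespace and U+FFFD are passed through unchanged: whitespace is
    inserted by the extractor, and U+FFFD carries no original code.
    The default offset of 29 is an assumption taken from the sample
    output in this thread and may not hold for other fonts or glyphs.
    """
    out = []
    for ch in text:
        if ch == REPLACEMENT or ch.isspace():
            out.append(ch)
        else:
            out.append(chr(ord(ch) + offset))
    return "".join(out)


if __name__ == "__main__":
    print(shift_decode("$QQLH 6RUHOOL"))
```

Spaces and punctuation in this document come out as U+FFFD, so they stay unknown after the shift; OCR remains the fallback for those.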

JorjMcKie · 2024-08-22T15:38:47Z

JorjMcKie
Aug 22, 2024
Maintainer

All data in a PDF, including object definitions and binary content, is available for low-level access and update.
Stream objects (the ones with potentially binary content) can be extracted and updated via oldcontent = doc.xref_stream(xref) / doc.update_stream(xref, newcontent).
Depending on the font, there may exist a CMAP (character map), available as object property /ToUnicode. This is a stream object which you could update as desired.
You are on your own here, though, as to what this entails. You must read the PDF spec before you try this.
When done right, you are in control of which glyph number back-translates into which Unicode.
There is no recipe in PyMuPDF helping you here.
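
For orientation only, the low-level access described above might start with a sketch like the following, which lists the fonts used in a document and reads each font's /ToUnicode stream where one exists. The function names are hypothetical, and the code assumes /ToUnicode is stored as an indirect reference in the font object, which is the usual case. It only inspects; it does not attempt to build a corrected CMap.

```python
def parse_indirect_ref(value):
    """Turn a PDF indirect reference such as '12 0 R' into its xref number."""
    parts = value.split()
    if len(parts) == 3 and parts[2] == "R" and parts[0].isdigit():
        return int(parts[0])
    return None


def dump_tounicode_cmaps(path):
    """Map each font xref to its decoded /ToUnicode CMap text, or None."""
    # Imported here so parse_indirect_ref() can be used without PyMuPDF.
    import fitz  # PyMuPDF

    cmaps = {}
    with fitz.open(path) as doc:
        for page in doc:
            for font in page.get_fonts():
                font_xref = font[0]  # first tuple item is the font's xref
                if font_xref in cmaps:
                    continue
                kind, value = doc.xref_get_key(font_xref, "ToUnicode")
                cmap_xref = parse_indirect_ref(value) if kind == "xref" else None
                stream = doc.xref_stream(cmap_xref) if cmap_xref else None
                cmaps[font_xref] = stream.decode("latin-1") if stream else None
    return cmaps
```

A font that maps to None here has no /ToUnicode entry at all, so there is nothing to patch in place. Where a CMap does exist, an edited version could be written back with doc.update_stream(cmap_xref, new_bytes) and the document saved under a new name, but what the new CMap must contain is defined by the PDF specification, not by PyMuPDF.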

0 replies
