How to get bytes instead of str inside span? #2834

ivanstepanovftw · 2023-11-22T10:43:04Z

I need to get the raw bytes inside a span. span["text"] is of type str, and span["chars"][N]["c"] is of type str too. What should I do to get the raw bytes?

import fitz  # PyMuPDF

INVALID_UNICODE = chr(0xFFFD)  # the "Invalid Unicode" character
doc = fitz.open("pdfs/arabic.pdf")

for page in doc:
    # flags=0 turns off all optional text extraction flags
    blocks = page.get_text("dict", flags=0)["blocks"]
    for block in blocks:
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"]
                if INVALID_UNICODE in text:
                    # parse error: the span contains U+FFFD
                    print(page.number, repr(text))

PDF: http://tug.ctan.org/macros/latex/exptl/mem/arabic.pdf

This script is from https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/OCR/tesseract2.py. What I want to do is extract the bytes of each span character, search for the matching glyph in the current font (a .cff file), and then render that glyph for OCR.
FontForge screenshot:

JorjMcKie · 2023-11-22T11:11:46Z

Maintainer

This is not an issue, but a Discussions post. Transferring ...

6 replies

ivanstepanovftw Nov 22, 2023
Author

I am asking about bytes, not str. Since the library always uses str, I am getting \xEF\xBF\xBD, which is the UTF-8 encoding of the Unicode character U+FFFD (replacement character).

Does the PDF contain this string?
Or is it a missing PyMuPDF feature for extracting raw bytes?
Or is it a bug? I see a relevant issue, #365.
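
To illustrate where those three bytes come from, here is a minimal sketch using only the Python standard library. It shows that encoding a str holding U+FFFD as UTF-8 reproduces exactly that sequence; the variable names are illustrative only.

```python
# U+FFFD is the Unicode replacement character.
REPLACEMENT = "\ufffd"

# Encoding the one-character str as UTF-8 yields three bytes.
encoded = REPLACEMENT.encode("utf-8")
print(encoded)  # b'\xef\xbf\xbd'

# Decoding those bytes gives the same single character back.
assert encoded.decode("utf-8") == REPLACEMENT
```

So seeing \xEF\xBF\xBD after encoding the extracted text only says that the str already contained U+FFFD. It says nothing about which bytes the PDF itself holds at that position.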

ivanstepanovftw Nov 22, 2023
Author

The PDF I am referring to does not contain the replacement character.

JorjMcKie Nov 22, 2023
Maintainer

Ah, ok. The replacement character 0xfffd is generated. This happens, for example, when surrogate Unicodes have been used illegally, or when glyph numbers have been used for which the font has no back-translation to Unicode numbers.
When this happens, there is no way to access the original bytes in the PDF.
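
Although the original bytes are not exposed, the affected characters can at least be located. The following is a minimal sketch, assuming the "rawdict" layout returned by page.get_text("rawdict"), in which each span carries a "font" name and a "chars" list whose entries have "c" and "bbox" keys. The helper name find_replaced_chars and the sample data are hypothetical and not part of PyMuPDF.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the replacement character


def find_replaced_chars(page_dict):
    """Return (font, bbox) for every character extracted as U+FFFD.

    page_dict is assumed to have the structure of page.get_text("rawdict").
    """
    hits = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                for char in span["chars"]:
                    if char["c"] == REPLACEMENT:
                        hits.append((span["font"], tuple(char["bbox"])))
    return hits


# Hand-built stand-in for a real page, for illustration only.
sample = {
    "blocks": [
        {
            "lines": [
                {
                    "spans": [
                        {
                            "font": "ExampleFont",
                            "chars": [
                                {"c": "a", "bbox": (0, 0, 5, 10)},
                                {"c": REPLACEMENT, "bbox": (5, 0, 10, 10)},
                            ],
                        }
                    ]
                }
            ]
        }
    ]
}
print(find_replaced_chars(sample))
```

With a real document, the argument would be page.get_text("rawdict", flags=0). Each returned bbox could then serve as a clip rectangle when rendering that region of the page for OCR.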

ivanstepanovftw Nov 22, 2023
Author

I see that the PDF does indeed contain raw bytes for the complex script.

And they are not the replacement character:

Added issue #2835

ivanstepanovftw Nov 22, 2023
Author

Even Google Chrome can do this, although as text and possibly not as raw bytes.
