Some missing spaces in get_text output #2440

henrygriffiths · 2023-05-31T15:57:50Z

Describe the bug (mandatory)

Output from .get_text is missing some random spaces between words on the same line of the text in the PDF.

To Reproduce (mandatory)

import fitz
doc = fitz.open('file.pdf')
for page in doc:
  for block in page.get_text("dict", flags=31)["blocks"]:
    print(block)

Expected behavior (optional)

The extracted text should contain all the spaces that the PDF does.
For example, "The quick brown fox jumps over the lazy dog"
is instead output as "Thequick brown fox jumps overthe lazy dog" (spaces are dropped within a single line of the PDF).

Screenshots (optional)

N/A

Your configuration (mandatory)

Operating system, potentially version and bitness : Linux 6.3.3-arch1-1 x86_64
Python version, bitness : Python 3.10.11 (main, May 25 2023, 13:44:59) [GCC 13.1.1 20230429] x86_64
PyMuPDF version, installation method (wheel or generated from source) : 1.22.3 installed from wheel (using pip 23.1.2, setuptools 67.8.0 and wheel 0.40.0)

Additional context (optional)

I have reviewed the bug reports in #456 and #364 and tested using mutool as recommended. Using mutool 1.22.0 (the version used by PyMuPDF 1.22.3), the output for the PDF (using mutool draw -o test.html file.pdf 1) contains all of the spaces.
I am unsure whether this is a duplicate of #2400, as I don't have enough information to determine whether it is the same issue (an empty gap between those spaces), and I apologize if it is.

JorjMcKie (Maintainer) · 2023-06-01T07:34:20Z

This is a "Discussions" item, so I am transferring it.

0 replies

JorjMcKie (Maintainer) · 2023-06-01T07:37:55Z

Please provide an example document page and the Python code snippet.

3 replies

henrygriffiths Jun 3, 2023
Author

PDF:
https://www.st.com/content/ccc/resource/corporate/company/company_presentation/8d/fc/ba/0b/41/0d/47/12/company_presentation.pdf/files/company_presentation.pdf/jcr:content/translations/en.company_presentation.pdf

Code Snippet:

import fitz

doc = fitz.open('en.company_presentation.pdf')
for page in doc:
    # Plain-text extraction of each page with flags=31
    text = page.get_text("text", flags=31)
    print(text)

Examples:
Page 2:
Over 50,000 employees turns into Over50,000employees
Over 80 sales & marketing turns into Over 80sales & marketing

JorjMcKie Jun 4, 2023
Maintainer

Why are you using the flags value 31? Its bit decomposition is '0b11111', which, among other things, suppresses the corrective MuPDF action that inserts spaces where deemed beneficial.
In other words, you are setting fitz.TEXT_INHIBIT_SPACES.

Here is what I get as a result:

In [1]: import fitz

In [2]: doc=fitz.open("en.company_presentation.pdf")

In [3]: page=doc[1]

In [4]: print(page.get_text(sort=True))  # sort option internally causes "blocks" extraction
We are creators and makers of technology
One of the world’s largest semiconductor companies
$16.1 billion revenues
in 2022
Over 50,000 employees
of which 9,000+ in R&D
14 main manufacturing
sites
Over 80 sales & marketing
offices serving over 200,000
customers across the globe
Signatory of the United Nations Global Compact (UNGC)
Member of the Responsible Business Alliance (RBA)
As of December 31, 2022
2


In [5]: print(page.get_text(sort=True,flags=31))
We are creators and makers of technology
One of the world’s largest semiconductor companies
$16.1billion revenues
in 2022
Over50,000employees
of which 9,000+ in R&D
14main manufacturing
sites
Over 80sales & marketing
offices serving over 200,000
customers across the globe
Signatory of the United Nations Global Compact (UNGC)
Member of the Responsible Business Alliance (RBA)
As of December 31, 2022
2
<image: DeviceRGB, width: 1360, height: 841, bpc: 8>

In [6]: print(fitz.__doc__)

PyMuPDF 1.22.3: Python bindings for the MuPDF 1.22.0 library.
Version date: 2023-05-10 00:00:01.
Built for Python 3.10 on linux (64-bit).

In [7]:
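
The bit decomposition mentioned above can be checked with a small sketch like the following. The flag values are hard-coded so that it runs without PyMuPDF installed; they are assumed to match the `fitz.TEXT_*` constants of the same names, and `TEXT_FLAGS` and `decode_flags` are hypothetical names introduced only for this illustration. Verify the values against your installed version before relying on them.

```python
# Assumed to mirror fitz.TEXT_PRESERVE_LIGATURES, fitz.TEXT_PRESERVE_WHITESPACE,
# etc.; check against your PyMuPDF version.
TEXT_FLAGS = {
    "TEXT_PRESERVE_LIGATURES": 1,
    "TEXT_PRESERVE_WHITESPACE": 2,
    "TEXT_PRESERVE_IMAGES": 4,
    "TEXT_INHIBIT_SPACES": 8,
    "TEXT_DEHYPHENATE": 16,
}


def decode_flags(flags: int) -> list[str]:
    """Return the names of all known flag bits that are set in `flags`."""
    return [name for name, bit in TEXT_FLAGS.items() if flags & bit]


# 31 sets every bit listed above, including TEXT_INHIBIT_SPACES.
print(bin(31), decode_flags(31))

# Clearing only the TEXT_INHIBIT_SPACES bit gives 23.
without_inhibit = 31 & ~TEXT_FLAGS["TEXT_INHIBIT_SPACES"]
print(without_inhibit, bin(without_inhibit), decode_flags(without_inhibit))
```

Clearing a single bit this way keeps the other extraction options unchanged, which is why 23 comes up as the replacement value later in the thread.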

Answer selected by henrygriffiths

henrygriffiths Jun 4, 2023
Author

I had set that flag because some PDFs produced text output with multiple spaces between words, and I didn't make the connection between the two.
Setting flags = 23 fixes the missing spaces, and a simple .replace('  ', ' ') solves the duplicate space issue. Thank you for your help!
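
As a rough sketch of the workaround described above: a single `.replace('  ', ' ')` pass only halves longer runs of spaces, so a regular expression may be more robust when a PDF produces three or more spaces in a row. The helper names `collapse_spaces` and `extract_clean` are hypothetical, and `page` is assumed to be a PyMuPDF page object obtained from `fitz.open(...)`.

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse each run of two or more spaces into one; newlines are kept."""
    return re.sub(r" {2,}", " ", text)


def extract_clean(page, flags: int = 23) -> str:
    """Extract plain text from a page, then collapse repeated spaces.

    flags=23 is the value from this thread (31 without TEXT_INHIBIT_SPACES).
    """
    return collapse_spaces(page.get_text("text", flags=flags))


print(collapse_spaces("Over   50,000  employees"))
```

With a document opened as in the snippets above, this would be called as `extract_clean(doc[1])`.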
