PyMuPDF collapsing whitespace text in some spans (no flags) #4009

@GaryGen

Description of the bug

I am encountering cases where the text of some spans in some PDFs is collapsed, with all whitespace between words removed. Fewer than 1% of the spans are affected, but the problem is very noticeable wherever it occurs.
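To put a number on how many spans are affected, a rough scan along these lines might help. This is only a sketch: `find_collapsed_spans` and its `min_len` threshold are hypothetical names, and the heuristic (a long span with no whitespace at all) will also flag legitimate long tokens such as URLs. It works on the plain dictionary returned by `page.get_text('dict')`, so it does not itself depend on PyMuPDF.

```python
def find_collapsed_spans(page_dict, min_len=20):
    """Return (block_idx, line_idx, span_idx, text) for suspicious spans.

    Hypothetical helper: a span counts as suspicious when its text has at
    least min_len characters and contains no whitespace at all.
    """
    hits = []
    for b, block in enumerate(page_dict.get("blocks", [])):
        # Image blocks carry no "lines" key, so fall back to an empty list.
        for l, line in enumerate(block.get("lines", [])):
            for s, span in enumerate(line.get("spans", [])):
                text = span.get("text", "")
                if len(text) >= min_len and not any(ch.isspace() for ch in text):
                    hits.append((b, l, s, text))
    return hits


# Possible usage with the reproduction further down (assumes pdf_page exists):
# for hit in find_collapsed_spans(pdf_page.get_text("dict")):
#     print(hit)
```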

From searching for this issue, I see that it is typically related to people specifying TEXT_INHIBIT_SPACES. However, I am not specifying any flags. I did try adding TEXT_PRESERVE_LIGATURES, TEXT_PRESERVE_WHITESPACE, and TEXT_PRESERVE_SPANS to the get_text call, but none of these had any effect.
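For reference, a minimal sketch of how such flags can be combined into one bitmask for the `flags` argument of `get_text`. The helper name `combine_flags` is hypothetical. The sketch assumes, per my reading of the PyMuPDF documentation, that a value passed as `flags` replaces the method's default bitmask rather than being added to it, which is why it starts from `fitz.TEXTFLAGS_DICT`.

```python
def combine_flags(base, add=(), remove=()):
    """Return base with every flag in `add` set and every flag in `remove` cleared.

    Hypothetical helper; the flags are plain integers used as bit fields.
    """
    flags = base
    for flag in add:
        flags |= flag   # switch the bit on
    for flag in remove:
        flags &= ~flag  # switch the bit off
    return flags


# Possible usage (assumes PyMuPDF is installed and pdf_page exists):
# import fitz
# flags = combine_flags(
#     fitz.TEXTFLAGS_DICT,                    # defaults for the "dict" output
#     add=[fitz.TEXT_PRESERVE_WHITESPACE],
#     remove=[fitz.TEXT_INHIBIT_SPACES],      # make sure this bit is not set
# )
# page_dict = pdf_page.get_text("dict", flags=flags)
```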

How to reproduce the bug

Shared Colab notebook

!pip install PyMuPDF
import fitz  # PyMuPDF
import requests

def download_file(url, filename):
  """Downloads a file from a given URL and saves it locally."""
  try:
    response = requests.get(url, stream=True)
    response.raise_for_status()  # Raise an exception for bad status codes

    with open(filename, 'wb') as file:
      for chunk in response.iter_content(chunk_size=8192):
        file.write(chunk)

    print(f"File downloaded successfully as '{filename}'.")

  except requests.exceptions.RequestException as e:
    print(f"Error downloading file: {e}")

url = "https://ia904601.us.archive.org/24/items/asme-y-14.5-2018-dimensioning-and-tolerancing/ASME-Y14.5-2018-Dimensioning-and-Tolerancing.pdf"

filename = "sample.pdf"
download_file(url, filename)

pdf_doc = fitz.open(filename)
pdf_page = pdf_doc.load_page(15)
page_dict = pdf_page.get_text('dict')
bad_block = page_dict['blocks'][1]
bad_line = bad_block['lines'][4]
bad_span = bad_line['spans'][0]
print(bad_span['text'])

This prints:

formerlyinSection1hasbeenreorganizedinto

The expected text is "formerly in Section 1 has been reorganized into".
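One way to dig further is to look at per-character positions from `get_text('rawdict')`, where each span carries a `chars` list with a `c` (character) and a `bbox` entry. The sketch below rebuilds a span's text and inserts a space wherever two neighbouring glyphs are separated by a noticeable horizontal gap. It is a diagnostic workaround under stated assumptions, not a statement about the root cause: `rebuild_span_text` and the `gap_ratio` threshold are hypothetical, the text is assumed to run horizontally left to right, and the approach only helps if the PDF actually places the words apart.

```python
def rebuild_span_text(chars, gap_ratio=0.25):
    """Rebuild span text, adding a space where adjacent glyphs are far apart.

    chars: list of dicts shaped like the "chars" entries of a "rawdict"
    span, each with "c" (the character) and "bbox" (x0, y0, x1, y1).
    gap_ratio: hypothetical threshold; a horizontal gap wider than this
    fraction of the glyph height is treated as a word break.
    """
    pieces = []
    prev_x1 = None
    for ch in chars:
        x0, y0, x1, y1 = ch["bbox"]
        if prev_x1 is not None:
            gap = x0 - prev_x1
            height = y1 - y0
            # Do not double up when a real space is already on either side.
            already_spaced = ch["c"].isspace() or pieces[-1].isspace()
            if gap > gap_ratio * height and not already_spaced:
                pieces.append(" ")
        pieces.append(ch["c"])
        prev_x1 = x1
    return "".join(pieces)


# Possible usage (assumes pdf_page exists and that block/line/span indices
# in "rawdict" match those seen in "dict"):
# raw = pdf_page.get_text("rawdict")
# span = raw["blocks"][1]["lines"][4]["spans"][0]
# print(rebuild_span_text(span["chars"]))
```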

PyMuPDF version

1.24.13

Operating system

Linux

Python version

3.10

Labels: not a bug (not a bug / user error / unable to reproduce)
