Closed
Labels
not a bug / user error / unable to reproduce
Description of the bug
In some PDFs, the text of certain spans is extracted with the whitespace between words collapsed. Fewer than 1% of the spans are affected, but the problem is very noticeable where it occurs.
From searching online, I see that this is typically caused by specifying TEXT_INHIBIT_SPACES. However, I am not specifying any flags. I did try adding TEXT_PRESERVE_LIGATURES, TEXT_PRESERVE_WHITESPACE, and TEXT_PRESERVE_SPANS to the get_text call, but none of these had any effect.
How to reproduce the bug
!pip install PyMuPDF

import fitz  # PyMuPDF
import requests

def download_file(url, filename):
    """Downloads a file from a given URL and saves it locally."""
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Raise an exception for bad status codes
        with open(filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
        print(f"File downloaded successfully as '{filename}'.")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading file: {e}")

url = "https://ia904601.us.archive.org/24/items/asme-y-14.5-2018-dimensioning-and-tolerancing/ASME-Y14.5-2018-Dimensioning-and-Tolerancing.pdf"
filename = "sample.pdf"
download_file(url, filename)

pdf_doc = fitz.open(filename)
pdf_page = pdf_doc.load_page(15)
page_dict = pdf_page.get_text('dict')
bad_block = page_dict['blocks'][1]
bad_line = bad_block['lines'][4]
bad_span = bad_line['spans'][0]
print(bad_span['text'])
formerlyinSection1hasbeenreorganizedinto
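To get a rough count of how many spans on a page are affected, a heuristic scan over the `dict` output might help. The sketch below is only an illustration: `iter_spans`, `find_collapsed_spans`, and the `min_len` threshold are hypothetical names and values, not part of PyMuPDF, and the sample data is hand-built in the same shape as `page.get_text('dict')` so that it runs without the PDF.

```python
def iter_spans(page_dict):
    """Yield (block_idx, line_idx, span_idx, span) for every text span."""
    for b, block in enumerate(page_dict.get("blocks", [])):
        # Image blocks have no "lines" key, so fall back to an empty list.
        for l, line in enumerate(block.get("lines", [])):
            for s, span in enumerate(line.get("spans", [])):
                yield b, l, s, span


def find_collapsed_spans(page_dict, min_len=25):
    """Return spans whose text is long but contains no whitespace at all.

    min_len is an arbitrary threshold chosen for this sketch (assumption);
    long URLs or identifiers will show up as false positives.
    """
    hits = []
    for b, l, s, span in iter_spans(page_dict):
        text = span.get("text", "").strip()
        if len(text) >= min_len and not any(ch.isspace() for ch in text):
            hits.append((b, l, s, text))
    return hits


# Hand-built stand-in for page.get_text('dict'); not real extraction output.
sample = {
    "blocks": [
        {"type": 1},  # image block without "lines"
        {
            "type": 0,
            "lines": [
                {"spans": [{"text": "formerlyinSection1hasbeenreorganizedinto"}]},
                {"spans": [{"text": "has been reorganized into"}]},
            ],
        },
    ]
}

for b, l, s, text in find_collapsed_spans(sample):
    print(f"block {b}, line {l}, span {s}: {text}")
```

On the real document, passing `page_dict` from the reproduction script in place of `sample` would list the candidate spans together with their block, line, and span indices.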
PyMuPDF version
1.24.13
Operating system
Linux
Python version
3.10