Description
Describe the bug (mandatory)
The PDF was downloaded from EDGAR. The page.get_text() method appears to treat tabs as line feeds, which inserts line breaks where spaces are expected, for example between a currency symbol and its amount.
To Reproduce (mandatory)
import fitz  # PyMuPDF

f = fitz.open(pdf_path)  # pdf_path: path to the attached PDF
for page in f:
    page_text = page.get_text()
The returned text has many extra '\n's.
pypdf reads the doc correctly.
Expected behavior (optional)
I expected to see spaces instead of \n
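As a possible workaround (a minimal sketch, not verified against the attached file), the words returned by page.get_text("words") could be regrouped into visual lines by their vertical position and joined with spaces. The helper names `join_words_by_line` and `page_text_by_line` and the tolerance `tol` are hypothetical and introduced here only for illustration; the tuple layout is the one PyMuPDF documents for the "words" option.

```python
def join_words_by_line(words, tol=3.0):
    """Rebuild text lines from word tuples, joining words with spaces.

    `words` uses the layout of page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    `tol` (in points) is an assumed tolerance for "same baseline";
    it may need tuning for a given document.
    """
    lines = []  # each entry: [reference_bottom_y, [(x0, text), ...]]
    # Walk the words top to bottom, then left to right.
    for x0, y0, x1, y1, text, *_ in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(lines[-1][0] - y1) <= tol:
            lines[-1][1].append((x0, text))  # same visual line
        else:
            lines.append([y1, [(x0, text)]])  # start a new visual line
    return "\n".join(
        " ".join(t for _, t in sorted(items)) for _, items in lines
    )


def page_text_by_line(pdf_path):
    """Return one regrouped text string per page of the PDF."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    with fitz.open(pdf_path) as doc:
        return [join_words_by_line(page.get_text("words")) for page in doc]
```

With this grouping, a currency symbol and an amount that share a baseline would end up on one line separated by a space, regardless of how the extractor split them into lines or blocks.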
Screenshots (optional)
Your configuration (mandatory)
3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
win32
PyMuPDF 1.23.4: Python bindings for the MuPDF 1.23.2 library.
Version date: 2023-09-26 00:00:01.
Built for Python 3.10 on win32 (64-bit).
Additional context (optional)
sonos_q2_2023_10q.pdf