tabs returned as linefeeds by page.get_text() #2727

@Jacek11

Describe the bug (mandatory)

The PDF was downloaded from EDGAR. The page.get_text() method appears to treat tabs as line feeds, which, for example, puts a line feed between a currency symbol and its amount.

To Reproduce (mandatory)

import fitz  # PyMuPDF

f = fitz.open(pdf_path)
for page in f:
    page_text = page.get_text()

The returned text has many extra '\n's.
pypdf reads the doc correctly.

Expected behavior (optional)

I expected to see spaces instead of '\n' characters.
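As a possible workaround (not a fix from the maintainers), the lines could be rebuilt from word positions instead of relying on the plain-text output. The sketch below assumes word tuples in the shape that `page.get_text("words")` returns, `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The helper name `words_to_lines` and the `y_tol` tolerance are hypothetical, chosen only for illustration, and the tolerance would likely need tuning per document.

```python
def words_to_lines(words, y_tol=3.0):
    """Group word tuples into visual lines and join each line with spaces.

    `words` is assumed to hold tuples shaped like the output of
    page.get_text("words"): (x0, y0, x1, y1, word, block_no, line_no, word_no).
    `y_tol` (hypothetical parameter) is the maximum difference in bottom
    coordinate for two words to count as being on the same visual line.
    """
    lines = []  # each entry: (reference bottom y, [(x0, word), ...])
    # Walk the words top to bottom, then left to right.
    for x0, y0, x1, y1, text, *_ in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(y1 - lines[-1][0]) <= y_tol:
            # Same visual line as the previous word: append to it.
            lines[-1][1].append((x0, text))
        else:
            # Bottom coordinate moved too far: start a new line.
            lines.append((y1, [(x0, text)]))
    # Within each line, order by x0 and separate words with single spaces.
    return "\n".join(
        " ".join(text for _, text in sorted(items)) for _, items in lines
    )


# Hand-made sample: "$" and "1,234" share a baseline but sit far apart.
sample = [
    (72.0, 100.0, 80.0, 112.0, "$", 0, 0, 0),
    (150.0, 100.0, 190.0, 112.0, "1,234", 1, 0, 0),
    (72.0, 120.0, 110.0, 132.0, "Total", 2, 0, 0),
]
print(words_to_lines(sample))
```

With PyMuPDF installed, the helper would be called per page as `words_to_lines(page.get_text("words"))`. Grouping by bottom coordinate keeps a currency symbol and its amount on one line even when they sit in separate text blocks, at the cost of losing any column structure.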

Your configuration (mandatory)

3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
win32

PyMuPDF 1.23.4: Python bindings for the MuPDF 1.23.2 library.
Version date: 2023-09-26 00:00:01.
Built for Python 3.10 on win32 (64-bit).

Additional context (optional)

sonos_q2_2023_10q.pdf

    Labels

    wontfix: no intention to resolve