tabs returned as linefeeds by page.get_text() #2728

Jacek11 · 2023-10-09T20:29:05Z

Jacek11
Oct 9, 2023

Describe the bug (mandatory)

The PDF was downloaded from EDGAR. The page.get_text() method treats tabs as line feeds, which puts a line feed between the currency symbol and the amount, for example.

To Reproduce (mandatory)

import fitz  # PyMuPDF

f = fitz.open(pdf_path)  # pdf_path: path to the downloaded PDF
for page in f:
    page_text = page.get_text()  # plain text extraction

The returned text has many extra '\n's.
pypdf reads the doc correctly.

Expected behavior (optional)

I expected to see spaces instead of \n

Your configuration (mandatory)

3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
win32

PyMuPDF 1.23.4: Python bindings for the MuPDF 1.23.2 library.
Version date: 2023-09-26 00:00:01.
Built for Python 3.10 on win32 (64-bit).

Additional context (optional)

sonos_q2_2023_10q.pdf

JorjMcKie · 2023-10-09T20:34:02Z

JorjMcKie
Oct 9, 2023
Maintainer

The document has 60 pages - please pick an example page.

Jacek11 · 2023-10-09T20:36:28Z

Jacek11
Oct 9, 2023
Author

Happens with any page with tables. The 29th page f[28] for example.

JorjMcKie · 2023-10-10T00:08:14Z

JorjMcKie
Oct 10, 2023
Maintainer

As a matter of fact, the page contains no tab characters (\t) at all, and no line breaks (\n) either. All of these are generated by the respective extractor software.
For example, between the "$" and the amount there is nothing: no tab and no line break.

MuPDF works as designed. Its algorithm that generates the block -> line -> span hierarchy is based on various influencing factors like font size and inter-character distances. A decision is made as to whether text pieces ("spans") belong to the same line or not.
If a distance is larger than some threshold, a new line is started, regardless of whether the bottom coordinate has also changed. In this process, tabulator characters \t are never generated.
This is a design decision.
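
To make the effect of such a distance threshold concrete, here is a toy sketch. It is not MuPDF's actual algorithm: the function name `split_by_gap`, the `(x0, x1, text)` tuple layout, the coordinates and the threshold values are all assumptions chosen for illustration only.

```python
# Toy illustration only: NOT MuPDF's real line-building algorithm.
# Each span is (x0, x1, text) and all spans share the same baseline.

def split_by_gap(spans, max_gap):
    """Start a new line whenever the gap to the previous span exceeds max_gap."""
    lines = []
    current = []
    prev_x1 = None
    for x0, x1, text in sorted(spans):
        if prev_x1 is not None and x0 - prev_x1 > max_gap:
            lines.append(" ".join(current))  # gap too wide: close the line
            current = []
        current.append(text)
        prev_x1 = x1
    if current:
        lines.append(" ".join(current))
    return lines


# A made-up table row: label, "$" far to the right, amount further right.
row = [(50, 120, "Sonos speakers"), (300, 306, "$"), (340, 380, "780,377")]
print(split_by_gap(row, max_gap=20))   # small threshold: cells are split apart
print(split_by_gap(row, max_gap=500))  # large threshold: cells stay together
```

With wide table columns, a small threshold separates the cells even though they share a baseline, which matches the behaviour reported above.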

As you noticed, what irritates you happens when tables are present.
Extracting plain (naive) text from a page that contains tables does not make much sense in most cases anyway. An adequate solution would be to identify and extract tables separately from other text.
PyMuPDF allows you to do that.
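
A hedged sketch of that approach: in recent PyMuPDF releases, `page.find_tables()` returns the detected tables, and each table's `extract()` yields rows of cell strings. The helper `rows_to_tsv` below is a hypothetical name, not part of PyMuPDF; it joins the cells with real tab characters. The PyMuPDF calls are shown as comments because they need the PDF at hand, and the sample rows are made up.

```python
def rows_to_tsv(rows):
    """Join table rows (lists of cell strings; empty cells may be None) into tab-separated lines."""
    lines = []
    for row in rows:
        # Replace missing cells by "" and flatten line breaks inside a cell.
        cells = ["" if cell is None else str(cell).replace("\n", " ") for cell in row]
        lines.append("\t".join(cells))
    return "\n".join(lines)


# Intended use with PyMuPDF (assumes the file from the report is available locally):
#   import fitz
#   page = fitz.open("sonos_q2_2023_10q.pdf")[28]
#   for table in page.find_tables().tables:
#       print(rows_to_tsv(table.extract()))

sample = [["Sonos speakers", "$", "780,377"], ["Sonos system products", None, "158,525"]]
print(rows_to_tsv(sample))
```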

You can also extract the page's word strings, sort them accordingly and then re-arrange them line-wise. Something like this:

words = page.get_text("words", sort=True)
y = words[0][3]  # bottom of first word
text = words[0][4]  # text of first word
for w in words[1:]:  # walk through remaining words
    if w[3] == y:  # if same line
        text += " " + w[4]  # append word text
    else:  # line has changed
        print(text)  # print the finished line
        text = w[4]  # start a new line with current word
        y = w[3]  # and its bottom y
print(text)  # remaining word

The first few lines that this snippet will deliver look like so:

Table of contents
Comparison of the six months ended April 1, 2023, and April 2, 2022
Six Months Ended Change
April 1, April 2,
2023 2022 $ %
(In thousands, except percentages)
Sonos speakers $ 780,377 $ 819,620 $ (39,243 ) (4.8 )%
Sonos system products 158,525 195,965 (37,440 ) (19.1 )
...

AlkisPis Oct 29, 2025

I have tried this code with a PDF file of mine that contains fields and values on the same line, which PyMuPDF extracts as two lines. I debugged the code and saw that the words of the values are indeed on different Y coordinates.
Now, my question is, how can 'pypdf' extract them on the same line? The question is rather rhetorical, and what I mean is that if 'pypdf' can do it, it means that either the data obtained for the words are incorrect or there must be a way to tell whether the words in this case must be on the same line or not.

For testing, it's the first page of https://hpc-forge.cineca.it/svn/RemoteGraph/branch/multivnc/PyInstaller/PyInstaller-3.0/doc/Manual.pdf. E.g. the 3rd line, Version: PyInstaller 3.0.dev8+f1a8933.mod, is extracted as
Version:
PyInstaller 3.0.dev8+f1a8933.mod
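
One possible way to decide whether words belong on the same line is to compare their bottom coordinates with a tolerance instead of testing for equality. The sketch below assumes the tuple layout returned by `page.get_text("words")`, i.e. `(x0, y0, x1, y1, text, block_no, line_no, word_no)`. The function name `group_words_into_lines` and the default tolerance of 3 points are illustrative assumptions, and the sample coordinates are made up rather than taken from the PDF.

```python
def group_words_into_lines(words, y_tol=3.0):
    """Group word tuples into text lines, treating bottoms within y_tol as the same line."""
    lines = []  # each entry: [reference_bottom_y, [word tuples]]
    for w in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(w[3] - lines[-1][0]) <= y_tol:
            lines[-1][1].append(w)  # close enough: same line
        else:
            lines.append([w[3], [w]])  # start a new line
    # Within each line, order words from left to right before joining.
    return [" ".join(w[4] for w in sorted(ws, key=lambda w: w[0])) for _, ws in lines]


# Made-up coordinates: the label's bottom is 1.2 points lower than its value's.
words = [
    (72.0, 100.0, 110.0, 112.2, "Version:", 0, 0, 0),
    (150.0, 99.0, 200.0, 111.0, "PyInstaller", 1, 0, 0),
    (204.0, 99.0, 300.0, 111.0, "3.0.dev8+f1a8933.mod", 1, 0, 1),
    (72.0, 120.0, 125.0, 132.2, "Homepage:", 2, 0, 0),
]
for line in group_words_into_lines(words):
    print(line)
```

The tolerance is a trade-off: too small and slightly offset words are split again, too large and neighbouring lines are merged.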

JorjMcKie Oct 29, 2025
Maintainer

Now, my question is, how can 'pypdf' extract them on the same line?

If you read the documentation, you will find this solution:

The code behind this consolidates text portions with only small y1 differences into one printed line.
In addition, it tries to translate horizontal / vertical position differences into appropriate spacing, ultimately approximating layout fidelity.
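
The idea of turning horizontal positions into spacing can be illustrated with a small sketch. This is not PyMuPDF's implementation: `layout_line`, the assumed average character width of 6 points, and the sample coordinates are all illustrative assumptions.

```python
def layout_line(words, char_width=6.0):
    """Place each (x0, text) word at text column x0 / char_width, padding with spaces."""
    line = ""
    for x0, text in sorted(words):
        col = int(x0 / char_width)
        if line and col <= len(line):
            col = len(line) + 1  # always keep at least one space between words
        line = line.ljust(col) + text
    return line


# Made-up x positions for three words on one line.
print(repr(layout_line([(0, "Version:"), (120, "PyInstaller"), (192, "3.0")])))
```

Words that are far apart on the page end up separated by several spaces, while words that are close together get a single space.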

AlkisPis Oct 29, 2025

Thanks, JorjMcKie. OK, fields and values are on the same line, but the overall result is a total mess. I certainly prefer having them on two lines (w/o 'sort=True'):

PyInstaller Manual
Version:
PyInstaller 3.0.dev8+f1a8933.mod    <-- This should be with the previous line!
Homepage:
http://www.pyinstaller.org
Contact:
pyinstaller@googlegroups.com
Authors:
David Cortesi
based on structure by Giovanni Bajo & William Caban
based on Gordon McMillan's manual
Copyright:
This document has been placed in the public domain.
PyInstaller Manual -
1

Compare the result with that of 'pypdf':

PyInstaller Manual
Version: PyInstaller 3.0.dev8+f1a8933.mod
Homepage: http://www.pyinstaller.org
Contact: pyinstaller@googlegroups.com
Authors: David Cortesi
based on structure by Giovanni Bajo & William Caban
based on Gordon McMillan's manual
Copyright: This document has been placed in the public domain.
PyInstaller Manual -
1

JorjMcKie Oct 29, 2025
Maintainer

pymupdf version?

AlkisPis Oct 29, 2025

1.26.4
Anyway, the result I get is exactly the one in your screenshot.

Jacek11 · 2023-10-10T00:23:02Z

Jacek11
Oct 10, 2023
Author

Thanks for the reply. I noticed that table detection was added in a recent release but haven't tried it out yet.

Is there another export type that would preserve the layout better?

JorjMcKie · 2023-10-10T00:27:20Z

JorjMcKie
Oct 10, 2023
Maintainer

Is there another export type that would preserve the layout better?

Did you see my code snippet at the end of my post? Might be a decent approximation.
But re-iterating: if tables are present, then don't try plain text extraction.

BTW, layout-preserving text extraction is also available when using "fitz as a module" from the command line.

Going to move this issue to the "Discussions" tab now.

Jacek11 · 2023-10-10T01:03:35Z

Jacek11
Oct 10, 2023
Author

HTML export also has the phantom line breaks. Maybe that's the expected behavior as well. I think I understand why linefeeds may be warranted in some cases even when the text is on the same y coordinate, for example with multi-column text.

When extra linefeeds are not inserted, some LLMs can accurately understand tables from plain text. With extra linefeeds, they're far worse at it.

Since the library can already detect tables, any chance that you could include an option in a future release to treat tables differently in the get_text() call?
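
Until such an option exists, one possible workaround is to separate the words that fall inside detected table areas from the remaining text and handle the two groups differently. The sketch below is an illustration under assumptions: `split_words_by_tables` is a hypothetical helper, and it assumes the table bounding boxes are available as `(x0, y0, x1, y1)` tuples, for example from the `bbox` attribute of the tables returned by `page.find_tables()`. The sample data is made up.

```python
def split_words_by_tables(words, table_bboxes):
    """Return (outside, inside): words whose center lies outside / inside any table bbox."""
    def in_bbox(word, bbox):
        cx = (word[0] + word[2]) / 2  # horizontal center of the word
        cy = (word[1] + word[3]) / 2  # vertical center of the word
        return bbox[0] <= cx <= bbox[2] and bbox[1] <= cy <= bbox[3]

    outside, inside = [], []
    for w in words:
        if any(in_bbox(w, b) for b in table_bboxes):
            inside.append(w)
        else:
            outside.append(w)
    return outside, inside


# Made-up data in the layout of page.get_text("words"): (x0, y0, x1, y1, text, ...)
words = [
    (72, 50, 140, 62, "Comparison", 0, 0, 0),
    (72, 200, 150, 212, "Sonos", 1, 0, 0),
    (300, 200, 306, 212, "$", 1, 1, 0),
]
outside, inside = split_words_by_tables(words, [(60, 190, 500, 400)])
print([w[4] for w in outside])
print([w[4] for w in inside])
```

The words outside the tables can then go through plain text extraction, while the table areas are extracted cell by cell.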

AlkisPis · 2025-11-01T23:01:11Z

AlkisPis
Nov 1, 2025

The only way to extract text correctly is the one JorjMcKie presented earlier, using words and lines, except that one must convert each word's bottom Y value to int, because the Ys differ slightly when the text is bold and/or italic, in which case one is left with an incomplete line.

Here is the code that worked perfectly for me. I use the same PDF page as earlier, so that one can compare the results.
I use the first page of https://hpc-forge.cineca.it/svn/RemoteGraph/branch/multivnc/PyInstaller/PyInstaller-3.0/doc/Manual.pdf, downloaded as 'test.pdf'.

```python
import pymupdf

doc = pymupdf.open("test.pdf")
page = doc[0]
words = page.get_text("words", sort=True)
last_y = -1
line = ""
for w in words:
    y = int(w[3])  # bottom y of the word, truncated to absorb small differences
    word = w[4]  # text of the word
    if y != last_y:  # the line has changed
        if line:
            print(line)  # print the finished line
        line = word
        last_y = y
    else:
        line += " " + word
if line:
    print(line)  # print the last line
```

Output:

```
PyInstaller Manual -
PyInstaller Manual
Version: PyInstaller 3.0.dev8+f1a8933.mod
Homepage: http://www.pyinstaller.org
Contact: pyinstaller@googlegroups.com
Authors: David Cortesi
based on structure by Giovanni Bajo & William Caban
based on Gordon McMillan's manual
Copyright: This document has been placed in the public domain.
1
```

AlkisPis · 2025-11-04T22:50:50Z

AlkisPis
Nov 4, 2025

Hi!
I found the best way to extract text, which also produces spacing that corresponds exactly to that of the PDF file:
use PyMuPDF from the command line. (I'm again using the same 'test.pdf'.)

python.exe -m pymupdf gettext test.pdf -output out.txt -pages 1

out.txt:
