Replies: 6 comments
-
The document has 60 pages. Please pick an example page.
-
It happens on any page with tables. The 29th page, f[28], for example.
-
Thanks for the reply. I noticed that table detection was added in a recent release but haven't tried it out yet. Is there another export type that would preserve the layout better?
-
Did you see my code snippet at the end of my post? It might be a decent approximation. BTW, layout-preserving text extraction is also available via "fitz as a module". I am going to move this issue to the "Discussions" tab now.
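The snippet mentioned above is not reproduced in this thread. As a rough sketch of one possible approximation (not necessarily the one the maintainer posted), words can be regrouped into lines by their vertical position. The helper name `words_to_lines` and the `y_tolerance` value are hypothetical, and the sample tuples are made up; they only mimic the shape of the items returned by `page.get_text("words")`.

```python
def words_to_lines(words, y_tolerance=3.0):
    """Join words whose bottom edges nearly coincide into one text line.

    `words` holds tuples shaped like the items of page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    `y_tolerance` (in points) is an assumed value; tune it per document.
    """
    lines = []  # each entry: [reference_y1, [(x0, text), ...]]
    # Visit words top to bottom, then left to right.
    for x0, y0, x1, y1, text, *_ in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(lines[-1][0] - y1) <= y_tolerance:
            lines[-1][1].append((x0, text))   # same visual row
        else:
            lines.append([y1, [(x0, text)]])  # start a new row
    # Within each row, order the words by their left edge and join with spaces.
    return "\n".join(
        " ".join(text for _, text in sorted(items)) for _, items in lines
    )


# Made-up word tuples standing in for page.get_text("words") output.
sample = [
    (300.0, 100.0, 310.0, 112.0, "$", 0, 0, 0),
    (72.0, 100.0, 120.0, 112.0, "Revenue", 0, 1, 0),
    (350.0, 100.5, 380.0, 112.5, "1,000", 0, 2, 0),
    (72.0, 120.0, 100.0, 132.0, "Cost", 1, 0, 0),
    (300.0, 120.0, 310.0, 132.0, "$", 1, 1, 0),
    (350.0, 120.0, 380.0, 132.0, "400", 1, 2, 0),
]
print(words_to_lines(sample))
```

With a real document, the call would presumably be `words_to_lines(page.get_text("words"))`. Note that this merges multi-column text into single lines too, so it only suits pages where that is acceptable.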
-
HTML export also has the phantom line breaks. Maybe that's the expected behavior as well. I think I understand why linefeeds may be warranted in some cases even when the text is on the same y coordinate, for example in multi-column text. When extra linefeeds are not inserted, some LLMs can accurately understand tables from plain text. With extra linefeeds, they're far worse at it. Since the library can already detect tables, is there any chance you could include an option in a future release to treat tables differently in the get_text() call?
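Until such an option exists, one possible workaround is to handle tables separately using the table detection that is already available. The sketch below assumes PyMuPDF 1.23.0 or later, where `page.find_tables()` returns a finder whose `tables` list contains objects with an `extract()` method. The helper names `rows_to_text` and `tables_as_text` and the `" | "` separator are hypothetical choices, not part of the library.

```python
def rows_to_text(rows, sep=" | "):
    """Render table rows (lists of cell strings or None) one row per line."""
    lines = []
    for row in rows:
        # Collapse any whitespace inside a cell, including line feeds;
        # empty cells may be reported as None.
        cells = [" ".join((cell or "").split()) for cell in row]
        lines.append(sep.join(cells))
    return "\n".join(lines)


def tables_as_text(page):
    """Return one text block per table detected on a PyMuPDF page.

    Assumes `page` is a PyMuPDF page object that supports find_tables().
    """
    return [rows_to_text(table.extract()) for table in page.find_tables().tables]


# Made-up rows in the shape that Table.extract() returns.
print(rows_to_text([["Revenue", "$\n1,000"], ["Cost", None]]))
```

Each row stays on one line, so the currency symbol and the amount end up separated by a space instead of a line feed.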
-
Describe the bug (mandatory)
The PDF was downloaded from EDGAR. The page.get_text() method appears to treat tabs as line feeds, which puts line feeds between a currency symbol and its amount, for example.
To Reproduce (mandatory)
import fitz  # PyMuPDF

f = fitz.open(pdf_path)  # pdf_path: path to the attached PDF
for page in f:
    page_text = page.get_text()
The returned text has many extra '\n's.
pypdf reads the doc correctly.
Expected behavior (optional)
I expected to see spaces instead of \n
Your configuration (mandatory)
3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
win32
PyMuPDF 1.23.4: Python bindings for the MuPDF 1.23.2 library.
Version date: 2023-09-26 00:00:01.
Built for Python 3.10 on win32 (64-bit).
Additional context (optional)
sonos_q2_2023_10q.pdf