TEXT_DEHYPHENATE not working properly #1926

joaquimcampos · 2022-09-13T12:08:24Z

Bug report

Running text extraction with TEXT_DEHYPHENATE does not produce the expected behaviour for the following PDF: issue_one_page.pdf. Words hyphenated at line ends are not joined. (It does work correctly on other pages.)

To reproduce, run the following code on the pdf issue_one_page.pdf.

import click
import fitz  # PyMuPDF


def main(pdf_file):
    # Open the document and take its first page.
    doc = fitz.open(pdf_file)
    page = doc[0]

    # Default plain-text flags plus dehyphenation.
    flags = fitz.TEXTFLAGS_TEXT | fitz.TEXT_DEHYPHENATE
    text = page.get_text(flags=flags)
    print(text)


@click.command()
@click.argument('pdf-file', type=click.Path(exists=True))
def cli(pdf_file):
    main(pdf_file)


if __name__ == '__main__':
    cli()

This gives

$ python3 issue.py issue_one_page.pdf
42
Ένα τίποτα μπορεί ν’ αλλάξει τα πάντα
Όταν επιτέλους περάσετε την Καμπή της Λανθάνουσας Δυ-
νατότητας, οι περισσότεροι θα θεωρήσουν ότι τα καταφέρατε εν 
μία νυκτί! Ο κόσμος που μας περιβάλλει, βλέπει μόνο την κο-
ρύφωση της δράσης μας και όχι όσα προηγήθηκαν. Εσείς όμως, 
γνωρίζετε ότι η επιτυχία σας έγινε εφικτή χάρη στην προσπάθεια 
που καταβάλατε για πολύ καιρό, όταν πιστεύατε ότι δεν σημειώ-
νατε πρόοδο. 
Είναι το ανθρώπινο ισοδύναμο της γεωλογικής πίεσης. Δύο 
τεκτονικές πλάκες μπορεί να συγκλίνουν μεταξύ τους για εκατομ-
μύρια χρόνια και η πίεση σταδιακά να συσσωρεύεται. Κι έπειτα 
κάποια μέρα, τρίβονται μεταξύ τους και πάλι με τον ίδιο τρόπο 
που το έκαναν όλα αυτά τα χρόνια, αλλά αυτή τη φορά η πίεση 
είναι μεγάλη. Γίνεται σεισμός. H αλλαγή μπορεί να συντελείται 
χρόνια, μέχρι να φτάσει στο σημείο της ορατής της εκτόνωσης.
Η επιδεξιότητα απαιτεί υπομονή. Οι Σαν Αντόνιο Σπερς (23), 
μια από τις πιο επιτυχημένες ομάδες στην ιστορία του NBA, 
έχουν μια φράση του κοινωνικού μεταρρυθμιστή Τζέικομπ Ρίις 
στα αποδυτήριά τους: «Όταν απελπίζομαι, κάθομαι και κοιτάζω 
κάποιον λιθοξόο να σφυροκοπάει την πέτρα του. Τη σφυροκοπά-
ει ίσως και εκατό φορές, χωρίς να σχηματίζεται ούτε μια ρωγμή 
στην επιφάνειά της. Κι όμως στο εκατοστό πρώτο χτύπημα η πέ-
τρα θα κοπεί στα δύο και ξέρω ότι αυτό δεν οφείλεται στο τελευ-
ταίο χτύπημα, αλλά σε όλα όσα είχαν προηγηθεί».
ΑΠΟΤΕΛΕΣΜΑΤΑ

joaquimcampos · 2022-09-15T11:24:17Z

Author

I believe the issue is that the text extraction is identifying different lines as belonging to different blocks, and TEXT_DEHYPHENATE only joins lines and spans within the same block.
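
One way to check this, as a minimal sketch: `page.get_text("dict")` returns a dictionary with a "blocks" list, in which text blocks have type 0 and carry a "lines" list. The helper name `lines_per_block` and the hand-made sample data below are illustrative assumptions, not part of PyMuPDF.

```python
def lines_per_block(page_dict):
    """Return the number of lines in each text block of a get_text("dict") result."""
    return [
        len(block.get("lines", []))
        for block in page_dict.get("blocks", [])
        if block.get("type", 0) == 0  # 0 = text block, 1 = image block
    ]


# Hand-made stand-in for page.get_text("dict"). With a real file one would
# pass the dictionary returned for the page instead.
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Δυ-"}]}]},
        {"type": 0, "lines": [{"spans": [{"text": "νατότητας"}]}]},
        {"type": 1},  # image block, ignored
    ]
}

counts = lines_per_block(sample)
print(counts)
```

If every entry is 1, each line sits in its own block, so there is no pair of lines inside a single block for the flag to join.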

JorjMcKie · 2022-09-15T11:51:07Z

Maintainer

Ah, have you confirmed this is the case here?
I have started studying the file, but I haven't looked at that detail yet.
If the lines indeed are in different blocks, then you are quite right ...

JorjMcKie · 2022-09-15T12:14:53Z

Maintainer

Just tested it: you are right!
Every line is in its own block. So indeed dehyphenation cannot work.
The algorithm that arranges text into the block/line/span hierarchy (located within MuPDF) takes a number of criteria into account, such as inter-line distance, font size, font characteristics (ascender, descender) and more, but it does not interpret the text itself.

In this case, each line has a height of 12.74. The distance from one line's bottom to the next line's top is 4.3.
Also, as a preliminary analysis shows, each line is coded in its own PDF text object, i.e. wrapped in its own BT/ET operator pair.
Taken together, this was apparently too much for MuPDF to put the lines into the same block.

So you had the right idea: this example is not suitable for dehyphenation.
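
For anyone who wants to measure the same quantities on their own file, here is a rough sketch. It assumes line bounding boxes given as (x0, y0, x1, y1) tuples with y growing downwards, as in the "bbox" entries of `page.get_text("dict")`. The function name `line_metrics` is made up for illustration, and the sample boxes are synthetic, built only from the two figures quoted above.

```python
def line_metrics(bboxes):
    """Given line bboxes (x0, y0, x1, y1) in top-to-bottom order, return
    (heights, gaps): each line's height and the gap to the following line."""
    heights = [y1 - y0 for (_, y0, _, y1) in bboxes]
    gaps = [nxt[1] - cur[3] for cur, nxt in zip(bboxes, bboxes[1:])]
    return heights, gaps


# Synthetic boxes: height 12.74, gap 4.3 (not read from the actual PDF).
HEIGHT, GAP = 12.74, 4.3
bboxes = [
    (50.0, 100.0 + i * (HEIGHT + GAP), 400.0, 100.0 + i * (HEIGHT + GAP) + HEIGHT)
    for i in range(3)
]

heights, gaps = line_metrics(bboxes)
ratio = gaps[0] / heights[0]  # gap expressed as a fraction of the line height
print(f"gap/height ratio: {ratio:.2f}")
```

Expressing the gap relative to the line height makes the number comparable across documents with different font sizes.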

JorjMcKie · 2022-09-15T12:19:21Z

Maintainer

Based on the insight presented by your example, we will insert a comment in the documentation.

jamie-lemon · 2022-09-15T12:33:20Z

Maintainer

I'll be sure to update https://pymupdf.readthedocs.io/en/latest/vars.html?highlight=dehyphenate#TEXT_DEHYPHENATE with some notes soon. Going forward, maybe we could parameterise line-height or something alongside this flag so that lines are considered to be part of the same block? No idea if that is something which is feasible or not.

JorjMcKie · 2022-09-15T12:59:03Z

Maintainer

"I'll be sure to update https://pymupdf.readthedocs.io/en/latest/vars.html?highlight=dehyphenate#TEXT_DEHYPHENATE with some notes soon. Going forward, maybe we could parameterise line-height or something alongside this flag so that lines are considered to be part of the same block? No idea if that is something which is feasible or not."

I am afraid this would have to happen inside MuPDF's text page logic. Any change we introduce there would also affect things like text search, not to mention that subsequent lines may not have the same inclination angle. Also, if the text is not coded in reading sequence, the whole thing breaks down anyway.
We might think about increasing the threshold WRT inter-line distances - which in this case seems to be the one reason why each line lives in its own block.
As of today, PyMuPDF makes no attempt to interfere here: it just passes the text flags bit field on to MuPDF's text page creation.

JorjMcKie · 2022-09-15T13:00:49Z

Maintainer

I think this issue has now turned into a discussion item, so let me transfer it to there.

joaquimcampos · 2022-09-15T15:05:11Z

Author

" We might think about increasing the threshold WRT inter-line distances - which in this case seems to be the one reason why each line lives in its own block."

I think this is a wise choice since visually the lines do seem to belong in the same block.

I have written my own Python code to merge blocks where the last line of one block and the first line of the next fit some criteria (relative vertical distance, horizontal position, etc.). This solved the issue.
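
For readers looking for a starting point, here is a rough sketch of that kind of post-processing. It is not the code actually used above. It works on the "blocks" list of `page.get_text("dict")`, and the function names, the thresholds `max_gap_ratio` and `max_x_shift`, and the sample blocks are all assumptions made up for illustration.

```python
def should_merge(prev_line, next_line, max_gap_ratio=0.5, max_x_shift=20.0):
    """Heuristic: the next line continues the previous one if it starts below
    it, the vertical gap is small relative to the line height, and the left
    edges roughly align."""
    px0, py0, _, py1 = prev_line["bbox"]
    nx0, ny0, _, _ = next_line["bbox"]
    height = py1 - py0
    gap = ny0 - py1
    return (
        height > 0
        and ny0 > py0
        and gap <= max_gap_ratio * height
        and abs(nx0 - px0) <= max_x_shift
    )


def merge_blocks(blocks, **thresholds):
    """Merge consecutive text blocks; returns a list of line lists."""
    merged = []
    for block in blocks:
        lines = block.get("lines", [])
        if not lines:  # image blocks carry no "lines"
            continue
        if merged and should_merge(merged[-1][-1], lines[0], **thresholds):
            merged[-1].extend(lines)
        else:
            merged.append(list(lines))
    return merged


def dehyphenate(lines):
    """Join lines into one string; a trailing hyphen glues the word to the
    next line. Note: this also removes genuine hyphens at line ends."""
    out = ""
    for line in lines:
        text = "".join(span["text"] for span in line["spans"]).strip()
        if out.endswith("-"):
            out = out[:-1] + text
        elif out:
            out += " " + text
        else:
            out = text
    return out


def make_block(text, top, height=12.74):
    """Build a one-line text block shaped like a get_text("dict") block."""
    bbox = (50.0, top, 400.0, top + height)
    return {"type": 0, "lines": [{"bbox": bbox, "spans": [{"text": text}]}]}


# Synthetic sample: two lines 4.3 apart, then a distant heading.
blocks = [
    make_block("την Καμπή της Λανθάνουσας Δυ-", 100.0),
    make_block("νατότητας, οι περισσότεροι", 117.04),
    make_block("ΑΠΟΤΕΛΕΣΜΑΤΑ", 300.0),
]
paragraphs = [dehyphenate(lines) for lines in merge_blocks(blocks)]
for paragraph in paragraphs:
    print(paragraph)
```

The thresholds will need tuning per document. An indented first line of a paragraph, for example, may or may not fall within `max_x_shift`, which decides whether paragraphs are kept apart or merged.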
