Can we delete vertical text? #2271

sergenti · 2023-02-27T21:57:30Z


🙋🏻‍♂️ Hey there, I noticed that in 99% of my docs vertical texts are pretty much useless, so I want to get rid of them.

❓ I read it can be done with page.get_text("dict", sort=True) but I could not find examples online.

🤖 ChatGPT gave me this output, but I'm pretty sure the "angle" property does not exist after reading the documentation here

import fitz

# open the PDF file
pdf_document = "example.pdf"
doc = fitz.open(pdf_document)

# iterate through all pages of the document
for page in doc:
    # get the blocks of text on the page
    blocks = page.getText("dict")["blocks"]
    # iterate through each block
    for block in blocks:
        # get the orientation of the block
        angle = block["angle"]
        # if the block is vertical, remove it from the page
        if 45 < abs(angle) < 135:
            page.delete_block(block)

# save the modified document
doc.save("modified.pdf")
# close the document
doc.close()

📝 Here is the doc I'm parsing with PyMuPDF

sergenti · 2023-02-27T22:09:54Z

Author

EDIT: maybe I found a way

To determine the orientation of a text block in the current version of PyMuPDF, you can use the bbox property of the block to calculate the width and height of the block. If the height is greater than the width, the block is considered to be in a vertical orientation.

If the above statement is true, then this should do the job, although I'm not sure that deleting a block is changing the pdf variable. Can you guys double-check for me?

import fitz  # PyMuPDF

def delete_vertical_text(pdf):

    # loop through every page
    for page in pdf:

        # get all the blocks
        blocks = page.get_text("dict")["blocks"]

        # loop through every block
        for block in blocks:

            # get the bounding box of the block
            bbox = fitz.Rect(block["bbox"])

            # calculate the width and height of the block
            width, height = bbox.width, bbox.height

            # if the block is taller than wide, treat it as vertical and remove it
            if height > width:
                page.delete_block(block)

    return pdf


JorjMcKie · 2023-02-28T06:02:52Z

Maintainer

page.delete_block()???
It does not exist in PyMuPDF. Another example of the severe limitations of artificial "intelligence".
You can use redaction annotations, however, which can remove text, images and links under a given rectangle.

page.add_redact_annot(bbox1)
page.add_redact_annot(bbox2)
...
page.apply_redactions()

Of course, your check for vertical text is error-prone and will only work some of the time. A clean solution should check the actual writing direction, like this:

for block in page.get_text("dict", flags=fitz.TEXTGLAGS_TEXT)["blocks"]:
    for line in block["lines"]:
        wdir = line["dir"]    # writing direction = (cosine, sine)
        if wdir[0] == 0:  # either 90° or 270°
            page.add_redact_annot(line["bbox"])
page.apply_redactions(images=fitz.REDACT_IMAGE_NONE)  # remove text, but no image
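
For readers who want something to paste and try, here is a minimal sketch of the same idea, using the constant names as corrected further down in this thread. The helper names (is_vertical, vertical_line_bboxes, remove_vertical_text) and the tolerance value are my own assumptions, not part of PyMuPDF. The tolerance is there because the cosine is a float and may come back as a tiny non-zero value instead of exactly 0.

```python
import math


def is_vertical(direction, tol=1e-3):
    """Return True if a (cosine, sine) writing direction is about 90° or 270°."""
    cosine, _sine = direction
    return math.isclose(cosine, 0.0, abs_tol=tol)


def vertical_line_bboxes(text_dict, tol=1e-3):
    """Collect the bboxes of vertical lines from page.get_text("dict") output."""
    bboxes = []
    for block in text_dict["blocks"]:
        # image blocks carry no "lines" key, so fall back to an empty list
        for line in block.get("lines", []):
            if is_vertical(line["dir"], tol):
                bboxes.append(line["bbox"])
    return bboxes


def remove_vertical_text(page):
    """Redact every vertical line on a PyMuPDF page and return how many were hit."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    text_dict = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)
    bboxes = vertical_line_bboxes(text_dict)
    for bbox in bboxes:
        page.add_redact_annot(bbox)
    if bboxes:
        # remove the text under the rectangles, but leave images untouched
        page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
    return len(bboxes)
```

Because the decision is made per line from its writing direction, a tall multi-line paragraph of ordinary horizontal text is left alone, which is where a height-versus-width test on the block bbox would go wrong.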

7 replies

JorjMcKie Feb 28, 2023
Maintainer

That can't be, provided what you see as vertical text really is text. Have you put in some debugging prints to check that the text is indeed identified correctly?
Also, why do you overwrite the document object ("return pdf")?
You could simply have this loop:

for page in doc:
    text = cleaned_page_text(page)  # the function removes vertical text and reads and returns the rest

The function cleaned_page_text() could remove vertical text via redactions and then return page.get_text().
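
A possible shape for such a function, as a rough sketch only: the name cleaned_page_text() comes from the comment above, but the body, the images parameter and the 1e-3 tolerance are assumptions of mine. It relies only on the page methods already mentioned in this thread (get_text, add_redact_annot, apply_redactions).

```python
def cleaned_page_text(page, images=0):
    """Remove vertical text from a page via redactions, then return the rest.

    `images` is passed on to apply_redactions(); the default 0 is meant to be
    the value of fitz.PDF_REDACT_IMAGE_NONE (pass the named constant
    explicitly if you prefer), so images on the page are left untouched.
    """
    found = False
    for block in page.get_text("dict")["blocks"]:
        # image blocks have no "lines" key, hence the empty-list fallback
        for line in block.get("lines", []):
            cosine = line["dir"][0]  # writing direction = (cosine, sine)
            if abs(cosine) < 1e-3:  # about 90° or 270°
                page.add_redact_annot(line["bbox"])
                found = True
    if found:
        page.apply_redactions(images=images)
    return page.get_text()
```

Nothing needs to be returned for the document itself: the redactions change the page in memory, and the file on disk only changes if you save the document afterwards.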

sergenti Feb 28, 2023
Author

Thanks for the tip! How can I check if the line is indeed vertical?
Btw fitz.TEXTGLAGS_TEXT and fitz.REDACT_IMAGE_NONE do not exist...

JorjMcKie Feb 28, 2023
Maintainer

Sorry for the typos: TEXTFLAGS_TEXT and PDF_REDACT_IMAGE_NONE are the correct names.

JorjMcKie Feb 28, 2023
Maintainer

The check itself is correct, so I don't know what is going wrong. What does your print output say?
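
In case it helps with the debugging prints, here is one way to list every line together with its writing direction. This is just a sketch: describe_lines() is a made-up helper name, and it only walks the plain dictionary that page.get_text("dict") returns (blocks, then lines, then spans with their "text").

```python
def describe_lines(text_dict):
    """Return one summary string per text line: its direction and its text."""
    rows = []
    for block in text_dict["blocks"]:
        # image blocks have no "lines" key, so they are skipped
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"])
            cosine, sine = line["dir"]  # writing direction = (cosine, sine)
            rows.append(f"dir=({cosine:+.3f}, {sine:+.3f}) text={text!r}")
    return rows
```

Printing each row of describe_lines(page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)) shows at a glance whether the lines that look vertical really report a cosine of 0. If they do not show up at all, they are probably not text (for example, part of an image).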

julianolm Feb 9, 2024

@sergenti @JorjMcKie that was great. Thank you guys for this discussion!
