Spotting text in PDF table cells #1657

alphomeg · 2022-03-29T21:36:33Z

alphomeg
Mar 29, 2022

Hi, I'm working on a use case where I need to de-structure a whole PDF document and restructure it in a Flutter app.
De-structuring involves extracting each element (text, tables, drawings, images) and creating a JSON file from them that preserves the original layout and sequence of the elements. I want to extract tables and drawings as images and reference each of them in my output JSON, like below:

{'sequence no.': 1, 'type': 'text', 'data': extracted text here}
{'sequence no.': 2, 'type': 'image', 'data': image path}

Tables, drawings and extracted images will all fit under 'type': 'image'.
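
The record layout above could be produced along these lines. This is only a minimal sketch: the helper name build_records, the (kind, payload) input shape and the image file names are hypothetical and chosen for illustration; only the three keys come from the post.

```python
import json


def build_records(elements):
    """Turn (kind, payload) pairs, given in reading order, into records."""
    records = []
    for seq, (kind, payload) in enumerate(elements, start=1):
        # Tables, drawings and images are all stored as image references.
        rec_type = "text" if kind == "text" else "image"
        records.append({"sequence no.": seq, "type": rec_type, "data": payload})
    return records


# Hypothetical page content; the file names are made up.
elements = [
    ("text", "Introduction"),
    ("table", "page1_table1.png"),
    ("drawing", "page1_drawing1.png"),
]
records = build_records(elements)
print(json.dumps(records, indent=2))
```

Keeping the sequence number explicit means the app can restore the original order even if the records are stored or transmitted out of order.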

Problem
I have implemented the recreation of drawings and tables by redrawing each original drawing or table on a new PDF page (I haven't converted them to images yet, as I have been stuck here). All the drawings are redrawn perfectly fine, and so are the tables. The problem is the text: when I take the rects of the paths from get_drawings() and extract text from them, I easily get the whole text in all formats for drawings, but when I do the same for tables it just doesn't work and returns empty text. Below is the code for drawing on the new PDF page, extracting the text and then inserting it into the rects.

import fitz
#drawing
doc = fitz.open("04225A  Admission to the Neonatal Unit 5.1.pdf")
page = doc[14]
#drawing
#doc = fitz.open("MSECCP002 - SVT - Adults ED Pathway 1.0.pdf")
#page = doc[2]

#table
#doc = fitz.open("MSBPO-18001 Information governance and management policy 1.4.pdf")
#page = doc[26]

#table
#doc = fitz.open("04225A  Admission to the Neonatal Unit 5.1.pdf")
#page = doc[1] #or 0

page.clean_contents()
paths = page.get_drawings()  # extract existing drawings
annot = page.get_bboxlog()
outpdf = fitz.open()
outpage = outpdf.new_page(width=page.rect.width, height=page.rect.height)
shape = outpage.new_shape()  # make a drawing canvas for the output page
table_flag = True

for path in paths:
    # ------------------------------------
    # draw each entry of the 'items' list
    # ------------------------------------
    seqno = path['seqno']
    #print(seqno)
    for item in path["items"]:  # these are the draw commands
        if item[0] == "l":  # line
            shape.draw_line(item[1], item[2])
        elif item[0] == "re":  # rectangle
            shape.draw_rect(item[1])
        elif item[0] == "qu":  # quad
            shape.draw_quad(item[1])
        elif item[0] == "c":  # curve
            shape.draw_bezier(item[1], item[2], item[3], item[4])
        else:
            raise ValueError("unhandled drawing", item)
            
    text=list()
    rect = path['rect']
    text_dict = page.get_text('dict',clip=rect, flags=fitz.TEXT_INHIBIT_SPACES)
    for block in text_dict['blocks']:
        for lines in block['lines']:
            for span in lines['spans']:
                text.append(span['text'])
    # the bbox log entry for this path sits at seqno-1 or at seqno
    if text and (annot[seqno - 1][1] == rect or annot[seqno][1] == rect):
        # skip blank spans, map the bullet glyph to "-" and trim spaces
        text = [t.replace("\n", "").replace("\uf0b7", "-").strip(" ") for t in text if t != " "]
        text1 = list()
        i = 0
        while i < len(text):
            if text[i] == "-" and i + 1 < len(text):
                text1.append(text[i] + text[i + 1])  # glue bullet to its item
                i += 2
            else:
                text1.append(text[i])
                i += 1
        rc = shape.insert_textbox(rect, text1, align=1)
        table_flag = False
        if rc < 0:  # text did not fit: shrink the font until it does
            font_size = 11  # default fontsize of insert_textbox
            while rc < 0 and font_size > 1:
                font_size = font_size - 0.009
                rc = shape.insert_textbox(rect, text1, fontsize=font_size, align=1)
            if rc >= 0:
                print("inserted")
    # ------------------------------------------------------
    # all items are drawn, now apply the common properties
    # to finish the path
    # ------------------------------------------------------
    shape.finish(
        fill=path["fill"],  # fill color
        color=path["color"],  # line color
        dashes=path["dashes"],  # line dashing
        even_odd=path.get("even_odd", True),  # control color of overlaps
        closePath=path["closePath"],  # whether to connect last and first point
        lineJoin=path["lineJoin"],  # how line joins should look like
        lineCap=max(path["lineCap"]),  # how line ends should look like
        width=path["width"],  # line width
        stroke_opacity=path.get("stroke_opacity", 1),  # same value for both
        fill_opacity=path.get("fill_opacity", 1),  # opacity parameters
        )
# all paths processed - commit the shape to its page
shape.commit()
outpdf.save("drawings.pdf")

This should work for the drawings in the provided documents. I expected it to work for tables as well, but no text is extracted when the drawing is a table.

The documents are attached. Please help me out. I've tried the rects in get_texttrace as well, but they also do not return the full text. I've been stuck on this for days. Given the potential of PyMuPDF, it is the best fit for my use case: if I can just get all the text for tables, the same way I'm extracting it for drawings, that will solve a lot of my problems.
MSECCP002 - SVT - Adults ED Pathway 1.0.pdf
04225A Admission to the Neonatal Unit 5.1.pdf
MSBPO-18001 Information governance and management policy 1.4.pdf

Answered by JorjMcKie

Mar 30, 2022


JorjMcKie · 2022-03-30T09:36:53Z

JorjMcKie
Mar 30, 2022
Maintainer

I haven't yet looked into the details, but here is a first hint:
insert_textbox() may not actually insert text if the provided rectangle is not large enough to contain all of the text.
You should always check its return code: if negative, nothing is inserted. The returned float is then the amount of rectangle height that is missing to contain all of the text.
As an immediate debugging help, put the insertion in a while loop and reduce fontsize by some amount while the return code is < 0.
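
The retry loop suggested here might look like the sketch below. It is written against a plain callable so that it runs without a PDF: fit_fontsize, its parameters and fake_insert are hypothetical names, and fake_insert only imitates the sign convention of the return code (negative means the text did not fit). With PyMuPDF, the callable would presumably wrap the call from the thread, for example lambda fs: shape.insert_textbox(rect, text1, fontsize=fs, align=1).

```python
def fit_fontsize(try_insert, start=11.0, step=0.5, minimum=4.0):
    """Call try_insert(fontsize) with shrinking sizes until it returns >= 0.

    Returns the first font size that fits, or None if even `minimum` fails.
    """
    size = start
    while size >= minimum:
        if try_insert(size) >= 0:
            return size
        size -= step
    return None


def fake_insert(fontsize, box_height=20.0, lines=2):
    """Stand-in for the real insertion: leftover height, negative if too small."""
    return box_height - lines * 1.2 * fontsize


chosen = fit_fontsize(fake_insert)
print(chosen)
```

Bounding the loop with a minimum size avoids spinning forever when the rectangle is too small for any readable font.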

4 replies

alphomeg Mar 30, 2022
Author

Thank you for your reply. The problem is not insert_textbox(), it's get_text(clip=rect). As you can see, I've already handled the text insertion:

if rc <0:
    while True:
        font_size = font_size-0.009
        rc = shape.insert_textbox(rect,text1,fontsize=font_size,align=1)
        if rc >=0:
            break
    print("inserted")

The rects that I get from get_drawings() return text for the drawings but not for the tables. For tables they only return the header's text.

alphomeg Mar 30, 2022
Author

When i recreate this workflow in this document

doc = fitz.open("04225A  Admission to the Neonatal Unit 5.1.pdf")
page = doc[14]

this is what I get
drawings 39.pdf

when I recreate the table in this document

doc = fitz.open("MSBPO-18001 Information governance and management policy 1.4.pdf")
page = doc[26]

this is what I get
drawings 40.pdf

JorjMcKie Mar 30, 2022
Maintainer

Ok, I see.
But how do you know that the table cells really are represented by rectangles? This may be the case only for the header cells. The rest of the table may be only visibly divided in cells by drawing a few lines ...!

alphomeg Mar 30, 2022
Author

Yes, I thought that initially, but when I check whether any line is being drawn using the following, there are no lines, only rects:

doc = fitz.open("./pdfs/MSBPO-18001 Information governance and management policy 1.4.pdf")
page = doc[26]
page.clean_contents()
trect = list()
paths = page.get_drawings()  # extract existing drawings
for path in paths:
    for item in path["items"]:  # these are the draw commands
        if item[0] == "l":  # line
            print("line")
            shape.draw_line(item[1], item[2])
        elif item[0] == "re":  # rectangle
            shape.draw_rect(item[1])
            trect.append(item[1])
        elif item[0] == "qu":  # quad
            shape.draw_quad(item[1])
        elif item[0] == "c":  # curve
            shape.draw_bezier(item[1], item[2], item[3], item[4])
        else:
            raise ValueError("unhandled drawing", item)
    
print(len(trect))
print(len(paths))

The lengths of paths and trect are the same, i.e. 156.

JorjMcKie · 2022-03-30T10:12:04Z

JorjMcKie
Mar 30, 2022
Maintainer

Indeed, if you do this:

doc=fitz.open(doc.name)
page=doc[26]
for p in paths:
    for item in p["items"]:
        if item[0] == "re":
            r=item[1]
            if r.width >2 and r.height>2:
                page.draw_rect(item[1],color=(1,0,0),width=0.3)

You get this:

So except for the header, no text is extracted based on drawing rectangles.

1 reply

alphomeg Mar 30, 2022
Author

I see. Is there any way I can extract the text from the rest of the cells?

JorjMcKie · 2022-03-30T10:17:15Z

JorjMcKie
Mar 30, 2022
Maintainer

Depends on how lines are drawn. Many PDF creators do not actually draw lines, but thin rectangles instead.
You need your own logic to identify the complete shape of the table. Nothing in PDF can help you here.

4 replies

JorjMcKie Mar 30, 2022
Maintainer

For example, MS Word and LibreOffice do that in their PDF exports, obviously because thin rectangles behave more consistently under zooms than drawn lines do.

alphomeg Mar 30, 2022
Author

I have a set of PDFs available, and this behaviour is the same across all of them: I get the text from the headers but not from all cells. I'd appreciate it if you could provide some insight in this regard. I can get the text of the whole page using get_text("dict") along with the rects, but I don't see any way to identify whether a rect is in a table or not. Is there a way I can mark the boundary of the table and then call get_text("dict", clip=table_boundary_rect)?

JorjMcKie Mar 30, 2022
Maintainer

As I wrote: you are on your own here.
PDF knows nothing about something like a "table" (although the PDF spec does have a way to define tables, which is seldom actually used).
Your PDF creator has used text insertion and draw commands to make the page appearance. Nothing else.
Your logic now must analyze the page, look where the lines and rectangles are and guess whether they actually might make up a table.
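
One possible shape for such an analysis is sketched below in plain Python. It assumes a regular grid without merged cells, treats rectangles as (x0, y0, x1, y1) tuples standing in for the Rect objects from get_drawings(), and uses hypothetical helper names (cluster, cells_from_rules) and thresholds. It is a guess at a workable heuristic, not something PyMuPDF provides.

```python
def cluster(values, tol=1.0):
    """Sort coordinates and drop those within tol of the previous kept one."""
    merged = []
    for v in sorted(values):
        if merged and v - merged[-1] <= tol:
            continue  # belongs to the same rule as the previous coordinate
        merged.append(v)
    return merged


def cells_from_rules(rects, max_thickness=1.0, tol=1.0):
    """Guess cell rectangles from thin rectangles that act as ruling lines."""
    xs, ys = [], []
    for x0, y0, x1, y1 in rects:
        width, height = x1 - x0, y1 - y0
        if width <= max_thickness and height > max_thickness:
            xs.append(x0)  # thin and tall: vertical rule
        elif height <= max_thickness and width > max_thickness:
            ys.append(y0)  # thin and wide: horizontal rule
    xs, ys = cluster(xs, tol), cluster(ys, tol)
    return [
        (xs[i], ys[j], xs[i + 1], ys[j + 1])
        for j in range(len(ys) - 1)
        for i in range(len(xs) - 1)
    ]


# Made-up 2 x 2 table: three vertical and three horizontal thin rectangles.
rules = [
    (10, 10, 10.5, 50), (50, 10, 50.5, 50), (90, 10, 90.5, 50),
    (10, 10, 90, 10.5), (10, 30, 90, 30.5), (10, 50, 90, 50.5),
]
cells = cells_from_rules(rules)
print(cells)
```

Each resulting tuple could then be tried as the clip argument of page.get_text(), as the thread's code already does with path rectangles.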

alphomeg Mar 30, 2022
Author

Okay thank you for the explanation. much appreciated

JorjMcKie · 2022-03-30T11:07:51Z

JorjMcKie
Mar 30, 2022
Maintainer

I am not sure what your ultimate goal actually is.
But you might be better off to re-write the whole page text independently from re-drawing the drawings.
Because you have all the coordinates for everything, things must make up a satisfactory appearance of the recreated page.

1 reply

alphomeg Mar 30, 2022
Author

The ultimate goal is to de-structure the whole PDF document into a JSON file; from that JSON the document will be recreated in the mobile app for better visualization, where we can increase or decrease the text size or change the color for visually impaired individuals. Tables, images and drawings need to be extracted as images. The rest of the text is not much of a problem: I can still get every line, heading or prefix.

If I can successfully get the tables as images, I can then iterate over all of the page's text and do standard keyword matching between the text of the whole page and the text of the table, to replace that text with the table image. Hope this makes sense.

alphomeg · 2022-03-30T11:23:20Z

alphomeg
Mar 30, 2022
Author

Sorry for bothering you again, there's one thing I don't understand.

When I remove the condition if r.width >2 and r.height>2: I get this:

Also, none of the items in paths has item[0] == "l". How can we know whether there are any lines in the drawing?

('re', Rect(72.26399993896484, 72.4759521484375, 135.62399291992188, 113.89996337890625), 1)
('re', Rect(77.42400360107422, 72.47601318359375, 130.46400451660156, 82.82000732421875), 1)
('re', Rect(77.42400360107422, 82.82000732421875, 130.46400451660156, 93.260009765625), 1)
('re', Rect(136.10000610351562, 72.4759521484375, 227.6840057373047, 113.89996337890625), 1)
('re', Rect(141.25999450683594, 72.47601318359375, 222.6439971923828, 82.82000732421875), 1)
('re', Rect(228.2899932861328, 72.4759521484375, 298.6099853515625, 113.89996337890625), 1)
('re', Rect(233.4499969482422, 72.47601318359375, 293.45001220703125, 82.82000732421875), 1)
('re', Rect(233.4499969482422, 82.82000732421875, 293.45001220703125, 93.260009765625), 1)
('re', Rect(299.0899963378906, 72.4759521484375, 376.6340026855469, 113.89996337890625), 1)
('re', Rect(304.25, 72.47601318359375, 371.4739990234375, 82.82000732421875), 1)
('re', Rect(304.25, 82.82000732421875, 371.4739990234375, 93.260009765625), 1)
('re', Rect(377.1099853515625, 72.4759521484375, 447.4539794921875, 113.89996337890625), 1)
('re', Rect(382.2699890136719, 72.47601318359375, 442.2699890136719, 82.82000732421875), 1)
('re', Rect(382.2699890136719, 82.82000732421875, 442.2699890136719, 93.260009765625), 1)
('re', Rect(382.2699890136719, 93.25994873046875, 442.2699890136719, 103.5799560546875), 1)
('re', Rect(382.2699890136719, 103.5799560546875, 442.2699890136719, 113.89996337890625), 1)
('re', Rect(447.94000244140625, 72.4759521484375, 525.4600219726562, 113.89996337890625), 1)
('re', Rect(453.1000061035156, 72.47601318359375, 520.2999877929688, 82.82000732421875), 1)
('re', Rect(453.1000061035156, 82.82000732421875, 520.2999877929688, 93.260009765625), 1)
('re', Rect(453.1000061035156, 93.25994873046875, 520.2999877929688, 103.5799560546875), 1)
('re', Rect(453.1000061035156, 103.5799560546875, 520.2999877929688, 113.89996337890625), 1)
('re', Rect(71.78399658203125, 72.0, 72.26399993896484, 72.47998046875), 1)
('re', Rect(71.78399658203125, 72.0, 72.26399993896484, 72.47998046875), 1)
('re', Rect(72.26399993896484, 72.0, 135.62399291992188, 72.47998046875), 1)
('re', Rect(135.6199951171875, 72.0, 136.09999084472656, 72.47998046875), 1)
('re', Rect(136.10000610351562, 72.0, 227.80401611328125, 72.47998046875), 1)
('re', Rect(227.80999755859375, 72.0, 228.2899932861328, 72.47998046875), 1)
('re', Rect(228.2899932861328, 72.0, 298.6099853515625, 72.47998046875), 1)
('re', Rect(298.6099853515625, 72.0, 299.0899963378906, 72.47998046875), 1)
('re', Rect(299.0899963378906, 72.0, 376.6340026855469, 72.47998046875), 1)
('re', Rect(376.6300048828125, 72.0, 377.1100158691406, 72.47998046875), 1)
('re', Rect(377.1099853515625, 72.0, 447.4539794921875, 72.47998046875), 1)
('re', Rect(447.4599914550781, 72.0, 447.94000244140625, 72.47998046875), 1)
('re', Rect(447.94000244140625, 72.0, 525.4600219726562, 72.47998046875), 1)
('re', Rect(525.4600219726562, 72.0, 525.9400024414062, 72.47998046875), 1)
('re', Rect(525.4600219726562, 72.0, 525.9400024414062, 72.47998046875), 1)
('re', Rect(71.78399658203125, 72.4759521484375, 72.26399993896484, 113.89996337890625), 1)
('re', Rect(135.6199951171875, 72.4759521484375, 136.09999084472656, 113.89996337890625), 1)
('re', Rect(227.80999755859375, 72.4759521484375, 228.2899932861328, 113.89996337890625), 1)
('re', Rect(298.6099853515625, 72.4759521484375, 299.0899963378906, 113.89996337890625), 1)
('re', Rect(376.6300048828125, 72.4759521484375, 377.1100158691406, 113.89996337890625), 1)
('re', Rect(447.4599914550781, 72.4759521484375, 447.94000244140625, 113.89996337890625), 1)
('re', Rect(525.4600219726562, 72.4759521484375, 525.9400024414062, 113.89996337890625), 1)
('re', Rect(71.78399658203125, 113.9000244140625, 72.26399993896484, 114.3800048828125), 1)
('re', Rect(72.26399993896484, 113.9000244140625, 135.62399291992188, 114.3800048828125), 1)
('re', Rect(135.6199951171875, 113.9000244140625, 136.09999084472656, 114.3800048828125), 1)
('re', Rect(136.10000610351562, 113.9000244140625, 227.80401611328125, 114.3800048828125), 1)
('re', Rect(227.80999755859375, 113.9000244140625, 228.2899932861328, 114.3800048828125), 1)
('re', Rect(228.2899932861328, 113.9000244140625, 298.6099853515625, 114.3800048828125), 1)
('re', Rect(298.6099853515625, 113.9000244140625, 299.0899963378906, 114.3800048828125), 1)
('re', Rect(299.0899963378906, 113.9000244140625, 376.6340026855469, 114.3800048828125), 1)
('re', Rect(376.6300048828125, 113.9000244140625, 377.1100158691406, 114.3800048828125), 1)
('re', Rect(377.1099853515625, 113.9000244140625, 447.4539794921875, 114.3800048828125), 1)
('re', Rect(447.4599914550781, 113.9000244140625, 447.94000244140625, 114.3800048828125), 1)
('re', Rect(447.94000244140625, 113.9000244140625, 525.4600219726562, 114.3800048828125), 1)
('re', Rect(525.4600219726562, 113.9000244140625, 525.9400024414062, 114.3800048828125), 1)
('re', Rect(71.78399658203125, 114.37994384765625, 72.26399993896484, 166.219970703125), 1)
('re', Rect(135.6199951171875, 114.37994384765625, 136.09999084472656, 166.219970703125), 1)
('re', Rect(227.80999755859375, 114.37994384765625, 228.2899932861328, 166.219970703125), 1)
('re', Rect(298.6099853515625, 114.37994384765625, 299.0899963378906, 166.219970703125), 1)
('re', Rect(376.6300048828125, 114.37994384765625, 377.1100158691406, 166.219970703125), 1)
('re', Rect(447.4599914550781, 114.37994384765625, 447.94000244140625, 166.219970703125), 1)
('re', Rect(525.4600219726562, 114.37994384765625, 525.9400024414062, 166.219970703125), 1)
('re', Rect(71.78399658203125, 166.22003173828125, 72.26399993896484, 166.70001220703125), 1)
('re', Rect(72.26399993896484, 166.22003173828125, 135.62399291992188, 166.70001220703125), 1)
('re', Rect(135.6199951171875, 166.22003173828125, 136.09999084472656, 166.70001220703125), 1)
('re', Rect(136.10000610351562, 166.22003173828125, 227.80401611328125, 166.70001220703125), 1)
('re', Rect(227.80999755859375, 166.22003173828125, 228.2899932861328, 166.70001220703125), 1)
('re', Rect(228.2899932861328, 166.22003173828125, 298.6099853515625, 166.70001220703125), 1)
('re', Rect(298.6099853515625, 166.22003173828125, 299.0899963378906, 166.70001220703125), 1)
('re', Rect(299.0899963378906, 166.22003173828125, 376.6340026855469, 166.70001220703125), 1)
('re', Rect(376.6300048828125, 166.22003173828125, 377.1100158691406, 166.70001220703125), 1)
('re', Rect(377.1099853515625, 166.22003173828125, 447.4539794921875, 166.70001220703125), 1)
('re', Rect(447.4599914550781, 166.22003173828125, 447.94000244140625, 166.70001220703125), 1)
('re', Rect(447.94000244140625, 166.22003173828125, 525.4600219726562, 166.70001220703125), 1)
('re', Rect(525.4600219726562, 166.22003173828125, 525.9400024414062, 166.70001220703125), 1)
('re', Rect(71.78399658203125, 166.699951171875, 72.26399993896484, 208.0999755859375), 1)
('re', Rect(135.6199951171875, 166.699951171875, 136.09999084472656, 208.0999755859375), 1)
('re', Rect(227.80999755859375, 166.699951171875, 228.2899932861328, 208.0999755859375), 1)
('re', Rect(298.6099853515625, 166.699951171875, 299.0899963378906, 208.0999755859375), 1)
('re', Rect(376.6300048828125, 166.699951171875, 377.1100158691406, 208.0999755859375), 1)
('re', Rect(447.4599914550781, 166.699951171875, 447.94000244140625, 208.0999755859375), 1)
('re', Rect(525.4600219726562, 166.699951171875, 525.9400024414062, 208.0999755859375), 1)
('re', Rect(71.78399658203125, 208.0999755859375, 72.26399993896484, 208.5799560546875), 1)
('re', Rect(72.26399993896484, 208.0999755859375, 135.62399291992188, 208.5799560546875), 1)
('re', Rect(135.6199951171875, 208.0999755859375, 136.09999084472656, 208.5799560546875), 1)
('re', Rect(136.10000610351562, 208.0999755859375, 227.80401611328125, 208.5799560546875), 1)
('re', Rect(227.80999755859375, 208.0999755859375, 228.2899932861328, 208.5799560546875), 1)
('re', Rect(228.2899932861328, 208.0999755859375, 298.6099853515625, 208.5799560546875), 1)
('re', Rect(298.6099853515625, 208.0999755859375, 299.0899963378906, 208.5799560546875), 1)
('re', Rect(299.0899963378906, 208.0999755859375, 376.6340026855469, 208.5799560546875), 1)
('re', Rect(376.6300048828125, 208.0999755859375, 377.1100158691406, 208.5799560546875), 1)
('re', Rect(377.1099853515625, 208.0999755859375, 447.4539794921875, 208.5799560546875), 1)
('re', Rect(447.4599914550781, 208.0999755859375, 447.94000244140625, 208.5799560546875), 1)
('re', Rect(447.94000244140625, 208.0999755859375, 525.4600219726562, 208.5799560546875), 1)
('re', Rect(525.4600219726562, 208.0999755859375, 525.9400024414062, 208.5799560546875), 1)
('re', Rect(71.78399658203125, 208.5859375, 72.26399993896484, 260.3299560546875), 1)
('re', Rect(135.6199951171875, 208.5859375, 136.09999084472656, 260.3299560546875), 1)
('re', Rect(227.80999755859375, 208.5859375, 228.2899932861328, 260.3299560546875), 1)
('re', Rect(298.6099853515625, 208.5859375, 299.0899963378906, 260.3299560546875), 1)
('re', Rect(376.6300048828125, 208.5859375, 377.1100158691406, 260.3299560546875), 1)
('re', Rect(447.4599914550781, 208.5859375, 447.94000244140625, 260.3299560546875), 1)
('re', Rect(525.4600219726562, 208.5859375, 525.9400024414062, 260.3299560546875), 1)
('re', Rect(71.78399658203125, 260.33001708984375, 72.26399993896484, 260.80999755859375), 1)
('re', Rect(135.6199951171875, 260.33001708984375, 136.09999084472656, 260.80999755859375), 1)
('re', Rect(136.10000610351562, 260.33001708984375, 227.80401611328125, 260.80999755859375), 1)
('re', Rect(227.80999755859375, 260.33001708984375, 228.2899932861328, 260.80999755859375), 1)
('re', Rect(228.2899932861328, 260.33001708984375, 298.6099853515625, 260.80999755859375), 1)
('re', Rect(298.6099853515625, 260.33001708984375, 299.0899963378906, 260.80999755859375), 1)
('re', Rect(299.0899963378906, 260.33001708984375, 376.6340026855469, 260.80999755859375), 1)
('re', Rect(376.6300048828125, 260.33001708984375, 377.1100158691406, 260.80999755859375), 1)
('re', Rect(377.1099853515625, 260.33001708984375, 447.4539794921875, 260.80999755859375), 1)
('re', Rect(447.4599914550781, 260.33001708984375, 447.94000244140625, 260.80999755859375), 1)
('re', Rect(525.4600219726562, 260.33001708984375, 525.9400024414062, 260.80999755859375), 1)
('re', Rect(71.78399658203125, 260.80999755859375, 72.26399993896484, 312.52996826171875), 1)
('re', Rect(135.6199951171875, 260.80999755859375, 136.09999084472656, 312.52996826171875), 1)
('re', Rect(227.80999755859375, 260.80999755859375, 228.2899932861328, 312.52996826171875), 1)
('re', Rect(298.6099853515625, 260.80999755859375, 299.0899963378906, 312.52996826171875), 1)
('re', Rect(376.6300048828125, 260.80999755859375, 377.1100158691406, 312.52996826171875), 1)
('re', Rect(447.4599914550781, 260.80999755859375, 447.94000244140625, 312.52996826171875), 1)
('re', Rect(525.4600219726562, 260.80999755859375, 525.9400024414062, 312.52996826171875), 1)
('re', Rect(71.78399658203125, 312.530029296875, 72.26399993896484, 313.010009765625), 1)
('re', Rect(72.26399993896484, 312.530029296875, 135.62399291992188, 313.010009765625), 1)
('re', Rect(135.6199951171875, 312.530029296875, 136.09999084472656, 313.010009765625), 1)
('re', Rect(136.10000610351562, 312.530029296875, 227.80401611328125, 313.010009765625), 1)
('re', Rect(227.80999755859375, 312.530029296875, 228.2899932861328, 313.010009765625), 1)
('re', Rect(228.2899932861328, 312.530029296875, 298.6099853515625, 313.010009765625), 1)
('re', Rect(298.6099853515625, 312.530029296875, 299.0899963378906, 313.010009765625), 1)
('re', Rect(299.0899963378906, 312.530029296875, 376.6340026855469, 313.010009765625), 1)
('re', Rect(376.6300048828125, 312.530029296875, 377.1100158691406, 313.010009765625), 1)
('re', Rect(377.1099853515625, 312.530029296875, 447.4539794921875, 313.010009765625), 1)
('re', Rect(447.4599914550781, 312.530029296875, 447.94000244140625, 313.010009765625), 1)
('re', Rect(447.94000244140625, 312.530029296875, 525.4600219726562, 313.010009765625), 1)
('re', Rect(525.4600219726562, 312.530029296875, 525.9400024414062, 313.010009765625), 1)
('re', Rect(71.78399658203125, 313.010009765625, 72.26399993896484, 344.0899963378906), 1)
('re', Rect(71.78399658203125, 344.0899963378906, 72.26399993896484, 344.5699768066406), 1)
('re', Rect(71.78399658203125, 344.0899963378906, 72.26399993896484, 344.5699768066406), 1)
('re', Rect(72.26399993896484, 344.0899963378906, 135.62399291992188, 344.5699768066406), 1)
('re', Rect(135.6199951171875, 313.010009765625, 136.09999084472656, 344.0899963378906), 1)
('re', Rect(135.6199951171875, 344.0899963378906, 136.09999084472656, 344.5699768066406), 1)
('re', Rect(136.10000610351562, 344.0899963378906, 227.80401611328125, 344.5699768066406), 1)
('re', Rect(227.80999755859375, 313.010009765625, 228.2899932861328, 344.0899963378906), 1)
('re', Rect(227.80999755859375, 344.0899963378906, 228.2899932861328, 344.5699768066406), 1)
('re', Rect(228.2899932861328, 344.0899963378906, 298.6099853515625, 344.5699768066406), 1)
('re', Rect(298.6099853515625, 313.010009765625, 299.0899963378906, 344.0899963378906), 1)
('re', Rect(298.6099853515625, 344.0899963378906, 299.0899963378906, 344.5699768066406), 1)
('re', Rect(299.0899963378906, 344.0899963378906, 376.6340026855469, 344.5699768066406), 1)
('re', Rect(376.6300048828125, 313.010009765625, 377.1100158691406, 344.0899963378906), 1)
('re', Rect(376.6300048828125, 344.0899963378906, 377.1100158691406, 344.5699768066406), 1)
('re', Rect(377.1099853515625, 344.0899963378906, 447.4539794921875, 344.5699768066406), 1)
('re', Rect(447.4599914550781, 313.010009765625, 447.94000244140625, 344.0899963378906), 1)
('re', Rect(447.4599914550781, 344.0899963378906, 447.94000244140625, 344.5699768066406), 1)
('re', Rect(447.94000244140625, 344.0899963378906, 525.4600219726562, 344.5699768066406), 1)
('re', Rect(525.4600219726562, 313.010009765625, 525.9400024414062, 344.0899963378906), 1)
('re', Rect(525.4600219726562, 344.0899963378906, 525.9400024414062, 344.5699768066406), 1)
('re', Rect(525.4600219726562, 344.0899963378906, 525.9400024414062, 344.5699768066406), 1)

0 replies

JorjMcKie · 2022-03-30T11:28:48Z

JorjMcKie
Mar 30, 2022
Maintainer

As I wrote: your PDF creator has decided not to draw lines, but thin rectangles instead. MS Word and LibreOffice always do the same when exporting office documents to PDF.
The only thing you can do is to use some heuristics that handle sufficiently thin rectangles the way you would handle lines.
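
A heuristic of this kind could be as simple as the sketch below. Rectangles are plain (x0, y0, x1, y1) tuples standing in for the Rect objects returned by get_drawings(); the function name classify_rect and the 1-point threshold are assumptions made for illustration. The sample values are rounded from the rectangle listing posted earlier in this thread.

```python
def classify_rect(rect, max_thickness=1.0):
    """Label a rectangle as a horizontal rule, vertical rule, dot or box."""
    x0, y0, x1, y1 = rect
    width, height = abs(x1 - x0), abs(y1 - y0)
    thin_w = width <= max_thickness
    thin_h = height <= max_thickness
    if thin_w and thin_h:
        return "dot"    # tiny square where two rules meet
    if thin_h:
        return "hline"  # wide but flat: acts as a horizontal line
    if thin_w:
        return "vline"  # tall but narrow: acts as a vertical line
    return "box"        # a real rectangle, e.g. a filled header cell


samples = [
    (72.264, 72.0, 135.624, 72.48),    # top border of the first column
    (71.784, 72.476, 72.264, 113.9),   # left border of the header row
    (71.784, 72.0, 72.264, 72.48),     # top-left corner
    (72.264, 72.476, 135.624, 113.9),  # first header cell
]
labels = [classify_rect(r) for r in samples]
print(labels)
```

The threshold would need tuning per document, since border thickness differs between PDF creators.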

1 reply

alphomeg Mar 30, 2022
Author

ahh, i see it now. Thanks now I know what the problem is

JorjMcKie · 2022-03-30T12:05:08Z

JorjMcKie
Mar 30, 2022
Maintainer

I am taking the liberty to change the issue title to something which gives others an idea what it is all about.

0 replies

alphomeg · 2022-03-30T19:47:41Z

alphomeg
Mar 30, 2022
Author

Hi, I've found a solution to my problem: by using page.get_text("dict", clip=shape.rect) I can easily get all the text, with bboxes, that is contained in that shape only, and can easily insert it back for re-creation. shape.rect was what I was looking for: the bounding box of the shape only.

Thank you for such a beautiful package. :)
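
The bounding-box idea in this post can be illustrated without PyMuPDF. The sketch below is a stand-in only: union_bbox and spans_inside are hypothetical helpers, the coordinates are made up, and the span dictionaries merely mimic the "bbox" and "text" keys of get_text("dict") output. How PyMuPDF itself decides what falls inside a clip rectangle may differ in detail.

```python
def union_bbox(rects):
    """Smallest rectangle covering all (x0, y0, x1, y1) tuples in rects."""
    x0s, y0s, x1s, y1s = zip(*rects)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def spans_inside(spans, bbox):
    """Texts of the spans whose own bbox lies fully inside bbox."""
    bx0, by0, bx1, by1 = bbox
    return [
        s["text"]
        for s in spans
        if s["bbox"][0] >= bx0 and s["bbox"][1] >= by0
        and s["bbox"][2] <= bx1 and s["bbox"][3] <= by1
    ]


# Top and bottom border of a made-up table, plus three made-up text spans.
table_rects = [(72, 72, 526, 72.5), (72, 344, 526, 344.5)]
spans = [
    {"bbox": (80, 75, 130, 85), "text": "header cell"},
    {"bbox": (80, 120, 130, 130), "text": "body cell"},
    {"bbox": (80, 400, 130, 410), "text": "paragraph below table"},
]
table_bbox = union_bbox(table_rects)
table_text = spans_inside(spans, table_bbox)
print(table_bbox, table_text)
```

Using one rectangle for the whole shape sidesteps the thin-rectangle problem, because body cells no longer need their own enclosing rects.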

0 replies
