get_drawings() can't detect line #925

@HeroadZ

Describe the bug (mandatory)

get_drawings() can't detect lines: after upgrading PyMuPDF, the drawings it returns contain only rects and no line items.

To Reproduce (mandatory)

Here is the code you wrote in a past issue for recognizing tables by their lines. After I updated PyMuPDF, I no longer get any lines, only rects.

import fitz  # PyMuPDF
from itertools import groupby
from pprint import pprint


def get_table_location(page: fitz.Page) -> [fitz.Rect]:
    """
    Get the location of tables in page
    by finding horizontal lines with same length

    Parameters
    ----------
    page: page object of pdf

    Returns
    -------
    table_rects: rectangles that contain tables
    """

    # make a list of horizontal lines
    # each line is represented by y and length
    hor_lines = []
    paths = page.getDrawings()
    pprint(paths)
    for p in paths:
        for item in p["items"]:
            if item[0] == "l":  # this is a line item
                p1 = item[1]  # start point
                p2 = item[2]  # stop point
                if p1.y == p2.y:  # line horizontal?
                    hor_lines.append((p1.y, p2.x - p1.x))  # potential table delimiter

    # a table is assumed to exist when at least 3 lines share the same length
    table_rects = []
    # sort by length so that groupby sees equal keys next to each other
    hor_lines.sort(key=lambda x: x[1])
    # get the top-left and bottom-right points of each table
    for k, g in groupby(hor_lines, key=lambda x: x[1]):
        g = list(g)
        if len(g) >= 3:  # a table always has at least 3 lines
            g.sort(key=lambda x: x[0])  # sort by y value
            top_left = fitz.Point(0, g[0][0])
            bottom_right = fitz.Point(page.rect.width, g[-1][0])
            table_rects.append(fitz.Rect(top_left, bottom_right))

    return table_rects

This is the sample file:
https://drive.google.com/file/d/1ww1ZiBdQLWTHPmNbadTWXA_LU-rUW2tt/view?usp=sharing
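
One possible workaround, sketched under the assumption that the table rules in this file are now reported as thin "re" (rectangle) items rather than "l" (line) items: collect horizontal lines from both item types. The helper name `horizontal_lines_from_paths` and the `max_thickness` threshold are hypothetical and not part of PyMuPDF. The function only relies on the `x`/`y` and `x0`/`y0`/`x1`/`y1` attributes of the points and rects inside the path items.

```python
def horizontal_lines_from_paths(paths, max_thickness=2.0):
    """Collect (y, length) pairs from line items and thin, wide rectangles.

    `paths` is expected to look like the list returned by the page's
    drawings method: each path is a dict whose "items" list holds tuples
    such as ("l", p1, p2) or ("re", rect, ...).
    `max_thickness` is an assumed threshold, in points, below which a
    rectangle is treated as a drawn line.
    """
    hor_lines = []
    for path in paths:
        for item in path["items"]:
            if item[0] == "l":  # a genuine line item
                p1, p2 = item[1], item[2]
                if p1.y == p2.y:  # horizontal line
                    hor_lines.append((p1.y, abs(p2.x - p1.x)))
            elif item[0] == "re":  # a rectangle, possibly a line drawn as a thin box
                r = item[1]
                height = abs(r.y1 - r.y0)
                width = abs(r.x1 - r.x0)
                if height <= max_thickness and width > height:
                    # use the vertical centre of the thin box as the line's y
                    hor_lines.append(((r.y0 + r.y1) / 2, width))
    return hor_lines
```

The result could replace the `hor_lines` list built inside `get_table_location`, for example `hor_lines = horizontal_lines_from_paths(page.getDrawings())`. Unlike the original loop, this sketch stores the absolute length, so lines drawn right-to-left group together with lines drawn left-to-right. Rectangle coordinates are often not exactly equal across rows, so rounding the lengths before grouping may also be needed.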

Expected behavior (optional)

Detect lines and detect tables

Your configuration (mandatory)

  • Operating system: Windows 10
  • Python version: 3.8.5
  • PyMuPDF version: 1.18.9

Labels

not a bug (user error / unable to reproduce)