Closed
Labels: not a bug / user error / unable to reproduce
Description
Please provide all mandatory information!
Describe the bug (mandatory)
get_drawings() no longer detects lines
To Reproduce (mandatory)
Here is the code you wrote in a past issue for recognizing tables by their lines. After I updated PyMuPDF, I no longer get any line items, only rects.
import fitz
from itertools import groupby
from pprint import pprint


def get_table_location(page: fitz.Page) -> [fitz.Rect]:
    """
    Get the location of tables in page
    by finding horizontal lines with same length

    Parameters
    ----------
    page: page object of pdf

    Returns
    -------
    table_rects: rectangles that contain tables
    """
    # make a list of horizontal lines
    # each line is represented by y and length
    hor_lines = []
    paths = page.getDrawings()
    pprint(paths)
    for p in paths:
        for item in p["items"]:
            if item[0] == "l":  # this is a line item
                p1 = item[1]  # start point
                p2 = item[2]  # stop point
                if p1.y == p2.y:  # line horizontal?
                    hor_lines.append((p1.y, p2.x - p1.x))  # potential table delimiter

    # find whether table exists by number of lines with same length > 3
    table_rects = []
    # sort the list for ensuring the correct group by same keys
    hor_lines.sort(key=lambda x: x[1])
    # getting the top-left point and bottom-right point of table
    for k, g in groupby(hor_lines, key=lambda x: x[1]):
        g = list(g)
        if len(g) >= 3:  # number of lines of table will always >= 3
            g.sort(key=lambda x: x[0])  # sort by y value
            top_left = fitz.Point(0, g[0][0])
            bottom_right = fitz.Point(page.rect.width, g[-1][0])
            table_rects.append(fitz.Rect(top_left, bottom_right))
    return table_rects

This is the sample file:
https://drive.google.com/file/d/1ww1ZiBdQLWTHPmNbadTWXA_LU-rUW2tt/view?usp=sharing
Expected behavior (optional)
Line items are detected, so that the tables can be located.
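A possible workaround, sketched under the assumption that the table rules in this file are now reported as thin rectangle items ("re") rather than line items ("l"): treat any sufficiently flat rectangle as a horizontal line as well. The helper name `horizontal_segments`, the `max_thickness` and `tol` thresholds, and the `Point`/`Rect` stand-ins in the demo are hypothetical and only illustrate the idea; with a real page, `paths` would be the list returned by the drawings call.

```python
from collections import namedtuple


def horizontal_segments(paths, max_thickness=1.0, tol=0.5):
    """Collect (y, length) pairs from line items and thin rectangle items.

    `paths` is expected to look like the drawings list: dicts with an
    "items" list whose entries start with a type code ("l", "re", ...).
    """
    segments = []
    for path in paths:
        for item in path["items"]:
            kind = item[0]
            if kind == "l":  # line item: start point, stop point
                p1, p2 = item[1], item[2]
                if abs(p1.y - p2.y) <= tol:  # roughly horizontal
                    y = round((p1.y + p2.y) / 2, 1)
                    segments.append((y, round(abs(p2.x - p1.x), 1)))
            elif kind == "re":  # rectangle item: item[1] is the rectangle
                r = item[1]
                height = abs(r.y1 - r.y0)
                width = abs(r.x1 - r.x0)
                # assumption: a very flat rectangle is a drawn rule
                if height <= max_thickness and width > height:
                    y = round((r.y0 + r.y1) / 2, 1)
                    segments.append((y, round(width, 1)))
    return segments


# Stand-in objects (hypothetical) so the sketch runs without a PDF.
Point = namedtuple("Point", "x y")
Rect = namedtuple("Rect", "x0 y0 x1 y1")

demo_paths = [
    {
        "items": [
            ("l", Point(10, 100), Point(210, 100)),  # true line item
            ("re", Rect(10, 150, 210, 151)),         # thin rect, counts as a line
            ("re", Rect(10, 10, 50, 60)),            # ordinary box, ignored
        ]
    }
]
demo_segments = horizontal_segments(demo_paths)
```

The resulting `(y, length)` pairs have the same shape as `hor_lines` in the function above, so the grouping step could consume them unchanged. Rounding the values is a deliberate choice here, since exact float comparison would split lines that differ only by a tiny fraction of a point.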
Your configuration (mandatory)
- Operating system: Windows 10
- Python version: 3.8.5
- PyMuPDF version: 1.18.9