fix: handle case of overlapping cells in contract_cells_into_lines_v2 #105
dhdaines wants to merge 1 commit into docling-project:main
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Here is a Python implementation of convex polygon intersection (based on https://www.gorillasun.de/blog/an-algorithm-for-polygon-intersections/). I've slimmed it down: since we only care whether the polygons intersect, not where, it uses only the point-in-polygon test on the vertices. It's simple enough that I should be able to implement it in C++ without losing my patience with the compiler...

```python
from typing import Iterator, Sequence, Tuple

Point = Tuple[float, float]


def edges(poly: Sequence[Point]) -> Iterator[Tuple[Point, Point]]:
    """Iterate over edges of a polygon."""
    for i in range(len(poly)):
        # poly[-1] closes the polygon on the first iteration
        yield poly[i - 1], poly[i]


def pnpoly(point: Point, poly: Sequence[Point]) -> bool:
    """Use even-odd rule to determine point-in-polygon."""
    x, y = point
    inside = False
    for (xi, yi), (xj, yj) in edges(poly):
        # Only edges that straddle the horizontal line through the point
        # can be crossed (this also guarantees yj != yi below)
        if (yi > y) != (yj > y):
            if x < (xj - xi) * (y - yi) / (yj - yi) + xi:
                inside = not inside
    return inside


def polys_intersect(a: Sequence[Point], b: Sequence[Point]) -> bool:
    """Determine if two convex polygons intersect (including the case
    where one is contained entirely in the other).

    Note: only vertices are tested, so an overlap where no vertex of
    either polygon lies inside the other (e.g. two rectangles forming
    a cross) is not detected."""
    for point in a:
        if pnpoly(point, b):
            return True
    for point in b:
        if pnpoly(point, a):
            return True
    return False
```
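One caveat with the vertex-in-polygon check above: it returns `False` for two polygons that cross without either having a vertex inside the other, such as two rectangles forming a plus sign. If that case matters, a separating axis test is one alternative for convex polygons. The following is only a sketch and is not part of this PR; `project` and `convex_polys_overlap` are hypothetical names chosen for illustration.

```python
from typing import Iterator, Sequence, Tuple

Point = Tuple[float, float]


def edges(poly: Sequence[Point]) -> Iterator[Tuple[Point, Point]]:
    """Iterate over edges of a polygon, including the closing edge."""
    for i in range(len(poly)):
        yield poly[i - 1], poly[i]


def project(poly: Sequence[Point], axis: Point) -> Tuple[float, float]:
    """Return the (min, max) extent of a polygon projected onto an axis."""
    dots = [x * axis[0] + y * axis[1] for x, y in poly]
    return min(dots), max(dots)


def convex_polys_overlap(a: Sequence[Point], b: Sequence[Point]) -> bool:
    """Separating axis test for two convex polygons.

    The polygons are disjoint if the projections onto some edge normal
    of either polygon do not overlap. Polygons that merely touch are
    reported as overlapping here; use <= in the comparison to exclude
    that case."""
    for poly in (a, b):
        for (x0, y0), (x1, y1) in edges(poly):
            axis = (y0 - y1, x1 - x0)  # a normal to this edge
            min_a, max_a = project(a, axis)
            min_b, max_b = project(b, axis)
            if max_a < min_b or max_b < min_a:
                return False  # found a separating axis
    return True


# Two rectangles forming a plus sign: they overlap, but no vertex of
# either one lies inside the other.
horizontal = [(0.0, 1.0), (3.0, 1.0), (3.0, 2.0), (0.0, 2.0)]
vertical = [(1.0, 0.0), (2.0, 0.0), (2.0, 3.0), (1.0, 3.0)]
far_away = [(10.0, 10.0), (11.0, 10.0), (11.0, 11.0), (10.0, 11.0)]
```

Whether touching cells should count as overlapping is a design choice; adjacent character boxes that share an edge would be flagged by the strict comparison used here.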
Signed-off-by: David Huggins-Daines <dhd@ecolingui.ca>
So, I've implemented the overlap detection, but it doesn't actually fix the issue, because the character bboxes are still wrong for Type3 fonts. It also causes problems elsewhere, for instance in https://github.com/docling-project/docling-parse/blob/main/docs/visualisations/table_of_contents_01.pdf.page_2.word.png, where the characters in the figures legitimately have overlapping bboxes. I'm closing this PR for the moment, though you may wish to keep the code for future use. I think a fix for the underlying Type3 font problem should be fairly easy, so I'll make a new PR for that.
Fixes: #99
As noted there, the (nicely refactored!) implementation no longer correctly handles the case where text cells overlap.
The fact that they overlap in the first place is most likely a separate and more complex bug, but this fixes the symptoms, at least.