Description
Take this page (and predict it):
Now use the TFPredictor to predict the table with bounding box (91, 465, 526, 676), that is, the table at the bottom of the page including the figure to the right. This gives the following OTSL sequence:
{
"rs_seq": [
"ched", "lcel", "lcel", "lcel", "lcel", "lcel", "nl",
"ched", "ched", "ched", "ched", "ched", "ched", "nl",
"fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
"fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
"fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
"fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
"fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl"
]
}
Note that there is a column header in the first row that spans 6 columns. Note also that the fifth column contains a cell that spans 6 rows and contains no text (it has a "ched", which has no text, and five "ucel" cells under it), and thus all of its cells will be considered "empty" by the CellMatcher and discarded.
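To make the layout easier to see, here is a minimal sketch that splits the sequence above into rows at each "nl" token and pulls out the fifth column. The helper name `to_grid` is hypothetical and is not part of TableFormer or the CellMatcher; it only reshapes the token list quoted in this issue.

```python
# The OTSL sequence quoted above, copied verbatim.
rs_seq = [
    "ched", "lcel", "lcel", "lcel", "lcel", "lcel", "nl",
    "ched", "ched", "ched", "ched", "ched", "ched", "nl",
    "fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
    "fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
    "fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
    "fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
    "fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
]


def to_grid(seq):
    """Hypothetical helper: split a flat OTSL token list into rows at "nl"."""
    grid, row = [], []
    for tok in seq:
        if tok == "nl":
            grid.append(row)
            row = []
        else:
            row.append(tok)
    if row:  # tolerate a final row that has no trailing "nl"
        grid.append(row)
    return grid


grid = to_grid(rs_seq)
# Column index 4 is the fifth column discussed above.
fifth_column = [r[4] for r in grid]
print(len(grid), "rows x", len(grid[0]), "columns")
print(fifth_column)
```

From the second row down, the fifth column is a single "ched" followed only by "ucel" continuation tokens, which is the 6-row textless cell described above.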
The code in multi_table_predict for sort_row_col_indexes claims that "ID's returned by Tableformer are sequential, but might contain gaps". But this is false. Those IDs do not come from TableFormer, they come from the CellMatcher! And if they contain gaps, it's because there are empty cells. Note that in the case of do_matching=False, where no cell matching is done, the IDs are entirely sequential and never contain gaps.
Why is this a problem? Because, while sort_row_col_indexes compresses the start row/column indexes, it uses the col_span and row_span attributes as is, and uses them to calculate the end indexes. This means that, if there is an empty column/row to the right of/below a cell which spans that column/row, its span attributes and its end_col_offset_idx/end_row_offset_idx will be incorrect, as they don't take into account the empty column/row which has been "sorted" out of existence.
Ideally you should just not use sort_row_col_indexes since it doesn't do what it claims to do. But if you insist on sorting away empty rows and columns, the code should be updated to remap end column/row indexes to the newly "compressed" indexes.
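As a rough illustration of the remapping suggested above, here is a sketch for the column axis only (rows would work the same way). It is not the actual sort_row_col_indexes code: the function names, the plain-dict cell representation, the `start_col_offset_idx` key, and the assumption that end indexes are exclusive (end = start + span) are all mine.

```python
from bisect import bisect_left


def compress_starts_only(cells):
    """Mimics the behaviour described above (assumed, not the real code):
    start indexes are compressed, end = new start + ORIGINAL span."""
    starts = sorted({c["start_col_offset_idx"] for c in cells})
    remap = {old: new for new, old in enumerate(starts)}
    out = []
    for c in cells:
        s = remap[c["start_col_offset_idx"]]
        out.append({**c, "start_col_offset_idx": s,
                    "end_col_offset_idx": s + c["col_span"]})
    return out


def compress_with_end_remap(cells):
    """Sketch of the fix: remap the (exclusive) end index into the same
    compressed index space, then recompute the span from start and end."""
    starts = sorted({c["start_col_offset_idx"] for c in cells})
    out = []
    for c in cells:
        # Count the surviving columns that lie before the start / before the end.
        s = bisect_left(starts, c["start_col_offset_idx"])
        e = bisect_left(starts, c["end_col_offset_idx"])
        out.append({**c, "start_col_offset_idx": s,
                    "end_col_offset_idx": e, "col_span": e - s})
    return out


def cell(start, span):
    """Build a hypothetical cell dict with an exclusive end index."""
    return {"start_col_offset_idx": start,
            "end_col_offset_idx": start + span,
            "col_span": span}


# Header spanning all 6 columns, plus one body row in which column 4
# (the empty fifth column) has already been discarded.
cells = [cell(0, 6)] + [cell(i, 1) for i in (0, 1, 2, 3, 5)]

buggy = compress_starts_only(cells)
fixed = compress_with_end_remap(cells)
print("starts only:", buggy[0]["end_col_offset_idx"], buggy[0]["col_span"])
print("end remap:  ", fixed[0]["end_col_offset_idx"], fixed[0]["col_span"])
```

With only the starts compressed, the header still ends at column 6 even though just 5 columns remain. With the end index remapped, it ends at 5 and its span shrinks to 5.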
Note that "col_span" and "row_span" are totally redundant with the start/end indexes, but I suppose they could also be recalculated based on these.