`TFPredictor.multi_table_predict(sort_row_col_indexes=True)` gives incorrect results when there are empty and multi-column/row cells

Take this page (and predict it):

<img width="613" height="1009" alt="Image" src="https://github.com/user-attachments/assets/5baa76bc-0faf-4426-bdd3-e4e123ad19ea" />

Now use the `TFPredictor` to predict the table with bounding box `(91, 465, 526, 676)`, that is, the table at the bottom of the page *including* the figure to the right.  This gives the following OTSL sequence:

```json
{
  "rs_seq": [
    "ched", "lcel", "lcel", "lcel", "lcel", "lcel", "nl",
    "ched", "ched", "ched", "ched", "ched", "ched", "nl",
    "fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
    "fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
    "fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
    "fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl",
    "fcel", "fcel", "fcel", "fcel", "ucel", "rhed", "nl"
  ]
}
```

Note that the first row contains a column header that spans 6 columns.  Note also that the fifth column contains a cell that spans 6 rows and contains no text (a "ched" in the second row, which has no text, with five "ucel" cells under it), so all of its cells will be considered "empty" by the `CellMatcher` and discarded.
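To make the two spans concrete, here is a minimal sketch that reads them off the sequence above. It assumes the usual OTSL reading where "lcel" merges with the cell to its left and "ucel" merges with the cell above; `split_rows`, `col_span` and `row_span` are hypothetical helpers written for this issue, not functions from `docling-ibm-models`.

```python
# The OTSL sequence from the prediction above, written compactly.
rs_seq = (
    ["ched"] + ["lcel"] * 5 + ["nl"]
    + ["ched"] * 6 + ["nl"]
    + (["fcel"] * 4 + ["ucel", "rhed", "nl"]) * 5
)


def split_rows(seq):
    """Split a flat OTSL token list into rows at each "nl" token."""
    rows, row = [], []
    for tok in seq:
        if tok == "nl":
            rows.append(row)
            row = []
        else:
            row.append(tok)
    if row:  # tolerate a sequence without a trailing "nl"
        rows.append(row)
    return rows


def col_span(rows, r, c):
    """Count the cell at (r, c) plus the "lcel" tokens to its right."""
    span = 1
    while c + span < len(rows[r]) and rows[r][c + span] == "lcel":
        span += 1
    return span


def row_span(rows, r, c):
    """Count the cell at (r, c) plus the "ucel" tokens below it."""
    span = 1
    while r + span < len(rows) and rows[r + span][c] == "ucel":
        span += 1
    return span


rows = split_rows(rs_seq)
print(len(rows), len(rows[0]))  # 7 rows of 6 columns
print(col_span(rows, 0, 0))     # header in the first row: spans 6 columns
print(row_span(rows, 1, 4))     # "ched" in the fifth column: spans 6 rows
```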

The [code in `multi_table_predict` for `sort_row_col_indexes`](https://github.com/docling-project/docling-ibm-models/blob/main/docling_ibm_models/tableformer/data_management/tf_predictor.py#L511) claims that "ID's returned by Tableformer are sequential, but might contain gaps".  But this is false.  Those IDs *do not come from TableFormer*, they come from the `CellMatcher`!  And if they contain gaps, it's because there are empty cells.  Note that with `do_matching=False`, where no cell matching is done, the IDs are entirely sequential and *never* contain gaps.

Why is this a problem?  Because, while `sort_row_col_indexes` compresses the *start* row/column indexes, it takes the `col_span` and `row_span` attributes as is and uses them to calculate the *end* indexes.  This means that, if there is an empty column/row to the right of/below a cell which spans that column/row, the cell's span attributes and its `end_col_offset_idx`/`end_row_offset_idx` will be incorrect, as they don't take into account the empty column/row which has been "sorted" out of existence.
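The effect can be reproduced with a simplified model of that behavior. This is a sketch, not the actual code in `tf_predictor.py`: it assumes cells are plain dicts, that the start index is stored as `start_col_offset_idx` (named by analogy with `end_col_offset_idx`), and that end indexes are exclusive. Only the first two rows of the example table are modelled, with the empty fifth column (index 4) already discarded.

```python
def compress_start_cols_only(cells):
    """Renumber start columns to close gaps, but derive the end index
    from the unchanged col_span (the behavior described above)."""
    surviving = sorted({c["start_col_offset_idx"] for c in cells})
    remap = {old: new for new, old in enumerate(surviving)}
    result = []
    for c in cells:
        start = remap[c["start_col_offset_idx"]]
        result.append(
            {
                **c,
                "start_col_offset_idx": start,
                "end_col_offset_idx": start + c["col_span"],
            }
        )
    return result


# Row 1: one header spanning all 6 original columns.
cells = [{"start_col_offset_idx": 0, "col_span": 6}]
# Row 2: single-column headers; column 4 is missing because it was empty.
cells += [{"start_col_offset_idx": col, "col_span": 1} for col in (0, 1, 2, 3, 5)]

compressed = compress_start_cols_only(cells)
num_cols = max(c["start_col_offset_idx"] for c in compressed) + 1
header = compressed[0]

print(num_cols)                      # 5 columns remain after compression
print(header["end_col_offset_idx"])  # 6: the header now ends past the table
```

The header still claims 6 columns in a table that has only 5 left.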

Ideally you should just not use `sort_row_col_indexes` since it doesn't do what it claims to do.  But if you insist on sorting away empty rows and columns, the code should be updated to remap end column/row indexes to the newly "compressed" indexes.

Note that "col_span" and "row_span" are totally redundant with the start/end indexes, but I suppose they could also be recalculated based on these.
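One possible shape for that fix, again as a sketch under the same assumptions (dict cells, a `start_col_offset_idx` key, exclusive end indexes) rather than a patch against the real code: map each original end index to the number of surviving start columns before it, then recompute the span from the new start and end. `compress_cols` is a hypothetical name, and rows would be handled the same way.

```python
from bisect import bisect_left


def compress_cols(cells):
    """Close gaps in column indexes, remapping both start and end.

    Input indexes are in the original (uncompressed) grid; end indexes
    are assumed to be exclusive.
    """
    surviving = sorted({c["start_col_offset_idx"] for c in cells})
    result = []
    for c in cells:
        # Position of the start column among the surviving columns.
        new_start = bisect_left(surviving, c["start_col_offset_idx"])
        # Number of surviving columns strictly before the old end index.
        new_end = bisect_left(surviving, c["end_col_offset_idx"])
        result.append(
            {
                **c,
                "start_col_offset_idx": new_start,
                "end_col_offset_idx": new_end,
                "col_span": new_end - new_start,  # recomputed, not copied
            }
        )
    return result


# Row 1: header spanning original columns 0..5 (exclusive end 6).
cells = [{"start_col_offset_idx": 0, "end_col_offset_idx": 6, "col_span": 6}]
# Row 2: single-column headers; column 4 was empty and discarded.
cells += [
    {"start_col_offset_idx": col, "end_col_offset_idx": col + 1, "col_span": 1}
    for col in (0, 1, 2, 3, 5)
]

fixed = compress_cols(cells)
print(fixed[0])   # header: start 0, end 5, col_span 5
print(fixed[-1])  # last header: start 4, end 5, col_span 1
```

Because a cell's own start column always survives, the remapped span is never smaller than 1.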
