Skip to content

[kensho_kenverters] Add splitting of long tables with project row headers.#35

Draft
libinliang866 wants to merge 12 commits intomainfrom
libin/split_long_tables
Draft

[kensho_kenverters] Add splitting of long tables with project row headers.#35
libinliang866 wants to merge 12 commits intomainfrom
libin/split_long_tables

Conversation

@libinliang866
Copy link
Contributor

@libinliang866 libinliang866 commented Dec 30, 2025

Introduction

It is common that long tables have several sections and there are projected row headers for each section. An example table is as below:
截屏2026-01-06 上午11 07 52

Here Assets, Liabilities and Equity are projected row headers for each section. Currently, the table structure recognition model in Extract can detect the structure of the tables and the projected row headers structure in the table.

The objective of this PR is to split the long table into different sections and link the subtables with the projected row header. At the same time, we need to copy the column headers of the original long table to each subtable. For example, for extracted dataframe above, we hope to split it into dataframe as below:

Revenues:
截屏2026-01-06 上午11 09 16

Expense:
截屏2026-01-06 上午11 09 32

Design Doc

The link of the design doc can be found in the link.

@libinliang866 libinliang866 changed the title [WIP][kensho_kenverters] Add splitting of long tables with project row headers. [kensho_kenverters] Add splitting of long tables with project row headers. Jan 5, 2026
@libinliang866 libinliang866 marked this pull request as ready for review January 5, 2026 19:23
Copy link

@andrew-titus andrew-titus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments -- could you also flesh out the PR description with some more detail on what this is for? Use cases, etc. Thanks!

Comment on lines 368 to 373
max_column_header_row_id = None
for row_idx in range(n_row):
if row_idx in column_header_rows:
max_column_header_row_id = row_idx
else:
break

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like we ever actually use column_header_rows except to find this max -- just find the max during the above loop instead? Otherwise this just wastes some memory and a bit of latency, especially for bigger tables

max_column_header_row_id = row_idx
else:
break
project_row_headers_rows.sort()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the first loop, we are just appending to this list and checking for membership (O(N) for a list), just to ultimately sort it here at the end and then return it -- let's just use a set instead for faster membership checks, and then just return the sorted set?

subtables_row_id_list.append(list(range(row_id_cursor, row_idx)))
row_id_cursor = row_idx
non_project_row_header_row_id_list = []
elif row_idx not in project_row_headers_rows:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can just be an else -- it's mutually exclusive with the above elif check

Suggested change
elif row_idx not in project_row_headers_rows:
else:

return tables_grid_and_structure


def _get_max_col_header_row_and_project_header_rows(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the functions in this file have pretty complex logic -- could you add a few more comments on what's happening in each substep (e.g., each for-loop) to aid the reader?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Drew! Thanks for your comments!

split_long_tables: bool = False,
) -> list[pd.DataFrame]:
"""Extract Extract output's tables and convert them to a list of pandas DataFrames.
"""Extract output's tables and convert them to a list of pandas DataFrames.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: wording is a bit clunky here

Suggested change
"""Extract output's tables and convert them to a list of pandas DataFrames.
"""Extract tables from output and convert them to a list of pandas DataFrames.

Copy link

@maxim-sokolov-kensho maxim-sokolov-kensho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general comment: this code change is not easy to comprehend. I wonder how much of this is incidental complexity. It would be nice to step back and figure out a more composable way of achieving the end result. Maybe, we need a mild refactor of the package, but I would try to simplify this code. A potential benefit is that it might simplify future extensions of the package functionality.

Comment on lines +35 to +37
project_row_headers: list[str]
table_uid: str
subtable_id: int | None = None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have hard time understanding these fields in the Table data structure --- we need to update the doc-string explaining what these fields are. The current description is not informative.

"""Convert list of table annotations to list of cells.
table_string_grid: TableStringGridType,
) -> tuple[list[Cell], list[str]]:
"""Convert list of table annotations to list of cells and return the project row header of the table.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

project row header -> projected row headers

def _get_max_col_header_row_and_project_header_rows(
annotations: list[AnnotationModel], n_row: int
) -> tuple[int | None, list[int]]:
"""Get the maximum consecutive column header row starting from the initial and project header rows.""" # noqa: E501

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From reading the doc-string, I couldn't figure out what this function is doing.
First, I would expand the doc-string. Second, potentially, we might want to refactor this function into two functions: it seems that we do two "things" here. Can we do them independently or compose in some way?

@libinliang866 libinliang866 marked this pull request as draft January 7, 2026 03:31
@libinliang866
Copy link
Contributor Author

A general comment: this code change is not easy to comprehend. I wonder how much of this is incidental complexity. It would be nice to step back and figure out a more composable way of achieving the end result. Maybe, we need a mild refactor of the package, but I would try to simplify this code. A potential benefit is that it might simplify future extensions of the package functionality.

Hi Max! Thanks for your comments!

@libinliang866 libinliang866 marked this pull request as ready for review January 7, 2026 20:48
"""Converted table types consisting of the table as a pandas DataFrame and its location(s)."""
"""Converted table types consisting of the table as a pandas DataFrame and its location(s).

Note:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is helpful. Can you also add a description for projected_row_headers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Max! Yes, will do!

Comment on lines 372 to 401
if table_grid_and_structure.table_category_type in (
ContentCategory.TABLE.value,
ContentCategory.TABLE_OF_CONTENTS.value,
) or (
include_figure_extracted_table
and table_grid_structure.table_category_type
and table_grid_and_structure.table_category_type
== ContentCategory.FIGURE_EXTRACTED_TABLE.value
):
table_df = convert_table_to_pd_df(
table_grid_structure.table_string_grid,
use_first_row_as_header=use_first_row_as_header,
if split_long_tables:
column_header_row_ids, project_row_headers_row_ids = (
get_column_headers_and_project_row_headers_row_ids(
table_grid_and_structure.table_structure_annotations
)
)
if table_can_be_split(
column_header_row_ids, project_row_headers_row_ids
):
subtable_string_grid_list, _ = split_table_into_subtables(
table_grid_and_structure,
column_header_row_ids,
project_row_headers_row_ids,
)
for subtable_string_grid in subtable_string_grid_list:
table_dfs.append(
convert_table_to_pd_df(
subtable_string_grid,
use_first_row_as_header=use_first_row_as_header,
)
)
continue

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the body of this loop is quite long, and many things happen for each step --- can we extract it as a separate function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense! Will do!

Comment on lines 468 to 537
project_row_headers = extract_project_row_headers(
table_grid_and_structure.table_structure_annotations,
table_grid_and_structure.table_string_grid,
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment to the above for this piece of code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense! Will do!

):
if (
annotation.data.is_column_header
and column_row_id not in column_header_row_ids

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are adding to a set, I don't think we need this check

column_header_row_ids.add(column_row_id)
elif (
annotation.data.is_projected_row_header
and column_row_id not in project_row_headers_row_ids

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are adding to a set, I don't think we need this check

subtable_row_ids_list = []

# Split the row ids (after column headers) into a list of sublist of row ids of subtables.
row_id_cursor = initial_row_id

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is row_id_cursor? Maybe, a more informative name?


# Split the row ids (after column headers) into a list of sublist of row ids of subtables.
row_id_cursor = initial_row_id
non_project_row_header_row_id_list: list[int] = []

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that the cursor above and this list convey a similar information. Can we use just one of them? I would personally go with the list, but I don't have a strong preference

@libinliang866 libinliang866 marked this pull request as draft January 9, 2026 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants