[kensho_kenverters] Add splitting of long tables with project row headers. by libinliang866 · Pull Request #35 · kensho-technologies/kenverters

libinliang866 · 2025-12-30T21:54:19Z

Introduction

Long tables often contain several sections, each introduced by a projected row header. An example table is shown below:

Here, Assets, Liabilities and Equity are the projected row headers of their sections. The table structure recognition model in Extract already detects both the table structure and the projected row headers.

The objective of this PR is to split a long table into its sections and link each subtable to its projected row header. The column headers of the original long table also need to be copied to each subtable. For example, we want to split the extracted DataFrame above into DataFrames such as the following:

Revenues:

Expense:
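The intended behavior can be illustrated with a minimal sketch on a plain string grid. This is not the code in this PR: the function name `split_long_table`, its parameters, and the sample grid are assumptions made for illustration, and rows between the column headers and the first projected row header are simply ignored here.

```python
def split_long_table(grid, n_header_rows, projected_rows):
    """Split a string grid into subtables keyed by projected row header.

    Hypothetical helper: `n_header_rows` is the number of leading column
    header rows, `projected_rows` the row indices of projected row headers.
    """
    header = grid[:n_header_rows]
    # Each section runs from one projected row header to the next (or the end).
    bounds = sorted(projected_rows) + [len(grid)]
    subtables = {}
    for start, end in zip(bounds, bounds[1:]):
        # Use the first non-empty cell of the projected row as the section title.
        title = next((cell for cell in grid[start] if cell), "")
        # Copy the column headers of the long table to every subtable.
        subtables[title] = header + grid[start + 1:end]
    return subtables


grid = [
    ["Item", "2024", "2025"],   # column header row
    ["Assets", "", ""],         # projected row header
    ["Cash", "10", "12"],
    ["Inventory", "5", "7"],
    ["Liabilities", "", ""],    # projected row header
    ["Loans", "8", "6"],
]
subtables = split_long_table(grid, n_header_rows=1, projected_rows=[1, 4])
```

Each resulting grid starts with the shared column header row, so it could be handed to a grid-to-DataFrame conversion step on its own.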

Design Doc

The design doc is available via the link.

andrew-titus

A few comments -- could you also flesh out the PR description with some more detail on what this is for? Use cases, etc. Thanks!

andrew-titus · 2026-01-05T19:33:47Z

kensho_kenverters/output_to_tables.py

+    max_column_header_row_id = None
+    for row_idx in range(n_row):
+        if row_idx in column_header_rows:
+            max_column_header_row_id = row_idx
+        else:
+            break


It doesn't look like we ever actually use column_header_rows except to find this max -- just find the max during the above loop instead? Otherwise this just wastes some memory and a bit of latency, especially for bigger tables
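A rough sketch of this suggestion, tracking the maximum directly while iterating over the annotations. `CellAnnotation` is a hypothetical stand-in, not the package's `AnnotationModel`, and the sketch assumes the column header rows are contiguous from the top of the table; if they are not, the consecutive-run check in the diff would still be needed.

```python
from dataclasses import dataclass


@dataclass
class CellAnnotation:
    """Hypothetical stand-in for a table structure annotation."""
    row: int
    is_column_header: bool


def max_column_header_row(annotations):
    """Return the largest column header row index, or None if there is none."""
    max_row = None
    for annotation in annotations:
        # Keep a running maximum instead of collecting every header row.
        if annotation.is_column_header and (max_row is None or annotation.row > max_row):
            max_row = annotation.row
    return max_row


annotations = [
    CellAnnotation(row=0, is_column_header=True),
    CellAnnotation(row=1, is_column_header=True),
    CellAnnotation(row=2, is_column_header=False),
]
```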

andrew-titus · 2026-01-05T19:36:01Z

kensho_kenverters/output_to_tables.py

+            max_column_header_row_id = row_idx
+        else:
+            break
+    project_row_headers_rows.sort()


During the first loop, we are just appending to this list and checking for membership (O(N) for a list), just to ultimately sort it here at the end and then return it -- let's just use a set instead for faster membership checks, and then just return the sorted set?
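A small sketch of the set-based variant, assuming the input is an iterable of `(row_idx, is_projected_row_header)` pairs; that input shape and the function name are made up for illustration.

```python
def projected_row_header_rows(rows_with_flags):
    """Collect projected row header row indices and return them sorted."""
    seen = set()
    for row_idx, is_projected in rows_with_flags:
        if is_projected:
            # Adding to a set is idempotent, so no membership check is needed.
            seen.add(row_idx)
    # sorted() accepts any iterable and always returns a new list.
    return sorted(seen)


rows = projected_row_header_rows([(4, True), (1, True), (4, True), (2, False)])
```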

andrew-titus · 2026-01-05T19:37:01Z

kensho_kenverters/output_to_tables.py

+                subtables_row_id_list.append(list(range(row_id_cursor, row_idx)))
+                row_id_cursor = row_idx
+                non_project_row_header_row_id_list = []
+        elif row_idx not in project_row_headers_rows:


This can just be an else -- it's mutually exclusive with the above elif check

Suggested change

elif row_idx not in project_row_headers_rows:

else:

andrew-titus · 2026-01-05T19:39:06Z

kensho_kenverters/output_to_tables.py

    return tables_grid_and_structure


+def _get_max_col_header_row_and_project_header_rows(


A lot of the functions in this file have pretty complex logic -- could you add a few more comments on what's happening in each substep (e.g., each for-loop) to aid the reader?

Hi Drew! Thanks for your comments!

andrew-titus · 2026-01-05T19:40:01Z

kensho_kenverters/output_to_tables.py

+    split_long_tables: bool = False,
 ) -> list[pd.DataFrame]:
-    """Extract Extract output's tables and convert them to a list of pandas DataFrames.
+    """Extract output's tables and convert them to a list of pandas DataFrames.


Nit: wording is a bit clunky here

Suggested change

"""Extract output's tables and convert them to a list of pandas DataFrames.

"""Extract tables from output and convert them to a list of pandas DataFrames.

maxim-sokolov-kensho

A general comment: this code change is not easy to comprehend. I wonder how much of this is incidental complexity. It would be nice to step back and figure out a more composable way of achieving the end result. Maybe, we need a mild refactor of the package, but I would try to simplify this code. A potential benefit is that it might simplify future extensions of the package functionality.

maxim-sokolov-kensho · 2026-01-06T20:25:29Z

kensho_kenverters/extract_output_models.py

+    project_row_headers: list[str]
+    table_uid: str
+    subtable_id: int | None = None


I have a hard time understanding these fields in the Table data structure --- we need to update the doc-string explaining what these fields are. The current description is not informative.

maxim-sokolov-kensho · 2026-01-06T20:28:16Z

kensho_kenverters/output_to_tables.py

-    """Convert list of table annotations to list of cells.
+    table_string_grid: TableStringGridType,
+) -> tuple[list[Cell], list[str]]:
+    """Convert list of table annotations to list of cells and return the project row header of the table.


project row header -> projected row headers

maxim-sokolov-kensho · 2026-01-06T20:32:21Z

kensho_kenverters/output_to_tables.py

+def _get_max_col_header_row_and_project_header_rows(
+    annotations: list[AnnotationModel], n_row: int
+) -> tuple[int | None, list[int]]:
+    """Get the maximum consecutive column header row starting from the initial and project header rows."""  # noqa: E501


From reading the doc-string, I couldn't figure out what this function is doing.
First, I would expand the doc-string. Second, potentially, we might want to refactor this function into two functions: it seems that we do two "things" here. Can we do them independently or compose in some way?
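One possible shape for such a split, as a sketch only: the two helpers below are hypothetical names, not functions from this PR. The first finds the last row of the unbroken run of column header rows starting at row 0; the second keeps the projected row header rows that come after it.

```python
from __future__ import annotations

from typing import Iterable


def last_leading_header_row(header_rows: Iterable[int]) -> int | None:
    """Index of the last row in the unbroken run of header rows starting at row 0."""
    rows = set(header_rows)
    last = None
    row_idx = 0
    while row_idx in rows:
        last = row_idx
        row_idx += 1
    return last


def projected_rows_after(
    projected_rows: Iterable[int], last_header_row: int | None
) -> list[int]:
    """Sorted projected row header rows located below the column headers."""
    start = -1 if last_header_row is None else last_header_row
    return sorted(row for row in set(projected_rows) if row > start)


# The two steps compose without sharing any intermediate state.
last_header = last_leading_header_row([0, 1, 3])
body_sections = projected_rows_after([5, 2, 0], last_header)
```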

libinliang866 · 2026-01-07T13:56:46Z

A general comment: this code change is not easy to comprehend. I wonder how much of this is incidental complexity. It would be nice to step back and figure out a more composable way of achieving the end result. Maybe, we need a mild refactor of the package, but I would try to simplify this code. A potential benefit is that it might simplify future extensions of the package functionality.

Hi Max! Thanks for your comments!

maxim-sokolov-kensho · 2026-01-08T16:20:09Z

kensho_kenverters/extract_output_models.py

-    """Converted table types consisting of the table as a pandas DataFrame and its location(s)."""
+    """Converted table types consisting of the table as a pandas DataFrame and its location(s).
+
+    Note:


This is helpful. Can you also add a description for projected_row_headers?

Hi Max! Yes, will do!

maxim-sokolov-kensho · 2026-01-08T16:22:40Z

kensho_kenverters/output_to_tables.py

+        if table_grid_and_structure.table_category_type in (
            ContentCategory.TABLE.value,
            ContentCategory.TABLE_OF_CONTENTS.value,
        ) or (
            include_figure_extracted_table
-            and table_grid_structure.table_category_type
+            and table_grid_and_structure.table_category_type
            == ContentCategory.FIGURE_EXTRACTED_TABLE.value
        ):
-            table_df = convert_table_to_pd_df(
-                table_grid_structure.table_string_grid,
-                use_first_row_as_header=use_first_row_as_header,
+            if split_long_tables:
+                column_header_row_ids, project_row_headers_row_ids = (
+                    get_column_headers_and_project_row_headers_row_ids(
+                        table_grid_and_structure.table_structure_annotations
+                    )
+                )
+                if table_can_be_split(
+                    column_header_row_ids, project_row_headers_row_ids
+                ):
+                    subtable_string_grid_list, _ = split_table_into_subtables(
+                        table_grid_and_structure,
+                        column_header_row_ids,
+                        project_row_headers_row_ids,
+                    )
+                    for subtable_string_grid in subtable_string_grid_list:
+                        table_dfs.append(
+                            convert_table_to_pd_df(
+                                subtable_string_grid,
+                                use_first_row_as_header=use_first_row_as_header,
+                            )
+                        )
+                    continue


the body of this loop is quite long, and many things happen for each step --- can we extract it as a separate function?

Makes sense! Will do!
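A sketch of what such an extraction could look like, on plain string grids rather than the package's types. All names here (`split_grid`, `grids_for_table`, `collect_grids`) and the tuple-based table representation are assumptions for illustration; the actual conversion to DataFrames is left out.

```python
def split_grid(grid, n_header_rows, projected_rows):
    """Return one grid per section, each prefixed with the column header rows."""
    bounds = sorted(projected_rows) + [len(grid)]
    return [grid[:n_header_rows] + grid[start + 1:end] for start, end in zip(bounds, bounds[1:])]


def grids_for_table(grid, n_header_rows, projected_rows, split_long_tables):
    """Decide, for a single table, which grid(s) should be converted."""
    can_split = split_long_tables and n_header_rows > 0 and len(projected_rows) > 0
    return split_grid(grid, n_header_rows, projected_rows) if can_split else [grid]


def collect_grids(tables, split_long_tables=False):
    """The loop body shrinks to one call, with no `continue` needed."""
    out = []
    for grid, n_header_rows, projected_rows in tables:
        out.extend(grids_for_table(grid, n_header_rows, projected_rows, split_long_tables))
    return out


table = (
    [["Item", "2024"], ["Assets", ""], ["Cash", "10"], ["Liabilities", ""], ["Loans", "8"]],
    1,       # number of column header rows
    [1, 3],  # projected row header rows
)
split_result = collect_grids([table], split_long_tables=True)
whole_result = collect_grids([table], split_long_tables=False)
```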

maxim-sokolov-kensho · 2026-01-08T16:24:35Z

kensho_kenverters/output_to_tables.py

+            project_row_headers = extract_project_row_headers(
+                table_grid_and_structure.table_structure_annotations,
+                table_grid_and_structure.table_string_grid,
+            )


similar comment to the above for this piece of code

Makes sense! Will do!

maxim-sokolov-kensho · 2026-01-08T16:27:48Z

kensho_kenverters/tables_utils.py

+            ):
+                if (
+                    annotation.data.is_column_header
+                    and column_row_id not in column_header_row_ids


since we are adding to a set, I don't think we need this check

maxim-sokolov-kensho · 2026-01-08T16:27:54Z

kensho_kenverters/tables_utils.py

+                    column_header_row_ids.add(column_row_id)
+                elif (
+                    annotation.data.is_projected_row_header
+                    and column_row_id not in project_row_headers_row_ids


since we are adding to a set, I don't think we need this check

maxim-sokolov-kensho · 2026-01-08T16:47:12Z

kensho_kenverters/tables_utils.py

+    subtable_row_ids_list = []
+
+    # Split the row ids (after column headers) into a list of sublist of row ids of subtables.
+    row_id_cursor = initial_row_id


what is row_id_cursor? Maybe, a more informative name?

maxim-sokolov-kensho · 2026-01-08T16:50:33Z

kensho_kenverters/tables_utils.py

+
+    # Split the row ids (after column headers) into a list of sublist of row ids of subtables.
+    row_id_cursor = initial_row_id
+    non_project_row_header_row_id_list: list[int] = []


it seems that the cursor above and this list convey similar information. Can we use just one of them? I would personally go with the list, but I don't have a strong preference
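A minimal sketch of the list-only variant, where the rows of the current section are accumulated in one list and flushed whenever a new projected row header starts. The function name and parameters are hypothetical and the logic is a simplification of what the PR does.

```python
def group_rows_by_section(n_row, first_body_row, projected_rows):
    """Group body row ids into one sublist per section, without a separate cursor."""
    projected = set(projected_rows)
    groups = []
    current = []
    for row_idx in range(first_body_row, n_row):
        # A projected row header closes the previous section, if there is one.
        if row_idx in projected and current:
            groups.append(current)
            current = []
        current.append(row_idx)
    # Flush the last open section.
    if current:
        groups.append(current)
    return groups


sections = group_rows_by_section(n_row=6, first_body_row=1, projected_rows=[1, 4])
```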

Libin Liang added 4 commits December 30, 2025 16:52

add splitting of long tables with project row headers.

a432b0c

fit linting error.

58a28da

fix initial row id.

476ff2a

debug.

00662f4

libinliang866 changed the title ~~[WIP][kensho_kenverters] Add splitting of long tables with project row headers.~~ [kensho_kenverters] Add splitting of long tables with project row headers. Jan 5, 2026

libinliang866 marked this pull request as ready for review January 5, 2026 19:23

libinliang866 requested a review from valerie-fauconmorin-kensho as a code owner January 5, 2026 19:23

libinliang866 requested review from andrew-titus, maxim-sokolov-kensho and mcourtland January 5, 2026 19:23

andrew-titus reviewed Jan 5, 2026

address review feedbacks; add more comments.

e71bb6c

maxim-sokolov-kensho reviewed Jan 6, 2026

libinliang866 marked this pull request as draft January 7, 2026 03:31

Libin Liang added 4 commits January 7, 2026 12:56

refactor the table splitting codes.

b6be21d

address the doc-string.

8d515bc

mild adjustment.

a8c821f

mild adjustment.

b381389

libinliang866 marked this pull request as ready for review January 7, 2026 20:48

libinliang866 requested review from andrew-titus and maxim-sokolov-kensho January 7, 2026 20:48

adjust comments.

a4a74b6

maxim-sokolov-kensho reviewed Jan 8, 2026

libinliang866 marked this pull request as draft January 9, 2026 15:12

Libin Liang added 2 commits January 12, 2026 09:11

adjust comments.

e078722

remove the verification in get_column_headers_and_project_row_headers_row_ids.

6048758
