[kensho_kenverters] Add splitting of long tables with project row headers.#35
[kensho_kenverters] Add splitting of long tables with project row headers.#35libinliang866 wants to merge 12 commits intomainfrom
Conversation
andrew-titus
left a comment
There was a problem hiding this comment.
A few comments -- could you also flesh out the PR description with some more detail on what this is for? Use cases, etc. Thanks!
| max_column_header_row_id = None | ||
| for row_idx in range(n_row): | ||
| if row_idx in column_header_rows: | ||
| max_column_header_row_id = row_idx | ||
| else: | ||
| break |
There was a problem hiding this comment.
It doesn't look like we ever actually use column_header_rows except to find this max -- just find the max during the above loop instead? Otherwise this just wastes some memory and a bit of latency, especially for bigger tables
| max_column_header_row_id = row_idx | ||
| else: | ||
| break | ||
| project_row_headers_rows.sort() |
There was a problem hiding this comment.
During the first loop, we are just appending to this list and checking for membership (O(N) for a list), just to ultimately sort it here at the end and then return it -- let's just use a set instead for faster membership checks, and then just return the sorted set?
| subtables_row_id_list.append(list(range(row_id_cursor, row_idx))) | ||
| row_id_cursor = row_idx | ||
| non_project_row_header_row_id_list = [] | ||
| elif row_idx not in project_row_headers_rows: |
There was a problem hiding this comment.
This can just be an else -- it's mutually exclusive with the above elif check
| elif row_idx not in project_row_headers_rows: | |
| else: |
| return tables_grid_and_structure | ||
|
|
||
|
|
||
| def _get_max_col_header_row_and_project_header_rows( |
There was a problem hiding this comment.
A lot of the functions in this file have pretty complex logic -- could you add a few more comments on what's happening in each substep (e.g., each for-loop) to aid the reader?
There was a problem hiding this comment.
Hi Drew! Thanks for your comments!
| split_long_tables: bool = False, | ||
| ) -> list[pd.DataFrame]: | ||
| """Extract Extract output's tables and convert them to a list of pandas DataFrames. | ||
| """Extract output's tables and convert them to a list of pandas DataFrames. |
There was a problem hiding this comment.
Nit: wording is a bit clunky here
| """Extract output's tables and convert them to a list of pandas DataFrames. | |
| """Extract tables from output and convert them to a list of pandas DataFrames. |
maxim-sokolov-kensho
left a comment
There was a problem hiding this comment.
A general comment: this code change is not easy to comprehend. I wonder how much of this is incidental complexity. It would be nice to step back and figure out a more composable way of achieving the end result. Maybe, we need a mild refactor of the package, but I would try to simplify this code. A potential benefit is that it might simplify future extensions of the package functionality.
| project_row_headers: list[str] | ||
| table_uid: str | ||
| subtable_id: int | None = None |
There was a problem hiding this comment.
I have hard time understanding these fields in the Table data structure --- we need to update the doc-string explaining what these fields are. The current description is not informative.
| """Convert list of table annotations to list of cells. | ||
| table_string_grid: TableStringGridType, | ||
| ) -> tuple[list[Cell], list[str]]: | ||
| """Convert list of table annotations to list of cells and return the project row header of the table. |
There was a problem hiding this comment.
project row header -> projected row headers
| def _get_max_col_header_row_and_project_header_rows( | ||
| annotations: list[AnnotationModel], n_row: int | ||
| ) -> tuple[int | None, list[int]]: | ||
| """Get the maximum consecutive column header row starting from the initial and project header rows.""" # noqa: E501 |
There was a problem hiding this comment.
From reading the doc-string, I couldn't figure out what this function is doing.
First, I would expand the doc-string. Second, potentially, we might want to refactor this function into two functions: it seems that we do two "things" here. Can we do them independently or compose in some way?
Hi Max! Thanks for your comments! |
| """Converted table types consisting of the table as a pandas DataFrame and its location(s).""" | ||
| """Converted table types consisting of the table as a pandas DataFrame and its location(s). | ||
|
|
||
| Note: |
There was a problem hiding this comment.
This is helpful. Can you also add a description for projected_row_headers?
There was a problem hiding this comment.
Hi Max! Yes, will do!
| if table_grid_and_structure.table_category_type in ( | ||
| ContentCategory.TABLE.value, | ||
| ContentCategory.TABLE_OF_CONTENTS.value, | ||
| ) or ( | ||
| include_figure_extracted_table | ||
| and table_grid_structure.table_category_type | ||
| and table_grid_and_structure.table_category_type | ||
| == ContentCategory.FIGURE_EXTRACTED_TABLE.value | ||
| ): | ||
| table_df = convert_table_to_pd_df( | ||
| table_grid_structure.table_string_grid, | ||
| use_first_row_as_header=use_first_row_as_header, | ||
| if split_long_tables: | ||
| column_header_row_ids, project_row_headers_row_ids = ( | ||
| get_column_headers_and_project_row_headers_row_ids( | ||
| table_grid_and_structure.table_structure_annotations | ||
| ) | ||
| ) | ||
| if table_can_be_split( | ||
| column_header_row_ids, project_row_headers_row_ids | ||
| ): | ||
| subtable_string_grid_list, _ = split_table_into_subtables( | ||
| table_grid_and_structure, | ||
| column_header_row_ids, | ||
| project_row_headers_row_ids, | ||
| ) | ||
| for subtable_string_grid in subtable_string_grid_list: | ||
| table_dfs.append( | ||
| convert_table_to_pd_df( | ||
| subtable_string_grid, | ||
| use_first_row_as_header=use_first_row_as_header, | ||
| ) | ||
| ) | ||
| continue |
There was a problem hiding this comment.
the body of this loop is quite long, and many things happen for each step --- can we extract it as a separate function?
There was a problem hiding this comment.
Make sense! Will do!
| project_row_headers = extract_project_row_headers( | ||
| table_grid_and_structure.table_structure_annotations, | ||
| table_grid_and_structure.table_string_grid, | ||
| ) |
There was a problem hiding this comment.
similar comment to the above for this piece of code
There was a problem hiding this comment.
Make sense! Will do!
kensho_kenverters/tables_utils.py
Outdated
| ): | ||
| if ( | ||
| annotation.data.is_column_header | ||
| and column_row_id not in column_header_row_ids |
There was a problem hiding this comment.
since we are adding to a set, I don't think we need this check
kensho_kenverters/tables_utils.py
Outdated
| column_header_row_ids.add(column_row_id) | ||
| elif ( | ||
| annotation.data.is_projected_row_header | ||
| and column_row_id not in project_row_headers_row_ids |
There was a problem hiding this comment.
since we are adding to a set, I don't think we need this check
| subtable_row_ids_list = [] | ||
|
|
||
| # Split the row ids (after column headers) into a list of sublist of row ids of subtables. | ||
| row_id_cursor = initial_row_id |
There was a problem hiding this comment.
what is row_id_cursor? Maybe, a more informative name?
|
|
||
| # Split the row ids (after column headers) into a list of sublist of row ids of subtables. | ||
| row_id_cursor = initial_row_id | ||
| non_project_row_header_row_id_list: list[int] = [] |
There was a problem hiding this comment.
it seems that the cursor above and this list convey a similar information. Can we use just one of them? I would personally go with the list, but I don't have a strong preference
Introduction
It is common that long tables have several sections and there are projected row headers for each section. An example table is as below:

Here Assets, Liabilities and Equity are projected row headers for each section. Currently, the table structure recognition model in Extract can detect the structure of the tables and the projected row headers structure in the table.
The objective of this PR is to split the long table into different sections and link the subtables with the projected row header. At the same time, we need to copy the column headers of the original long table to each subtable. For example, for extracted dataframe above, we hope to split it into dataframe as below:
Revenues:

Expense:

Design Doc
The link of the design doc can be found in the link.