Commit a861ed8
authored
feat(chunk): split tables on even row boundaries (#3504)
**Summary**
Use more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.
**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.1 parent 99f72d6 commit a861ed8
File tree
21 files changed
+998
-135
lines changed- test_unstructured_ingest
- dest
- expected-structured-output
- box/nested-1
- nested-2
- confluence-diff/MFS
- dropbox
- nested-1
- nested-2
- gcs
- nested-1/nested
- nested-2/nested
- metrics/text-extraction
- test_unstructured
- chunking
- common
- partition/html
- unstructured
- chunking
- common
- partition
21 files changed
+998
-135
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | | - | |
| 1 | + | |
2 | 2 | | |
3 | 3 | | |
4 | 4 | | |
| |||
7 | 7 | | |
8 | 8 | | |
9 | 9 | | |
| 10 | + | |
10 | 11 | | |
11 | 12 | | |
12 | 13 | | |
| |||
0 commit comments