Commit c85f29e

fix(xlsx): XLSX emits std minified .text_as_html (#3558)
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_xlsx()`. Produce minified `.text_as_html` consistent with that formed by chunking.

**Additional Context**
- XLSX `.text_as_html` is minified (no extra whitespace or `thead`, `tbody`, `tfoot` elements).
- `table.text` is the clean-concatenated-text (CCT) of the table.

---------

Co-authored-by: scanny <[email protected]>
1 parent b092d45 commit c85f29e
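
To illustrate the kind of minification the commit message describes, here is a minimal sketch that uses only the Python standard library. It is not the code from this commit: `minify_table_html` and `_RowCollector` are hypothetical names, and the sketch assumes a single, non-nested table whose cells contain only text. It drops attributes, inter-tag whitespace, and `thead`/`tbody`/`tfoot` wrappers, and writes empty cells as `<td/>`, matching the shape of the updated expected values in `test_constants.py`. As a simplification, `<th>` cells are emitted as `<td>`.

```python
from html import escape
from html.parser import HTMLParser


class _RowCollector(HTMLParser):
    """Hypothetical helper: gather cell text row by row, ignoring wrappers and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # one list of cell strings per <tr>
        self._cell = None   # text fragments of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:  # tolerate a cell that appears without an enclosing <tr>
                self.rows.append([])
            # Collapse internal runs of whitespace and trim the ends.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def minify_table_html(html_text: str) -> str:
    """Return a whitespace-free <table> containing only <tr> and <td> elements."""
    collector = _RowCollector()
    collector.feed(html_text)
    collector.close()
    rows = []
    for row in collector.rows:
        cells = "".join(
            f"<td>{escape(text, quote=False)}</td>" if text else "<td/>" for text in row
        )
        rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(rows) + "</table>"


if __name__ == "__main__":
    bloated = (
        '<table border="1" class="dataframe">\n  <tbody>\n    <tr>\n'
        "      <td>Blues</td>\n      <td>STL</td>\n      <td></td>\n"
        "    </tr>\n  </tbody>\n</table>"
    )
    print(minify_table_html(bloated))
```

Rebuilding the markup from collected cell text, rather than editing the original string, keeps the sketch short and guarantees that no attribute or wrapper element survives. The cost is that any inline markup inside a cell is reduced to its text.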

11 files changed: +182 -199 lines changed

CHANGELOG.md

Lines changed: 4 additions & 1 deletion
@@ -1,4 +1,4 @@
-## 0.16.1-dev1
+## 0.16.1-dev2

 ### Enhancements

@@ -8,6 +8,9 @@

 * **Remove unsupported chipper model**
 * **Rewrite of `partition.email` module and tests.** Use modern Python stdlib `email` module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
+* **Minify text_as_html from DOCX.** Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `tabulate` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
+* **Fall back to filename extension-based file-type detection for unidentified OLE files.** Resolves a problem where a DOC file that could not be detected as such by `filetype` was incorrectly identified as a MSG file.
+* **Minify text_as_html from XLSX.** Previously `.metadata.text_as_html` for XLSX tables was "bloated" with whitespace and noise elements introduced by `pandas` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.

 ## 0.16.0

example-docs/empty.xlsx

8 KB
Binary file not shown.

test_unstructured/chunking/test_base.py

Lines changed: 1 addition & 1 deletion
@@ -1060,7 +1060,7 @@ def it_knows_the_concatenated_text_of_the_pre_chunk_to_help(
 class Describe_TableSplitter:
     """Unit-test suite for `unstructured.chunking.base._TableSplitter`."""

-    def it_splits_an_HTML_table_on_even_rows_when_possible(self):
+    def it_splits_an_HTML_table_on_whole_row_boundaries_when_possible(self):
         opts = ChunkingOptions(max_characters=(150))
         html_table = HtmlTable.from_html_text(
             """

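The renamed test refers to splitting a table on whole-row boundaries. The body of that test is truncated in this view, so the following is only a rough sketch of the general idea, not `_TableSplitter` itself: `split_table_on_rows` is a hypothetical name, the character budget is measured on the row HTML alone (an assumption made for simplicity), and a row that exceeds the budget by itself is kept whole rather than split further.

```python
import re


def split_table_on_rows(table_html: str, max_characters: int) -> list:
    """Greedily pack whole <tr> elements into chunks of at most `max_characters`.

    Hypothetical illustration: the budget counts only the characters of the
    <tr>...</tr> markup, not the enclosing <table> tags.
    """
    rows = re.findall(r"<tr>.*?</tr>", table_html, flags=re.DOTALL)
    chunks = []
    current = ""
    for row in rows:
        # Start a new chunk when adding this row would exceed the budget.
        if current and len(current) + len(row) > max_characters:
            chunks.append(current)
            current = ""
        current += row
    if current:
        chunks.append(current)
    # Re-wrap each group of rows so every chunk is a well-formed table.
    return [f"<table>{chunk}</table>" for chunk in chunks]


if __name__ == "__main__":
    table = (
        "<table>"
        "<tr><td>Team</td><td>Location</td><td>Stanley Cups</td></tr>"
        "<tr><td>Blues</td><td>STL</td><td>1</td></tr>"
        "<tr><td>Flyers</td><td>PHI</td><td>2</td></tr>"
        "<tr><td>Maple Leafs</td><td>TOR</td><td>13</td></tr>"
        "</table>"
    )
    for chunk in split_table_on_rows(table, max_characters=120):
        print(chunk)
```

Because the minified form has no whitespace between rows, each row's length is exactly the length of its markup, which keeps this kind of budget arithmetic simple.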
test_unstructured/partition/test_auto.py

Lines changed: 2 additions & 11 deletions
@@ -794,19 +794,10 @@ def test_auto_partition_xls_from_filename():
         example_doc_path("tests-example.xls"), include_header=False, skip_infer_table_types=[]
     )

-    assert sum(isinstance(element, Table) for element in elements) == 2
     assert len(elements) == 14
-
-    assert clean_extra_whitespace(elements[0].text)[:45] == (
-        "MC What is 2+2? 4 correct 3 incorrect MA What"
-    )
-    # NOTE(crag): if the beautifulsoup4 package is installed, some (but not all) additional
-    # whitespace is removed, so the expected text length is less than is the case when
-    # beautifulsoup4 is *not* installed. E.g.
-    # "\n\n\nMA\nWhat C datatypes are 8 bits"
-    # vs. '\n \n \n MA\n What C datatypes are 8 bits?... "
-    assert len(elements[0].text) == 550
+    assert sum(isinstance(e, Table) for e in elements) == 2
     assert elements[0].metadata.text_as_html == EXPECTED_XLS_TABLE
+    assert len(elements[0].text) == 507


 # ================================================================================================

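The hunk above replaces a whitespace-dependent length check (550) with a check against clean-concatenated text (507). One plausible reading of "clean-concatenated text", consistent with the 45-character prefix asserted in the removed lines, is the non-empty cell texts joined by single spaces. The sketch below shows that reading using only the standard library; `table_text` and `_CellTextCollector` are hypothetical names and this is not the library's implementation.

```python
from html.parser import HTMLParser


class _CellTextCollector(HTMLParser):
    """Hypothetical helper: collect the normalized text of each non-empty table cell."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []    # normalized text of every non-empty cell, in document order
        self._buf = None   # fragments of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        # Self-closing cells such as <td/> also arrive here, followed by handle_endtag.
        if tag in ("td", "th"):
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._buf is not None:
            text = " ".join("".join(self._buf).split())
            if text:  # empty cells contribute nothing to the concatenated text
                self.cells.append(text)
            self._buf = None


def table_text(table_html: str) -> str:
    """Join the non-empty cell texts of an HTML table with single spaces."""
    collector = _CellTextCollector()
    collector.feed(table_html)
    collector.close()
    return " ".join(collector.cells)


if __name__ == "__main__":
    first_row = (
        "<table><tr><td>MC</td><td>What is 2+2?</td><td>4</td><td>correct</td>"
        "<td>3</td><td>incorrect</td><td/><td/><td/></tr></table>"
    )
    print(table_text(first_row))
```

With text defined this way, the result no longer depends on how the HTML was indented or on which optional parsing packages are installed, which is the problem the removed `NOTE(crag)` comment describes.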
test_unstructured/partition/test_constants.py

Lines changed: 80 additions & 106 deletions
@@ -28,30 +28,14 @@
 </tbody>
 </table>"""

-EXPECTED_TABLE_XLSX = """<table border="1" class="dataframe">
-<tbody>
-<tr>
-<td>Team</td>
-<td>Location</td>
-<td>Stanley Cups</td>
-</tr>
-<tr>
-<td>Blues</td>
-<td>STL</td>
-<td>1</td>
-</tr>
-<tr>
-<td>Flyers</td>
-<td>PHI</td>
-<td>2</td>
-</tr>
-<tr>
-<td>Maple Leafs</td>
-<td>TOR</td>
-<td>13</td>
-</tr>
-</tbody>
-</table>"""
+EXPECTED_TABLE_XLSX = (
+    "<table>"
+    "<tr><td>Team</td><td>Location</td><td>Stanley Cups</td></tr>"
+    "<tr><td>Blues</td><td>STL</td><td>1</td></tr>"
+    "<tr><td>Flyers</td><td>PHI</td><td>2</td></tr>"
+    "<tr><td>Maple Leafs</td><td>TOR</td><td>13</td></tr>"
+    "</table>"
+)

 EXPECTED_TITLE = "Stanley Cups"

@@ -139,86 +123,76 @@
 </table>"""

 EXPECTED_XLS_TABLE = (
-"""<table border="1" class="dataframe">
-<tbody>
-<tr>
-<td>MC</td>
-<td>What is 2+2?</td>
-<td>4</td>
-<td>correct</td>
-<td>3</td>
-<td>incorrect</td>
-<td></td>
-<td></td>
-<td></td>
-</tr>
-<tr>
-<td>MA</td>
-<td>What C datatypes are 8 bits? (assume i386)</td>
-<td>int</td>
-<td></td>
-<td>float</td>
-<td></td>
-<td>double</td>
-<td></td>
-<td>char</td>
-</tr>
-<tr>
-<td>TF</td>
-<td>Bagpipes are awesome.</td>
-<td>true</td>
-<td></td>
-<td></td>
-<td></td>
-<td></td>
-<td></td>
-<td></td>
-</tr>
-<tr>
-<td>ESS</td>
-<td>How have the original Henry Hornbostel buildings """
-"""influenced campus architecture and design in the last 30 years?</td>
-<td></td>
-<td></td>
-<td></td>
-<td></td>
-<td></td>
-<td></td>
-<td></td>
-</tr>
-<tr>
-<td>ORD</td>
-<td>Rank the following in their order of operation.</td>
-<td>Parentheses</td>
-<td>Exponents</td>
-<td>Division</td>
-<td>Addition</td>
-<td></td>
-<td></td>
-<td></td>
-</tr>
-<tr>
-<td>FIB</td>
-<td>The student activities fee is</td>
-<td>95</td>
-<td>dollars for students enrolled in</td>
-<td>19</td>
-<td>units or more,</td>
-<td></td>
-<td></td>
-<td></td>
-</tr>
-<tr>
-<td>MAT</td>
-<td>Match the lower-case greek letter with its capital form.</td>
-<td>λ</td>
-<td>Λ</td>
-<td>α</td>
-<td>γ</td>
-<td>Γ</td>
-<td>φ</td>
-<td>Φ</td>
-</tr>
-</tbody>
-</table>"""
+    "<table><tr>"
+    "<td>MC</td>"
+    "<td>What is 2+2?</td>"
+    "<td>4</td>"
+    "<td>correct</td>"
+    "<td>3</td>"
+    "<td>incorrect</td>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "</tr><tr>"  # -----
+    "<td>MA</td>"
+    "<td>What C datatypes are 8 bits? (assume i386)</td>"
+    "<td>int</td>"
+    "<td/>"
+    "<td>float</td>"
+    "<td/>"
+    "<td>double</td>"
+    "<td/>"
+    "<td>char</td>"
+    "</tr><tr>"  # -----
+    "<td>TF</td>"
+    "<td>Bagpipes are awesome.</td>"
+    "<td>true</td>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "</tr><tr>"  # -----
+    "<td>ESS</td>"
+    "<td>How have the original Henry Hornbostel buildings influenced campus architecture and"
+    " design in the last 30 years?</td>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "</tr><tr>"  # -----
+    "<td>ORD</td>"
+    "<td>Rank the following in their order of operation.</td>"
+    "<td>Parentheses</td>"
+    "<td>Exponents</td>"
+    "<td>Division</td>"
+    "<td>Addition</td>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "</tr><tr>"  # -----
+    "<td>FIB</td>"
+    "<td>The student activities fee is</td>"
+    "<td>95</td>"
+    "<td>dollars for students enrolled in</td>"
+    "<td>19</td>"
+    "<td>units or more,</td>"
+    "<td/>"
+    "<td/>"
+    "<td/>"
+    "</tr><tr>"  # -----
+    "<td>MAT</td>"
+    "<td>Match the lower-case greek letter with its capital form.</td>"
+    "<td>λ</td>"
+    "<td>Λ</td>"
+    "<td>α</td>"
+    "<td>γ</td>"
+    "<td>Γ</td>"
+    "<td>φ</td>"
+    "<td>Φ</td>"
+    "</tr></table>"
 )
