Commit b803711

feat: add partition_xlsx for MSFT Excel files (#594)
* first pass on partition_xlsx
* add support for files
* add test for xlsx from filename
* added filetype metadata
* add xlsx to auto
* remove fake excel from unsupported
* version and changelog
* update docs
* update readme
* fix removed file reference
* fix some more tests
* pass in metadata filename
* add include_metadata flag
1 parent 830d67f commit b803711

11 files changed

+223
-10
lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@

 ### Features

+* Add `partition_xlsx` for Microsoft Excel documents.
+
 ### Fixes

 * Supports `hml` filetype for partition as a variation of html filetype.

README.md

Lines changed: 2 additions & 2 deletions
@@ -183,7 +183,7 @@ You can run this [Colab notebook](https://colab.research.google.com/drive/1U8VCj

 The following examples show how to get started with the `unstructured` library.
 You can parse **TXT**, **HTML**, **PDF**, **EML**, **MSG**, **RTF**, **EPUB**, **DOC**, **DOCX**,
-**ODT**, **PPT**, **PPTX**, **JPG**,
+**XLSX**, **ODT**, **PPT**, **PPTX**, **JPG**,
 and **PNG** documents with one line of code!
 <br></br>
 See our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
@@ -198,7 +198,7 @@ If you are using the `partition` brick, you may need to install additional param
 instructions outlined [here](https://unstructured-io.github.io/unstructured/installing.html#filetype-detection)
 `partition` will always apply the default arguments. If you need
 advanced features, use a document-specific brick. The `partition` brick currently works for
-`.txt`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.jpg`, `.png`, `.eml`, `.msg`, `.html`, and `.pdf` documents.
+`.txt`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xlsx`, `.jpg`, `.png`, `.eml`, `.msg`, `.html`, and `.pdf` documents.

 ```python
 from unstructured.partition.auto import partition

docs/source/bricks.rst

Lines changed: 19 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -83,7 +83,7 @@ If you call the ``partition`` function, ``unstructured`` will attempt to detect
8383
file type and route it to the appropriate partitioning brick. All partitioning bricks
8484
called within ``partition`` are called using the default kwargs. Use the document-type
8585
specific bricks if you need to apply non-default settings.
86-
``partition`` currently supports ``.docx``, ``.doc``, ``.odt``, ``.pptx``, ``.ppt``, ``.eml``, ``.msg``, ``.rtf``, ``.epub``, ``.html``, ``.pdf``,
86+
``partition`` currently supports ``.docx``, ``.doc``, ``.odt``, ``.pptx``, ``.ppt``, ``.xlsx``, ``.eml``, ``.msg``, ``.rtf``, ``.epub``, ``.html``, ``.pdf``,
8787
``.png``, ``.jpg``, and ``.txt`` files.
8888
If you set the ``include_page_breaks`` kwarg to ``True``, the output will include page breaks. This is only supported for ``.pptx``, ``.html``, ``.pdf``,
8989
``.png``, and ``.jpg``.
@@ -251,6 +251,24 @@ Examples:
251251
elements = partition_doc(filename="example-docs/fake.doc")
252252
253253
254+
``partition_xlsx``
255+
------------------
256+
257+
The ``partition_xlsx`` function pre-processes Microsoft Excel documents. Each
258+
sheet in the Excel file will be stored as a ``Table`` object. The plain text
259+
of the sheet will be the ``text`` attribute of the ``Table``. The ``text_as_html``
260+
attribute in the element metadata will contain an HTML representation of the table.
261+
262+
Examples:
263+
264+
.. code:: python
265+
266+
from unstructured.partition.xlsx import partition_xlsx
267+
268+
elements = partition_xlsx(filename="example-docs/stanley-cups.xlsx")
269+
print(elements[0].metadata.text_as_html)
270+
271+
254272
``partition_odt``
255273
------------------
256274
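
The documentation hunk above describes the output shape: one ``Table`` per sheet, with plain text in ``text`` and an HTML rendering in the metadata. The following is a minimal, standard-library sketch of that shape only. It is not the library's implementation; `SketchTable`, `rows_to_html`, and `sheets_to_tables` are hypothetical names, the sheet data is passed in as plain Python lists instead of being read from an `.xlsx` file, and the HTML omits the `border` and `class` attributes seen in the tests.

```python
from dataclasses import dataclass
from html import escape
from typing import Dict, List, Optional


@dataclass
class SketchTable:
    """Hypothetical stand-in for the library's ``Table`` element."""

    text: str
    text_as_html: Optional[str] = None
    page_number: Optional[int] = None


def rows_to_html(rows: List[List[object]]) -> str:
    """Render rows as a bare HTML table, one <tr> per row, one <td> per cell."""
    lines = ["<table>", "  <tbody>"]
    for row in rows:
        lines.append("    <tr>")
        lines.extend(f"      <td>{escape(str(cell))}</td>" for cell in row)
        lines.append("    </tr>")
    lines += ["  </tbody>", "</table>"]
    return "\n".join(lines)


def sheets_to_tables(
    sheets: Dict[str, List[List[object]]],
    include_metadata: bool = True,
) -> List[SketchTable]:
    """Build one table element per sheet; sheets are numbered from 1."""
    tables = []
    for page_number, rows in enumerate(sheets.values(), start=1):
        # Plain text is every cell joined by single spaces, row by row.
        text = " ".join(str(cell) for row in rows for cell in row)
        if include_metadata:
            tables.append(SketchTable(text, rows_to_html(rows), page_number))
        else:
            tables.append(SketchTable(text))
    return tables


# Illustrative data mirroring the first sheet used in the tests.
sheets = {
    "Stanley Cups": [
        ["Team", "Location", "Stanley Cups"],
        ["Blues", "STL", 1],
        ["Flyers", "PHI", 2],
        ["Maple Leafs", "TOR", 13],
    ],
}
tables = sheets_to_tables(sheets)
```

Calling `sheets_to_tables(sheets, include_metadata=False)` leaves `text_as_html` and `page_number` as `None`, which mirrors what the `include_metadata` test in this commit asserts.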

example-docs/stanley-cups.xlsx

6.19 KB
Binary file not shown.
-4.65 KB
Binary file not shown.

test_unstructured/file_utils/test_filetype.py

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -29,7 +29,7 @@
2929
("unsupported/factbook.xml", FileType.XML),
3030
("example-10k.html", FileType.HTML),
3131
("fake-html.html", FileType.HTML),
32-
("unsupported/fake-excel.xlsx", FileType.XLSX),
32+
("stanley-cups.xlsx", FileType.XLSX),
3333
("fake-power-point.pptx", FileType.PPTX),
3434
("winter-sports.epub", FileType.EPUB),
3535
("spring-weather.html.json", FileType.JSON),
@@ -52,7 +52,7 @@ def test_detect_filetype_from_filename(file, expected):
5252
("unsupported/factbook.xml", FileType.XML),
5353
("example-10k.html", FileType.HTML),
5454
("fake-html.html", FileType.HTML),
55-
("unsupported/fake-excel.xlsx", FileType.XLSX),
55+
("stanley-cups.xlsx", FileType.XLSX),
5656
("fake-power-point.pptx", FileType.PPTX),
5757
("winter-sports.epub", FileType.EPUB),
5858
("fake-doc.rtf", FileType.RTF),
@@ -87,7 +87,7 @@ def test_detect_filetype_from_filename_with_extension(monkeypatch, file, expecte
8787
# */xml and some return */html. Either could be acceptable depending on the OS
8888
("example-10k.html", [FileType.HTML, FileType.XML]),
8989
("fake-html.html", FileType.HTML),
90-
("unsupported/fake-excel.xlsx", FileType.XLSX),
90+
("stanley-cups.xlsx", FileType.XLSX),
9191
("fake-power-point.pptx", FileType.PPTX),
9292
("winter-sports.epub", FileType.EPUB),
9393
],
@@ -192,15 +192,15 @@ def test_detect_xls_file_from_mime_type(monkeypatch):
192192

193193
def test_detect_xlsx_filetype_application_octet_stream(monkeypatch):
194194
monkeypatch.setattr(magic, "from_buffer", lambda *args, **kwargs: "application/octet-stream")
195-
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "unsupported", "fake-excel.xlsx")
195+
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "stanley-cups.xlsx")
196196
with open(filename, "rb") as f:
197197
filetype = detect_filetype(file=f)
198198
assert filetype == FileType.XLSX
199199

200200

201201
def test_detect_xlsx_filetype_application_octet_stream_with_filename(monkeypatch):
202202
monkeypatch.setattr(magic, "from_file", lambda *args, **kwargs: "application/octet-stream")
203-
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "unsupported", "fake-excel.xlsx")
203+
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "stanley-cups.xlsx")
204204
filetype = detect_filetype(filename=filename)
205205
assert filetype == FileType.XLSX
206206

@@ -246,7 +246,7 @@ def test_detect_docx_filetype_word_mime_type(monkeypatch):
246246

247247
def test_detect_xlsx_filetype_word_mime_type(monkeypatch):
248248
monkeypatch.setattr(magic, "from_file", lambda *args, **kwargs: XLSX_MIME_TYPES[0])
249-
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "unsupported", "fake-excel.xlsx")
249+
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "stanley-cups.xlsx")
250250
with open(filename, "rb") as f:
251251
filetype = detect_filetype(file=f)
252252
assert filetype == FileType.XLSX

test_unstructured/partition/test_auto.py

Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,12 +9,14 @@
99
import pypandoc
1010
import pytest
1111

12+
from unstructured.cleaners.core import clean_extra_whitespace
1213
from unstructured.documents.elements import (
1314
Address,
1415
ElementMetadata,
1516
ListItem,
1617
NarrativeText,
1718
PageBreak,
19+
Table,
1820
Text,
1921
Title,
2022
)
@@ -609,3 +611,59 @@ def test_file_specific_produces_correct_filetype(filetype: FileType):
609611
elements = fun(str(file))
610612
assert all(el.metadata.filetype == FILETYPE_TO_MIMETYPE[filetype] for el in elements)
611613
break
614+
615+
616+
EXPECTED_XLSX_TABLE = """<table border="1" class="dataframe">
617+
<tbody>
618+
<tr>
619+
<td>Team</td>
620+
<td>Location</td>
621+
<td>Stanley Cups</td>
622+
</tr>
623+
<tr>
624+
<td>Blues</td>
625+
<td>STL</td>
626+
<td>1</td>
627+
</tr>
628+
<tr>
629+
<td>Flyers</td>
630+
<td>PHI</td>
631+
<td>2</td>
632+
</tr>
633+
<tr>
634+
<td>Maple Leafs</td>
635+
<td>TOR</td>
636+
<td>13</td>
637+
</tr>
638+
</tbody>
639+
</table>"""
640+
641+
642+
EXPECTED_XLSX_TEXT = "Team Location Stanley Cups Blues STL 1 Flyers PHI 2 Maple Leafs TOR 13"
643+
644+
EXPECTED_XLSX_FILETYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
645+
646+
647+
def test_auto_partition_xlsx_from_filename(filename="example-docs/stanley-cups.xlsx"):
648+
elements = partition(filename=filename)
649+
650+
assert all(isinstance(element, Table) for element in elements)
651+
assert len(elements) == 2
652+
653+
assert clean_extra_whitespace(elements[0].text) == EXPECTED_XLSX_TEXT
654+
assert elements[0].metadata.text_as_html == EXPECTED_XLSX_TABLE
655+
assert elements[0].metadata.page_number == 1
656+
assert elements[0].metadata.filetype == EXPECTED_XLSX_FILETYPE
657+
658+
659+
def test_auto_partition_xlsx_from_file(filename="example-docs/stanley-cups.xlsx"):
660+
with open(filename, "rb") as f:
661+
elements = partition(file=f)
662+
663+
assert all(isinstance(element, Table) for element in elements)
664+
assert len(elements) == 2
665+
666+
assert clean_extra_whitespace(elements[0].text) == EXPECTED_XLSX_TEXT
667+
assert elements[0].metadata.text_as_html == EXPECTED_XLSX_TABLE
668+
assert elements[0].metadata.page_number == 1
669+
assert elements[0].metadata.filetype == EXPECTED_XLSX_FILETYPE
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
+from unstructured.cleaners.core import clean_extra_whitespace
+from unstructured.documents.elements import Table
+from unstructured.partition.xlsx import partition_xlsx
+
+EXPECTED_TABLE = """<table border="1" class="dataframe">
+  <tbody>
+    <tr>
+      <td>Team</td>
+      <td>Location</td>
+      <td>Stanley Cups</td>
+    </tr>
+    <tr>
+      <td>Blues</td>
+      <td>STL</td>
+      <td>1</td>
+    </tr>
+    <tr>
+      <td>Flyers</td>
+      <td>PHI</td>
+      <td>2</td>
+    </tr>
+    <tr>
+      <td>Maple Leafs</td>
+      <td>TOR</td>
+      <td>13</td>
+    </tr>
+  </tbody>
+</table>"""
+
+
+EXPECTED_TEXT = "Team Location Stanley Cups Blues STL 1 Flyers PHI 2 Maple Leafs TOR 13"
+
+EXPECTED_FILETYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
+
+
+def test_partition_xlsx_from_filename(filename="example-docs/stanley-cups.xlsx"):
+    elements = partition_xlsx(filename=filename)
+
+    assert all(isinstance(element, Table) for element in elements)
+    assert len(elements) == 2
+
+    assert clean_extra_whitespace(elements[0].text) == EXPECTED_TEXT
+    assert elements[0].metadata.text_as_html == EXPECTED_TABLE
+    assert elements[0].metadata.page_number == 1
+    assert elements[0].metadata.filetype == EXPECTED_FILETYPE
+
+
+def test_partition_xlsx_from_file(filename="example-docs/stanley-cups.xlsx"):
+    with open(filename, "rb") as f:
+        elements = partition_xlsx(file=f)
+
+    assert all(isinstance(element, Table) for element in elements)
+    assert len(elements) == 2
+
+    assert clean_extra_whitespace(elements[0].text) == EXPECTED_TEXT
+    assert elements[0].metadata.text_as_html == EXPECTED_TABLE
+    assert elements[0].metadata.page_number == 1
+    assert elements[0].metadata.filetype == EXPECTED_FILETYPE
+
+
+def test_partition_xlsx_can_exclude_metadata(filename="example-docs/stanley-cups.xlsx"):
+    elements = partition_xlsx(filename=filename, include_metadata=False)
+
+    assert all(isinstance(element, Table) for element in elements)
+    assert len(elements) == 2
+
+    assert clean_extra_whitespace(elements[0].text) == EXPECTED_TEXT
+    assert elements[0].metadata.text_as_html is None
+    assert elements[0].metadata.page_number is None
+    assert elements[0].metadata.filetype is None

unstructured/file_utils/filetype.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -77,7 +77,6 @@
7777
]
7878

7979
EXPECTED_XLSX_FILES = [
80-
"docProps/core.xml",
8180
"xl/workbook.xml",
8281
]
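
The hunk above leaves `xl/workbook.xml` as the only archive member expected in an XLSX file. An `.xlsx` file is a ZIP archive, so one way such a list could be used is to compare it against the archive's member names. The sketch below is an assumption about how that check might look, not the code in `filetype.py`; `looks_like_xlsx` is a hypothetical helper and the sample archives are built in memory for illustration.

```python
import io
import zipfile

# Mirrors the list as it stands after this commit.
EXPECTED_XLSX_FILES = ["xl/workbook.xml"]


def looks_like_xlsx(file_obj) -> bool:
    """Return True if file_obj is a ZIP archive containing every expected member."""
    if not zipfile.is_zipfile(file_obj):
        return False
    file_obj.seek(0)
    with zipfile.ZipFile(file_obj) as archive:
        names = set(archive.namelist())
    return all(expected in names for expected in EXPECTED_XLSX_FILES)


def make_archive(member_names):
    """Build an in-memory ZIP archive with empty placeholder members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in member_names:
            archive.writestr(name, "")
    buffer.seek(0)
    return buffer


# A workbook-like archive without docProps/core.xml still passes the check.
workbook_like = make_archive(["xl/workbook.xml", "xl/worksheets/sheet1.xml"])
document_like = make_archive(["word/document.xml"])
not_a_zip = io.BytesIO(b"plain bytes, not an archive")
```

With the shorter list, an archive that lacks `docProps/core.xml` is still recognized, which is presumably the reason for the removal.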
8382

unstructured/partition/auto.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@
2626
from unstructured.partition.pptx import partition_pptx
2727
from unstructured.partition.rtf import partition_rtf
2828
from unstructured.partition.text import partition_text
29+
from unstructured.partition.xlsx import partition_xlsx
2930

3031

3132
def partition(
@@ -183,6 +184,8 @@ def partition(
183184
)
184185
elif filetype == FileType.JSON:
185186
elements = partition_json(filename=filename, file=file)
187+
elif filetype == FileType.XLSX:
188+
elements = partition_xlsx(filename=filename, file=file)
186189
else:
187190
msg = "Invalid file" if not filename else f"Invalid file {filename}"
188191
raise ValueError(f"{msg}. The {filetype} file type is not supported in partition.")
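
The routing added above follows a simple pattern: branch on the detected file type, call the matching brick with `filename` and `file`, and raise `ValueError` for anything unrecognized. A self-contained sketch of that pattern follows. The `FileType` enum here is a reduced stand-in, and the `*_stub` functions and `route` are hypothetical placeholders for the real bricks and for `partition`, which detects the file type itself.

```python
from enum import Enum


class FileType(Enum):
    """Reduced stand-in for the library's FileType enum."""

    JSON = "json"
    XLSX = "xlsx"
    UNK = "unknown"


def partition_json_stub(filename=None, file=None):
    """Hypothetical placeholder for the JSON brick."""
    return ["json elements"]


def partition_xlsx_stub(filename=None, file=None):
    """Hypothetical placeholder for the XLSX brick."""
    return ["xlsx elements"]


def route(filetype, filename=None, file=None):
    """Dispatch to a brick by file type, mirroring the if/elif chain in the hunk."""
    if filetype == FileType.JSON:
        elements = partition_json_stub(filename=filename, file=file)
    elif filetype == FileType.XLSX:
        elements = partition_xlsx_stub(filename=filename, file=file)
    else:
        msg = "Invalid file" if not filename else f"Invalid file {filename}"
        raise ValueError(f"{msg}. The {filetype} file type is not supported in partition.")
    return elements


xlsx_elements = route(FileType.XLSX, filename="example-docs/stanley-cups.xlsx")
```

Because the bricks are called with only `filename` and `file`, brick-specific options such as `include_metadata` keep their defaults when going through this path, consistent with the documentation's note that `partition` uses default kwargs.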
