Commit 6036af3

feat: add partition_doc for .doc files (#236)
* first pass on doc partitioning
* add libreoffice to deps
* update docs and readme
* add .doc to auto
* changelog bump
* value error with missing doc
* doc updates
1 parent 9bbd4a1 commit 6036af3

13 files changed: 238 additions and 8 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -98,7 +98,7 @@ jobs:
 source .venv/bin/activate
 make install-nltk-models
 make install-detectron2
-sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr
+sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
 make test
 make check-coverage
 make install-ingest-s3

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.4.11-dev0
+
+* Adds `partition_doc` for partitioning Word documents in `.doc` format. Requires `libreoffice`.
+
 ## 0.4.10

 * Fixes `ElementMetadata` so that it's JSON serializable when the filename is a `Path` object.

README.md

Lines changed: 2 additions & 2 deletions
@@ -78,7 +78,7 @@ To install the library, run `pip install unstructured`.
 You can run this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the examples below.

 The following examples show how to get started with the `unstructured` library.
-You can parse **TXT**, **HTML**, **PDF**, **EML** and **DOCX** documents with one line of code!
+You can parse **TXT**, **HTML**, **PDF**, **EML**, **DOC** and **DOCX** documents with one line of code!
 <br></br>
 See our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
 of the features in the library.
@@ -92,7 +92,7 @@ If you are using the `partition` brick, you may need to install additional param
 instructions outlined [here](https://unstructured-io.github.io/unstructured/installing.html#filetype-detection)
 `partition` will always apply the default arguments. If you need
 advanced features, use a document-specific brick. The `partition` brick currently works for
-`.txt`, `.docx`, `.pptx`, `.jpg`, `.png`, `.eml`, `.html`, and `.pdf` documents.
+`.txt`, `.doc`, `.docx`, `.pptx`, `.jpg`, `.png`, `.eml`, `.html`, and `.pdf` documents.

 ```python
 from unstructured.partition.auto import partition

docs/source/bricks.rst

Lines changed: 23 additions & 1 deletion
@@ -22,7 +22,7 @@ If you call the ``partition`` function, ``unstructured`` will attempt to detect
 file type and route it to the appropriate partitioning brick. All partitioning bricks
 called within ``partition`` are called using the default kwargs. Use the document-type
 specific bricks if you need to apply non-default settings.
-``partition`` currently supports ``.docx``, ``.pptx``, ``.eml``, ``.html``, ``.pdf``,
+``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.eml``, ``.html``, ``.pdf``,
 ``.png``, ``.jpg``, and ``.txt`` files.
 If you set the ``include_page_breaks`` kwarg to ``True``, the output will include page breaks. This is only supported for ``.pptx``, ``.html``, ``.pdf``,
 ``.png``, and ``.jpg``.
@@ -81,6 +81,28 @@ Examples:
     with open("mydoc.docx", "rb") as f:
         elements = partition_docx(file=f)

+
+``partition_doc``
+------------------
+
+The ``partition_doc`` partitioning brick pre-processes Microsoft Word documents
+saved in the ``.doc`` format. This brick uses a combination of the styling
+information in the document and the structure of the text to determine the type
+of a text element. ``partition_doc`` can take a filename or file-like object
+as input; the example below uses a filename. ``partition_doc``
+uses ``libreoffice`` to convert the file to ``.docx`` and then
+calls ``partition_docx``. Ensure you have ``libreoffice`` installed
+before using ``partition_doc``.
+
+Examples:
+
+.. code:: python
+
+  from unstructured.partition.doc import partition_doc
+
+  elements = partition_doc(filename="example-docs/fake.doc")
+
+
 ``partition_pptx``
 ---------------------
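
The documentation added in this hunk says ``partition_doc`` accepts either a filename or a file-like object, but its example covers only the filename case. A minimal sketch of the file-like case follows; it mirrors the `partition_doc(file=f)` call exercised by this commit's tests. The helper name `partition_doc_from_file` is hypothetical, and the import is deferred so the module still loads where `unstructured` is not installed.

```python
from typing import List


def partition_doc_from_file(path: str) -> List:
    """Open a .doc file in binary mode and pass the file object to partition_doc."""
    # Deferred import: needs `unstructured` installed, and libreoffice at call time.
    from unstructured.partition.doc import partition_doc

    with open(path, "rb") as f:
        return partition_doc(file=f)
```

Usage would look like `elements = partition_doc_from_file("example-docs/fake.doc")`, assuming LibreOffice is available on the machine.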

example-docs/fake.doc

18 KB
Binary file not shown.

test_unstructured/partition/test_auto.py

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88
from unstructured.documents.elements import Address, NarrativeText, PageBreak, Title, Text, ListItem
99
from unstructured.partition.auto import partition
1010
import unstructured.partition.auto as auto
11+
from unstructured.partition.common import convert_office_doc
1112

1213
DIRECTORY = pathlib.Path(__file__).parent.resolve()
1314
EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, "..", "..", "example-docs")
@@ -96,6 +97,30 @@ def test_auto_partition_docx_with_file(mock_docx_document, expected_docx_element
9697
assert elements == expected_docx_elements
9798

9899

100+
def test_auto_partition_doc_with_filename(mock_docx_document, expected_docx_elements, tmpdir):
101+
docx_filename = os.path.join(tmpdir.dirname, "mock_document.docx")
102+
doc_filename = os.path.join(tmpdir.dirname, "mock_document.doc")
103+
mock_docx_document.save(docx_filename)
104+
convert_office_doc(docx_filename, tmpdir.dirname, "doc")
105+
106+
elements = partition(filename=doc_filename)
107+
assert elements == expected_docx_elements
108+
109+
110+
# NOTE(robinson) - the application/x-ole-storage mime type is not specific enough to
111+
# determine that the file is an .doc document
112+
@pytest.mark.xfail
113+
def test_auto_partition_doc_with_file(mock_docx_document, expected_docx_elements, tmpdir):
114+
docx_filename = os.path.join(tmpdir.dirname, "mock_document.docx")
115+
doc_filename = os.path.join(tmpdir.dirname, "mock_document.doc")
116+
mock_docx_document.save(docx_filename)
117+
convert_office_doc(docx_filename, tmpdir.dirname, "doc")
118+
119+
with open(doc_filename, "rb") as f:
120+
elements = partition(file=f)
121+
assert elements == expected_docx_elements
122+
123+
99124
def test_auto_partition_html_from_filename():
100125
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "example-10k.html")
101126
elements = partition(filename=filename)
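
The `xfail` note in the hunk above points at a real limitation: legacy `.doc`, `.xls`, and `.ppt` files all share the OLE2 compound-file container, so sniffing the bytes of an open file object reveals only the container type. A minimal sketch of that ambiguity, assuming the standard 8-byte OLE2 signature; the helper names are hypothetical and not part of `unstructured`.

```python
from typing import IO, Optional

# Signature at the start of every OLE2 compound file (.doc, .xls, .ppt, ...).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def is_ole2_container(file: IO[bytes]) -> bool:
    """Return True if the stream starts with the OLE2 signature; restores the position."""
    position = file.tell()
    file.seek(0)
    header = file.read(len(OLE2_MAGIC))
    file.seek(position)
    return header == OLE2_MAGIC


def guess_is_doc(file: IO[bytes], filename: Optional[str] = None) -> Optional[bool]:
    """Return True or False when decidable, None when the bytes alone are ambiguous."""
    if not is_ole2_container(file):
        return False
    if filename is None:
        return None  # OLE2 container, but it could be .doc, .xls, .ppt, ...
    return filename.lower().endswith(".doc")
```

With only a file object, the sketch returns `None` for any OLE2 file, which is the situation the expected-failure test records; a filename supplies the missing hint.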
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
+import os
+import pytest
+
+import docx
+
+from unstructured.documents.elements import Address, ListItem, NarrativeText, Title, Text
+from unstructured.partition.common import convert_office_doc
+from unstructured.partition.doc import partition_doc
+from unstructured.partition.docx import partition_docx
+
+
+@pytest.fixture
+def mock_document():
+    document = docx.Document()
+
+    document.add_paragraph("These are a few of my favorite things:", style="Heading 1")
+    # NOTE(robinson) - this should get picked up as a list item due to the •
+    document.add_paragraph("• Parrots", style="Normal")
+    # NOTE(robinson) - this should get dropped because it's empty
+    document.add_paragraph("• ", style="Normal")
+    document.add_paragraph("Hockey", style="List Bullet")
+    # NOTE(robinson) - this should get dropped because it's empty
+    document.add_paragraph("", style="List Bullet")
+    # NOTE(robinson) - this should get picked up as a title
+    document.add_paragraph("Analysis", style="Normal")
+    # NOTE(robinson) - this should get dropped because it is empty
+    document.add_paragraph("", style="Normal")
+    # NOTE(robinson) - this should get picked up as a narrative text
+    document.add_paragraph("This is my first thought. This is my second thought.", style="Normal")
+    document.add_paragraph("This is my third thought.", style="Body Text")
+    # NOTE(robinson) - this should just be regular text
+    document.add_paragraph("2023")
+    # NOTE(robinson) - this should be an address
+    document.add_paragraph("DOYLESTOWN, PA 18901")
+
+    return document
+
+
+@pytest.fixture
+def expected_elements():
+    return [
+        Title("These are a few of my favorite things:"),
+        ListItem("Parrots"),
+        ListItem("Hockey"),
+        Title("Analysis"),
+        NarrativeText("This is my first thought. This is my second thought."),
+        NarrativeText("This is my third thought."),
+        Text("2023"),
+        Address("DOYLESTOWN, PA 18901"),
+    ]
+
+
+def test_partition_doc_with_filename(mock_document, expected_elements, tmpdir):
+    docx_filename = os.path.join(tmpdir.dirname, "mock_document.docx")
+    doc_filename = os.path.join(tmpdir.dirname, "mock_document.doc")
+    mock_document.save(docx_filename)
+    convert_office_doc(docx_filename, tmpdir.dirname, "doc")
+
+    elements = partition_doc(filename=doc_filename)
+    assert elements == expected_elements
+
+
+def test_partition_doc_matches_partition_docx(mock_document, expected_elements, tmpdir):
+    docx_filename = os.path.join(tmpdir.dirname, "mock_document.docx")
+    doc_filename = os.path.join(tmpdir.dirname, "mock_document.doc")
+    mock_document.save(docx_filename)
+    convert_office_doc(docx_filename, tmpdir.dirname, "doc")
+
+    # Without `assert`, the comparison result is discarded and the test cannot fail.
+    assert partition_doc(filename=doc_filename) == partition_docx(filename=docx_filename)
+
+
+def test_partition_raises_with_missing_doc(mock_document, expected_elements, tmpdir):
+    doc_filename = os.path.join(tmpdir.dirname, "asdf.doc")
+
+    with pytest.raises(ValueError):
+        partition_doc(filename=doc_filename)
+
+
+def test_partition_doc_with_file(mock_document, expected_elements, tmpdir):
+    docx_filename = os.path.join(tmpdir.dirname, "mock_document.docx")
+    doc_filename = os.path.join(tmpdir.dirname, "mock_document.doc")
+    mock_document.save(docx_filename)
+    convert_office_doc(docx_filename, tmpdir.dirname, "doc")
+
+    with open(doc_filename, "rb") as f:
+        elements = partition_doc(file=f)
+    assert elements == expected_elements
+
+
+def test_partition_doc_raises_with_both_specified(mock_document, tmpdir):
+    docx_filename = os.path.join(tmpdir.dirname, "mock_document.docx")
+    doc_filename = os.path.join(tmpdir.dirname, "mock_document.doc")
+    mock_document.save(docx_filename)
+    convert_office_doc(docx_filename, tmpdir.dirname, "doc")
+
+    with open(doc_filename, "rb") as f:
+        with pytest.raises(ValueError):
+            partition_doc(filename=doc_filename, file=f)
+
+
+def test_partition_doc_raises_with_neither():
+    with pytest.raises(ValueError):
+        partition_doc()
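
The diff for `partition_doc` itself does not appear on this page, so the following is only a sketch of the contract the tests above pin down: exactly one of `filename` or `file`, a `ValueError` for a missing path, then conversion to `.docx` and delegation. The `convert` and `partition_docx` parameters are hypothetical injection points added so the sketch runs without LibreOffice; the real function may be structured differently.

```python
import os
import tempfile
from typing import IO, Callable, List, Optional


def partition_doc_sketch(
    convert: Callable[[str, str, str], None],
    partition_docx: Callable[[str], List],
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
) -> List:
    # Exactly one input source must be given.
    if (filename is None) == (file is None):
        raise ValueError("Exactly one of filename and file must be specified.")

    with tempfile.TemporaryDirectory() as tmpdir:
        if file is not None:
            # Spool the stream to disk so a CLI converter can read it.
            filename = os.path.join(tmpdir, "input.doc")
            with open(filename, "wb") as out:
                out.write(file.read())
        elif not os.path.exists(filename):
            raise ValueError(f"The file {filename} does not exist.")

        # Convert into tmpdir; the converter is assumed to keep the base name.
        convert(filename, tmpdir, "docx")
        base, _ = os.path.splitext(os.path.basename(filename))
        return partition_docx(os.path.join(tmpdir, base + ".docx"))
```

Passing `convert_office_doc` as `convert` and a small wrapper around `partition_docx(filename=...)` as the second argument would wire the sketch to the functions named in this commit.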

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.4.10"  # pragma: no cover
+__version__ = "0.4.11-dev0"  # pragma: no cover

unstructured/partition/auto.py

Lines changed: 3 additions & 0 deletions
@@ -1,6 +1,7 @@
 from typing import IO, Optional

 from unstructured.file_utils.filetype import detect_filetype, FileType
+from unstructured.partition.doc import partition_doc
 from unstructured.partition.docx import partition_docx
 from unstructured.partition.email import partition_email
 from unstructured.partition.html import partition_html
@@ -34,6 +35,8 @@ def partition(
     if file is not None:
         file.seek(0)

+    if filetype == FileType.DOC:
+        return partition_doc(filename=filename, file=file)
     if filetype == FileType.DOCX:
         return partition_docx(filename=filename, file=file)
     elif filetype == FileType.EML:
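
The `if`/`elif` chain above gains one branch per supported format. One alternative, shown only as a sketch and not how `partition` is written in this commit, is a lookup table keyed by the detected type. `FileKind` and `route` are hypothetical stand-ins for `FileType` and the routing logic.

```python
from enum import Enum
from typing import Callable, Dict, List


class FileKind(Enum):  # hypothetical stand-in for unstructured's FileType
    DOC = "doc"
    DOCX = "docx"


def route(kind: FileKind, handlers: Dict[FileKind, Callable[..., List]], **kwargs) -> List:
    """Look up the handler for a detected file kind and call it with the given kwargs."""
    try:
        handler = handlers[kind]
    except KeyError:
        raise ValueError(f"No partitioner registered for {kind}.") from None
    return handler(**kwargs)
```

The trade-off is that a table keeps the routing in one data structure, while the explicit chain in the diff is easier to step through in a debugger.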

unstructured/partition/common.py

Lines changed: 30 additions & 0 deletions
@@ -1,3 +1,4 @@
+import subprocess
 from typing import List, Optional, Union

 from unstructured.documents.elements import (
@@ -101,3 +102,32 @@ def add_element_metadata(
         element.metadata = metadata
         elements.append(element)
     return elements
+
+
+def convert_office_doc(input_filename: str, output_directory: str, target_format: str):
+    """Converts an office document to the target format using the libreoffice CLI."""
+    # NOTE(robinson) - In the future can also include win32com client as a fallback for windows
+    # users who do not have LibreOffice installed
+    # ref: https://stackoverflow.com/questions/38468442/
+    #       multiple-doc-to-docx-file-conversion-using-python
+    try:
+        subprocess.call(
+            [
+                "soffice",
+                "--headless",
+                "--convert-to",
+                target_format,
+                "--outdir",
+                output_directory,
+                input_filename,
+            ]
+        )
+    except FileNotFoundError:
+        raise FileNotFoundError(
+            """soffice command was not found. Please install libreoffice
+on your system and try again.
+
+- Install instructions: https://www.libreoffice.org/get-help/install-howto/
+- Mac: https://formulae.brew.sh/cask/libreoffice
+- Debian: https://wiki.debian.org/LibreOffice"""
+        )
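
`subprocess.call` returns the exit code without raising, so `convert_office_doc` as written only reports a missing `soffice` binary and ignores a non-zero exit. A hedged sketch of a stricter variant follows; the function names and the `executable` parameter are hypothetical, while the CLI flags are the ones used in the commit.

```python
import shutil
import subprocess
from typing import List


def build_soffice_command(
    input_filename: str, output_directory: str, target_format: str, executable: str = "soffice"
) -> List[str]:
    """Assemble the same CLI arguments the commit passes to soffice."""
    return [
        executable,
        "--headless",
        "--convert-to",
        target_format,
        "--outdir",
        output_directory,
        input_filename,
    ]


def convert_office_doc_checked(
    input_filename: str, output_directory: str, target_format: str, executable: str = "soffice"
) -> None:
    """Like convert_office_doc, but also raises when the process exits with an error code."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found; install LibreOffice and retry.")
    command = build_soffice_command(input_filename, output_directory, target_format, executable)
    # check=True raises CalledProcessError if the conversion exits with a non-zero code.
    subprocess.run(command, check=True)
```

Whether LibreOffice signals every conversion failure through its exit code is not something this sketch verifies, so checking that the output file exists afterwards may still be worthwhile.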
