feat: add partition_ppt for older power point docs (#238)
* added partition_ppt function and tests
* add ppt support to auto
* version bump
* update docs
* doc fixes
* update changelog
* `.docx` -> `.pptx`
* its -> their
* remove whitespace
docs/source/bricks.rst (22 additions & 3 deletions)
@@ -22,7 +22,7 @@ If you call the ``partition`` function, ``unstructured`` will attempt to detect
 file type and route it to the appropriate partitioning brick. All partitioning bricks
 called within ``partition`` are called using the defualt kwargs. Use the document-type
 specific bricks if you need to apply non-default settings.
-``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.eml``, ``.html``, ``.pdf``,
+``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.html``, ``.pdf``,
 ``.png``, ``.jpg``, and ``.txt`` files.
 If you set the ``include_page_breaks`` kwarg to ``True``, the output will include page breaks. This is only supported for ``.pptx``, ``.html``, ``.pdf``,
 ``.png``, and ``.jpg``.
@@ -89,8 +89,8 @@ The ``partition_doc`` partitioning brick pre-processes Microsoft Word documents
 saved in the ``.doc`` format. This staging brick uses a combination of the styling
 information in the document and the structure of the text to determine the type
 of a text element. The ``partition_doc`` can take a filename or file-like object
-as input, as shown in the two examples below. ``partiton_doc``
-uses ``libreoffice`` to convert the file to ``.docx`` and then
+as input.
+``partiton_doc`` uses ``libreoffice`` to convert the file to ``.docx`` and then
 calls ``partition_docx``. Ensure you have ``libreoffice`` installed
 before using ``partition_doc``.
 
@@ -124,6 +124,25 @@ Examples:
       elements = partition_pptx(file=f)
 
 
+``partition_ppt``
+---------------------
+
+The ``partition_ppt`` partitioning brick pre-processes Microsoft PowerPoint documents
+saved in the ``.ppt`` format. This staging brick uses a combination of the styling
+information in the document and the structure of the text to determine the type
+of a text element. The ``partition_ppt`` can take a filename or file-like object.
+``partition_ppt`` uses ``libreoffice`` to convert the file to ``.pptx`` and then
+calls ``partition_pptx``. Ensure you have ``libreoffice`` installed
+before using ``partition_ppt``.
+
+Examples:
+
+.. code:: python
+
+  from unstructured.partition.ppt import partition_ppt
+
+  elements = partition_ppt(filename="example-docs/fake-power-point.ppt")
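
Note on the conversion step: the added docs say ``partition_ppt`` uses ``libreoffice`` to convert the file to ``.pptx`` and then calls ``partition_pptx``. The block below is a minimal sketch of that convert-then-partition flow, not the implementation from this commit. It assumes the LibreOffice CLI is on ``PATH`` as ``soffice``, and the helper names ``build_convert_command``, ``converted_path``, and ``partition_ppt_sketch`` are hypothetical.

```python
import os
import subprocess
import tempfile


def build_convert_command(filename, outdir):
    """Build a LibreOffice CLI call that converts `filename` to .pptx in `outdir`."""
    # Assumption: the LibreOffice binary is available on PATH as `soffice`.
    return ["soffice", "--headless", "--convert-to", "pptx", "--outdir", outdir, filename]


def converted_path(filename, outdir):
    """Path of the converted file: same base name, .pptx extension, inside `outdir`."""
    base = os.path.splitext(os.path.basename(filename))[0]
    return os.path.join(outdir, base + ".pptx")


def partition_ppt_sketch(filename):
    """Hypothetical stand-in for partition_ppt: convert to .pptx, then partition."""
    # Imported lazily so the helpers above can be used without `unstructured` installed.
    from unstructured.partition.pptx import partition_pptx

    with tempfile.TemporaryDirectory() as tmpdir:
        # check=True raises CalledProcessError if the conversion fails.
        subprocess.run(build_convert_command(filename, tmpdir), check=True)
        # Partition while the temporary directory (and converted file) still exists.
        return partition_pptx(filename=converted_path(filename, tmpdir))
```

Writing the converted file to a temporary directory keeps the original ``.ppt`` location untouched. If ``libreoffice`` is not installed, ``subprocess.run`` raises ``FileNotFoundError``, which is why the docs ask for it to be installed first.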