You can also instantiate a document directly from an HTML string using the ``from_string`` method.
The output of this will be the following:

.. code:: python

    """
    SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS
    This report contains statements that do not relate to historical or current facts but are “forward-looking” statements. These statements relate to analyses and other information based on forecasts of future results and estimates of amounts not yet determinable. These statements may also relate to future events or trends, our future prospects and proposed new products, services, developments or business strategies, among other things. These statements can generally (although not always) be identified by their use of terms and phrases such as anticipate, appear, believe, could, would, estimate, expect, indicate, intent, may, plan, predict, project, pursue, will continue and other similar terms and phrases, as well as the use of the future tense.

    Actual results could differ materially from those expressed or implied in our forward-looking statements. Our future financial condition and results of operations, as well as any forward-looking statements, are subject to change and to inherent known and unknown risks and uncertainties. You should not assume at any point in the future that the forward-looking statements in this report are still valid. We do not intend, and undertake no obligation, to update our forward-looking statements to reflect future events or circumstances.
    """

If you then run:

.. code:: python

    doc.pages[2].elements

You'll get the following output, showing that the parser successfully differentiated between
titles and narrative text.

.. code:: python

    [<unstructured.documents.base.Title at 0x169cbe820>,
     <unstructured.documents.base.NarrativeText at 0x169cbe8e0>,
     <unstructured.documents.base.NarrativeText at 0x169cbe3a0>]


Creating HTML from XML with XSLT
--------------------------------

You can also convert XML files to HTML with the appropriate XSLT stylesheet. Note, XSLT
converts arbitrary XML to XML, so there's no guarantee the result will be HTML. Ensure
you're using a stylesheet designed to convert your specific XML to HTML. The workflow
for reading in a document with an XSLT stylesheet is as follows:

.. code:: python

    from unstructured.documents.html import HTMLDocument

Before running the code in this section, make sure you've installed the ``unstructured`` library
and all dependencies using the instructions in the **Quick Start** section.


#######################
Partitioning a document
#######################

In this section, we'll cut right to the chase and get to the most important part of the library: partitioning a document.
The goal of document partitioning is to read in a source document, split the document into sections, categorize those sections,
and extract the text associated with those sections. Depending on the document type, unstructured uses different methods for
partitioning a document. We'll cover those in a later section. For now, we'll use the simplest API in the library,
the ``partition`` function. The ``partition`` function will detect the filetype of the source document and route it to the appropriate
partitioning function. You can try out the ``partition`` function by running the cell below.

.. code:: python

    from unstructured.partition.auto import partition

    elements = partition(filename="example-10k.html")

You can also pass in a file as a file-like object using the following workflow:

.. code:: python

    with open("example-10k.html", "rb") as f:
        elements = partition(file=f)

The ``partition`` function uses `libmagic <https://formulae.brew.sh/formula/libmagic>`_ for filetype detection. If ``libmagic`` is
not present and the user passes a filename, ``partition`` falls back to detecting the filetype using the file extension.
``libmagic`` is required if you'd like to pass a file-like object to ``partition``.
We highly recommend installing ``libmagic``, because filetype detection may behave differently
if ``libmagic`` is not installed.
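
The detect-then-route behavior described above can be pictured with a small standard-library sketch.
This is not how ``partition`` is implemented: ``guess_filetype``, ``route``, the stub partitioners, and the
``PARTITIONERS`` table are hypothetical names invented for this illustration, and the lookup relies on
Python's ``mimetypes`` module (extension only), not on ``libmagic``.

.. code:: python

    import mimetypes
    from typing import Callable, Dict, List, Optional


    def guess_filetype(filename: str) -> Optional[str]:
        """Guess a MIME type from the file extension alone (no content sniffing)."""
        mime_type, _encoding = mimetypes.guess_type(filename)
        return mime_type


    def partition_html_stub(filename: str) -> List[str]:
        # Hypothetical stand-in for an HTML-specific partitioning function.
        return [f"html elements from {filename}"]


    def partition_text_stub(filename: str) -> List[str]:
        # Hypothetical stand-in for a plain-text partitioning function.
        return [f"text elements from {filename}"]


    # Illustrative routing table: detected MIME type -> partitioning function.
    PARTITIONERS: Dict[str, Callable[[str], List[str]]] = {
        "text/html": partition_html_stub,
        "text/plain": partition_text_stub,
    }


    def route(filename: str) -> List[str]:
        """Detect the filetype, then dispatch to the matching partitioner."""
        mime_type = guess_filetype(filename)
        if mime_type not in PARTITIONERS:
            raise ValueError(f"Unsupported or undetected filetype: {mime_type!r}")
        return PARTITIONERS[mime_type](filename)


    print(route("example-10k.html"))

An extension-only lookup cannot work for a bare file-like object, which has no filename to inspect.
That is consistent with the requirement that ``libmagic`` be available when a file object is passed.
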


##################
Document elements
##################

When we partition a document, the output is a list of document ``Element`` objects.
These element objects represent different components of the source document. Currently, the ``unstructured`` library supports the following element types:

* ``Element``
* ``Text``
* ``FigureCaption``
* ``NarrativeText``
* ``ListItem``
* ``Title``
* ``Address``
* ``PageBreak``
* ``CheckBox``
* ``Image``

Other element types that we will add in the future include tables and figures.
Different partitioning functions use different methods for determining the element type and extracting the associated content.
Document elements have a ``str`` representation. You can print them using the snippet below.

.. code:: python

    elements = partition(filename="example-10k.html")

    for element in elements[:5]:
        print(element)
        print("\n")

One helpful aspect of document elements is that they allow you to cut a document down to the elements that you need for your particular use case.
For example, if you're training a summarization model you may only want to include narrative text for model training.
You'll notice that the output above includes a lot of titles and other content that may not be suitable for a summarization model.
The following code shows how you can limit your output to only narrative text with more than two sentences. As you can see, the output now only contains narrative text.

.. code:: python

    from unstructured.documents.elements import NarrativeText
    from unstructured.partition.text_type import sentence_count

    for element in elements[:100]:
        if isinstance(element, NarrativeText) and sentence_count(element.text) > 2:
            print(element)
            print("\n")
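
To see the filtering logic in isolation, here is a minimal sketch that runs without the library installed.
Everything in it is an assumption made for illustration: ``SketchText``, ``SketchTitle``, and
``SketchNarrativeText`` are stand-in classes, and ``naive_sentence_count`` is a crude punctuation-based
counter, not the library's ``sentence_count``.

.. code:: python

    import re
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class SketchText:
        """Stand-in for a document element that only carries text."""

        text: str

        def __str__(self) -> str:
            return self.text


    class SketchTitle(SketchText):
        """Stand-in for a title element."""


    class SketchNarrativeText(SketchText):
        """Stand-in for a narrative text element."""


    def naive_sentence_count(text: str) -> int:
        """Count runs of '.', '!' or '?' followed by whitespace or end of text."""
        return len(re.findall(r"[.!?]+(?:\s|$)", text))


    def keep_long_narrative(elements: List[SketchText], threshold: int = 2) -> List[SketchText]:
        """Keep narrative elements with more than ``threshold`` sentences."""
        return [
            element
            for element in elements
            if isinstance(element, SketchNarrativeText)
            and naive_sentence_count(element.text) > threshold
        ]


    sample_elements = [
        SketchTitle("RISK FACTORS"),
        SketchNarrativeText("We may lose money. Sales may fall. Costs may rise."),
        SketchNarrativeText("One short sentence."),
    ]

    for element in keep_long_narrative(sample_elements):
        print(element)

The type check removes titles regardless of length, and the sentence threshold removes short narrative
fragments, so the two conditions do different jobs.
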


###########################################
Converting elements to a dictionary or JSON
###########################################

The final step in the process for most users is to convert the output to JSON.
You can convert a list of document elements to a list of dictionaries using the ``convert_to_dict`` function.
The workflow for using ``convert_to_dict`` appears below.

.. code:: python

    from unstructured.staging.base import convert_to_dict

    convert_to_dict(elements)

The ``unstructured`` library also includes utilities for saving a list of elements to JSON and reading
a list of elements from JSON, as seen in the snippet below.

.. code:: python

    from unstructured.staging.base import elements_to_json, elements_from_json

    filename = "outputs.json"
    elements_to_json(elements, filename=filename)
    elements = elements_from_json(filename=filename)
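
The round trip above can be illustrated with the standard library alone. The sketch below is a
simplified model under stated assumptions, not the library's serialization code: ``SketchElement``,
``to_dicts``, ``from_dicts``, ``save_json``, and ``load_json`` are hypothetical names, and the
``"type"``/``"text"`` keys are an assumed schema chosen for the example.

.. code:: python

    import json
    import os
    import tempfile
    from dataclasses import dataclass
    from typing import Dict, List


    @dataclass
    class SketchElement:
        """Stand-in element: a category label plus the extracted text."""

        category: str
        text: str


    def to_dicts(elements: List[SketchElement]) -> List[Dict[str, str]]:
        """Convert elements to plain dictionaries that ``json`` can serialize."""
        return [{"type": e.category, "text": e.text} for e in elements]


    def from_dicts(records: List[Dict[str, str]]) -> List[SketchElement]:
        """Rebuild elements from the dictionaries produced by ``to_dicts``."""
        return [SketchElement(category=r["type"], text=r["text"]) for r in records]


    def save_json(elements: List[SketchElement], filename: str) -> None:
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(to_dicts(elements), f, indent=2)


    def load_json(filename: str) -> List[SketchElement]:
        with open(filename, encoding="utf-8") as f:
            return from_dicts(json.load(f))


    sketch_elements = [
        SketchElement("Title", "RISK FACTORS"),
        SketchElement("NarrativeText", "We may lose money."),
    ]

    # Write to a temporary directory so the example leaves the working directory untouched.
    path = os.path.join(tempfile.mkdtemp(), "outputs.json")
    save_json(sketch_elements, path)
    restored = load_json(path)
    print(restored == sketch_elements)

Storing the element type next to the text is what makes the reverse direction possible: without it,
a reader of the JSON file could not tell a title from narrative text.
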


##################
Wrapping it all up
##################

To conclude, the basic workflow for reading in a document and converting it to JSON in ``unstructured``
looks like the following:

.. code:: python

    from unstructured.partition.auto import partition
    from unstructured.staging.base import elements_to_json

examples/training/0-Core Concepts.ipynb
Lines changed: 3 additions & 33 deletions

@@ -274,7 +274,7 @@
 "source": [
 "## Converting to a dictionary <a id=\"dict\"></a>\n",
 "\n",
-"The final step in the process for most users is to convert the output to JSON. You can convert a list of document elements to a list of dictionaries using the `convert_to_isd` function. ISD stands for \"initial structured data\", our common format for representing text data. The workflow for using `convert_to_isd` appears below."
+"The final step in the process for most users is to convert the output to JSON. You can convert a list of document elements to a list of dictionaries using the `convert_to_dict` function. The workflow for using `convert_to_dict` appears below."