@@ -22,6 +22,7 @@ If you call the ``partition`` function, ``unstructured`` will attempt to detect
2222file type and route it to the appropriate partitioning brick. All partitioning bricks
2323called within ``partition `` are called using the default kwargs. Use the document-type
2424specific bricks if you need to apply non-default settings.
25+ ``partition `` currently supports ``.docx ``, ``.eml ``, ``.html ``, ``.pdf ``, and ``.txt `` files.
2526
2627
2728.. code :: python
@@ -104,7 +105,7 @@ Examples:
104105 ``partition_pdf ``
105106---------------------
106107
107- The ``partition_pdf `` function segments a PDF document by calling the document image analysis API.
108+ The ``partition_pdf `` function segments a PDF document by calling the document image analysis API.
108109The intent of the parameters ``url `` and ``token `` is to allow users to self host an inference API,
109110if desired.
110111
@@ -122,7 +123,7 @@ Examples:
122123---------------------
123124
124125The ``partition_email `` function partitions ``.eml `` documents and works with exports
125- from email clients such as Microsoft Outlook and Gmail. The ``partition_email ``
126+ from email clients such as Microsoft Outlook and Gmail. The ``partition_email ``
126127takes a filename, file-like object, or raw text as input and produces a list of
127128document ``Element `` objects as output. Also ``content_source `` can be set to ``text/html ``
128129(default) or ``text/plain `` to process the html or plain text version of the email, respectively.
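
To make the ``content_source`` choice concrete, the sketch below uses only Python's standard ``email`` module to pick either the HTML or the plain-text body out of a multipart message. It is an illustration of the idea, not the implementation of ``partition_email``; the helper ``select_body`` and the sample message are hypothetical.

```python
from email import message_from_string

# A tiny multipart/alternative message carrying both a plain-text
# and an HTML version of the same body (sample data, made up here).
RAW_EMAIL = """\
MIME-Version: 1.0
Subject: Test
Content-Type: multipart/alternative; boundary="sep"

--sep
Content-Type: text/plain; charset="utf-8"

Hello, plain world.
--sep
Content-Type: text/html; charset="utf-8"

<p>Hello, <b>HTML</b> world.</p>
--sep--
"""


def select_body(raw: str, content_source: str = "text/html") -> str:
    """Return the payload of the first part whose MIME type matches.

    Hypothetical helper: it mirrors the meaning of ``content_source``
    but is not part of ``unstructured``.
    """
    message = message_from_string(raw)
    for part in message.walk():
        if part.get_content_type() == content_source:
            return part.get_payload().strip()
    raise ValueError(f"No {content_source} part found in the message")


html_body = select_body(RAW_EMAIL)                 # default: text/html
plain_body = select_body(RAW_EMAIL, "text/plain")  # plain-text version
print(plain_body)
```

With the default, the HTML part is selected; passing ``text/plain`` selects the plain-text part instead.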
@@ -157,7 +158,7 @@ Examples:
157158 ``partition_text ``
158159---------------------
159160
160- The ``partition_text `` function partitions text files. The ``partition_text ``
161+ The ``partition_text `` function partitions text files. The ``partition_text ``
161162takes a filename, file-like object, or raw text as input and produces ``Element `` objects as output.
162163
163164Examples:
@@ -629,7 +630,7 @@ addresses in the input string.
629630
630631 from unstructured.cleaners.extract import extract_email_address
631632
632- 633+ 633634 ([ba23::58b5:2236:45g2:88h2]) (10.0.2.01)"""
634635
635636
@@ -646,7 +647,7 @@ returns a list of all IP address in input string.
646647
647648 from unstructured.cleaners.extract import extract_ip_address
648649
649- 650+ 650651 ([ba23::58b5:2236:45g2:88h2]) (10.0.2.01)"""
651652
652653 # Returns "['ba23::58b5:2236:45g2:88h2', '10.0.2.01']"
@@ -656,7 +657,7 @@ returns a list of all IP address in input string.
656657 ``extract_ip_address_name ``
657658----------------------------
658659
659- Extracts the names of each IP address in the ``Received `` field(s) from an ``.eml ``
660+ Extracts the names of each IP address in the ``Received `` field(s) from an ``.eml ``
660661file. ``extract_ip_address_name `` takes in a string and returns a list of all
661662IP addresses in the input string.
662663
@@ -675,7 +676,7 @@ IP addresses in the input string.
675676 ``extract_mapi_id ``
676677----------------------
677678
678- Extracts the ``mapi id `` in the ``Received `` field(s) from an ``.eml ``
679+ Extracts the ``mapi id `` in the ``Received `` field(s) from an ``.eml ``
679680file. ``extract_mapi_id `` takes in a string and returns a list of a string
680681containing the ``mapi id `` in the input string.
681682
@@ -694,7 +695,7 @@ containing the ``mapi id`` in the input string.
694695 ``extract_datetimetz ``
695696----------------------
696697
697- Extracts the date, time, and timezone in the ``Received `` field(s) from an ``.eml ``
698+ Extracts the date, time, and timezone in the ``Received `` field(s) from an ``.eml ``
698699file. ``extract_datetimetz `` takes in a string and returns a datetime.datetime
699700object from the input string.
700701
@@ -754,7 +755,7 @@ other languages.
754755Parameters:
755756
756757* ``text ``: the input string to translate.
757- * ``source_lang ``: the two letter language code for the source language of the text.
758+ * ``source_lang ``: the two letter language code for the source language of the text.
758759 If ``source_lang `` is not specified,
759760 the language will be detected using ``langdetect ``.
760761* ``target_lang ``: the two letter language code for the target language for translation.
@@ -857,7 +858,7 @@ Examples:
857858--------------------------
858859
859860Prepares ``Text `` elements for processing in ``transformers `` pipelines
860- by splitting the elements into chunks that fit into the model's attention window.
861+ by splitting the elements into chunks that fit into the model's attention window.
861862
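The chunking idea can be sketched in a few lines of plain Python. This is a simplified stand-in, assuming that whitespace-separated words approximate model tokens; a real pipeline would count tokens with the model's own tokenizer. The helper ``chunk_by_words`` is hypothetical and is not part of ``unstructured``.

```python
from typing import List


def chunk_by_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most ``max_words`` words.

    Words are a rough proxy for tokens here (an assumption made for
    illustration only).
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = chunk_by_words("one two three four five", max_words=2)
print(chunks)
```

Each chunk stays within the limit, and joining the chunks with spaces recovers the original word sequence.
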
862863Examples:
863864
@@ -960,7 +961,7 @@ Examples:
960961 json.dump(label_studio_data, f, indent = 4 )
961962
962963
963- You can also include pre-annotations and predictions as part of your LabelStudio upload.
964+ You can also include pre-annotations and predictions as part of your LabelStudio upload.
964965
965966The ``annotations `` kwarg is a list of lists. If ``annotations `` is specified, there must be a list of
966967annotations for each element in the ``elements `` list. If an element does not have any annotations,
@@ -1009,7 +1010,7 @@ task in LabelStudio:
10091010
10101011 Similar to annotations, the ``predictions `` kwarg is also a list of lists. A ``prediction `` is an annotation with
10111012the addition of a ``score `` value. If ``predictions `` is specified, there must be a list of
1012- predictions for each element in the ``elements `` list. If an element does not have any predictions, use an empty list.
1013+ predictions for each element in the ``elements `` list. If an element does not have any predictions, use an empty list.
10131014The following shows an example of how to upload predictions for the "Text Classification"
10141015task in LabelStudio:
10151016
@@ -1167,13 +1168,13 @@ Examples:
11671168 ``stage_for_label_box ``
11681169--------------------------
11691170
1170- Formats outputs for use with `LabelBox <https://docs.labelbox.com/docs/overview >`_. LabelBox accepts cloud-hosted data
1171+ Formats outputs for use with `LabelBox <https://docs.labelbox.com/docs/overview >`_. LabelBox accepts cloud-hosted data
11711172and does not support importing text directly. The ``stage_for_label_box `` brick does the following:
11721173
11731174* Stages the data files in the ``output_directory `` specified in function arguments to be uploaded to a cloud storage service.
11741175* Returns a config of type ``List[Dict[str, Any]] `` that can be written to a ``json `` file and imported into LabelBox.
11751176
1176- **Note: ** ``stage_for_label_box `` does not upload the data to remote storage such as S3. Users can upload the data to S3
1177+ **Note: ** ``stage_for_label_box `` does not upload the data to remote storage such as S3. Users can upload the data to S3
11771178using ``aws s3 sync ${output_directory} ${url_prefix} `` after running the ``stage_for_label_box `` staging brick.
11781179
11791180Examples:
@@ -1197,7 +1198,7 @@ files to an S3 bucket.
11971198
11981199 # The URL prefix where the data files will be accessed.
11991200 S3_URL_PREFIX = f " https:// { S3_BUCKET_NAME } .s3.amazonaws.com/ { S3_BUCKET_KEY_PREFIX } "
1200-
1201+
12011202 # The local output directory where the data files will be staged for uploading to a Cloud Storage service.
12021203 LOCAL_OUTPUT_DIRECTORY = " /tmp/labelbox-staging"
12031204
@@ -1232,7 +1233,7 @@ files to an S3 bucket.
12321233--------------------------
12331234Formats a list of ``Text `` elements as input to token based tasks in Datasaur.
12341235
1235- Example:
1236+ Example:
12361237
12371238.. code :: python
12381239
@@ -1243,7 +1244,7 @@ Example:
12431244 datasaur_data = stage_for_datasaur(elements)
12441245
12451246 The output is a list of dictionaries, each one with two keys:
1246- "text" with the content of the element and
1247+ "text" with the content of the element and
12471248"entities" with an empty list.
12481249
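The output shape described above can be pictured with a plain-Python stand-in. This sketch only mirrors the structure of the result (a ``"text"`` key and an empty ``"entities"`` list per element); it is not the library's implementation, and ``sketch_datasaur_records`` is a hypothetical helper.

```python
from typing import Any, Dict, List


def sketch_datasaur_records(texts: List[str]) -> List[Dict[str, Any]]:
    """Build one dictionary per input string, with no entities attached."""
    return [{"text": text, "entities": []} for text in texts]


records = sketch_datasaur_records(["Text1", "Text2"])
print(records)
```
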
12491250You can also specify entities in the ``stage_for_datasaur `` brick. Entities