
Commit b2f37c3

LaverdeSqued and qued authored
Docs: add Integrations section (#372)
* docs: update index, add integrations
* docs: fix typos
* docs: create integrations.rst section structure
* docs: descriptions and use for 8 integrations
* refactor: SEC example in Label Studio section
* Apply suggestions from code review
  Co-authored-by: qued <[email protected]>
* docs: change links order and refactor|paraphrase

---------

Co-authored-by: qued <[email protected]>
1 parent b47bfaf commit b2f37c3

File tree

3 files changed: +74 -4 lines changed

docs/source/bricks.rst

Lines changed: 4 additions & 4 deletions
@@ -119,7 +119,7 @@ faster processing and `"hi_res"` for
 ------------------

 The ``partition_docx`` partitioning brick pre-processes Microsoft Word documents
-saved in the ``.docx`` format. This staging brick uses a combination of the styling
+saved in the ``.docx`` format. This partition brick uses a combination of the styling
 information in the document and the structure of the text to determine the type
 of a text element. The ``partition_docx`` can take a filename or file-like object
 as input, as shown in the two examples below.
@@ -148,7 +148,7 @@ Examples:
 ------------------

 The ``partition_doc`` partitioning brick pre-processes Microsoft Word documents
-saved in the ``.doc`` format. This staging brick uses a combination of the styling
+saved in the ``.doc`` format. This partition brick uses a combination of the styling
 information in the document and the structure of the text to determine the type
 of a text element. The ``partition_doc`` can take a filename or file-like object
 as input.
@@ -169,7 +169,7 @@ Examples:
 ---------------------

 The ``partition_pptx`` partitioning brick pre-processes Microsoft PowerPoint documents
-saved in the ``.pptx`` format. This staging brick uses a combination of the styling
+saved in the ``.pptx`` format. This partition brick uses a combination of the styling
 information in the document and the structure of the text to determine the type
 of a text element. The ``partition_pptx`` can take a filename or file-like object
 as input, as shown in the two examples below.
@@ -190,7 +190,7 @@ Examples:
 ---------------------

 The ``partition_ppt`` partitioning brick pre-processes Microsoft PowerPoint documents
-saved in the ``.ppt`` format. This staging brick uses a combination of the styling
+saved in the ``.ppt`` format. This partition brick uses a combination of the styling
 information in the document and the structure of the text to determine the type
 of a text element. The ``partition_ppt`` can take a filename or file-like object.
 ``partition_ppt`` uses ``libreoffice`` to convert the file to ``.pptx`` and then

docs/source/index.rst

Lines changed: 3 additions & 0 deletions
@@ -20,6 +20,8 @@ Library Documentation
 :doc:`examples`
    Examples of other types of workflows within the ``unstructured`` package.

+:doc:`integrations`
+   We make it easy for you to connect your output with other popular ML services.

 .. Hidden TOCs

@@ -32,3 +34,4 @@ Library Documentation
    getting_started
    bricks
    examples
+   integrations

docs/source/integrations.rst

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
Integrations
============
Integrate your model development pipeline with your favorite machine learning frameworks and libraries,
and prepare your data for ingestion into downstream systems. Most of our integrations come in the form of
`staging bricks <https://unstructured-io.github.io/unstructured/bricks.html#staging>`_,
which take a list of ``Element`` objects as input and return formatted dictionaries as output.

``Integration with Argilla``
----------------------------
You can convert a list of ``Text`` elements to an `Argilla <https://www.argilla.io/>`_ ``Dataset`` using the `stage_for_argilla <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-argilla>`_ staging brick. Specify the type of dataset to be generated using the ``argilla_task`` parameter. Valid values are ``"text_classification"``, ``"token_classification"``, and ``"text2text"``. Follow the link for more details on usage.
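
A minimal sketch of what this might look like (the sample text is illustrative, and ``argilla`` must be installed alongside ``unstructured``):

.. code:: python

    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.argilla import stage_for_argilla

    elements = [NarrativeText(text="Risk factors are described in the section below.")]

    # Produces an Argilla dataset for a text classification task
    dataset = stage_for_argilla(elements, argilla_task="text_classification")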


``Integration with Datasaur``
-----------------------------
You can format a list of ``Text`` elements as input to token-based tasks in `Datasaur <https://datasaur.ai/>`_ using the `stage_for_datasaur <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-datasaur>`_ staging brick. You will obtain a list of dictionaries, each with a ``"text"`` key holding the content of the element and an ``"entities"`` key holding an empty list. Follow the link to learn how to customise your entities and for more details on usage.
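
A quick sketch of that output shape (the sample text is illustrative):

.. code:: python

    from unstructured.documents.elements import Text
    from unstructured.staging.datasaur import stage_for_datasaur

    elements = [Text(text="Example text for a token-level task.")]

    # Each dictionary carries the element text and an empty "entities" list
    datasaur_data = stage_for_datasaur(elements)
    # [{"text": "Example text for a token-level task.", "entities": []}]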


``Integration with Hugging Face``
---------------------------------
You can prepare ``Text`` elements for processing in Hugging Face `Transformers <https://huggingface.co/docs/transformers/index>`_
pipelines by splitting the elements into chunks that fit into the model's attention window using the `stage_for_transformers <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-transformers>`_ staging brick. You can customise the transformation by defining
the ``buffer`` and ``window_size``, the ``split_function``, and the ``chunk_separator``. If you need to operate on
text directly instead of ``unstructured`` ``Text`` objects, use the `chunk_by_attention_window <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-transformers>`_ helper function. Follow the links for more details on usage.
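
For example, a hedged sketch using a standard Hugging Face tokenizer (the model name and sample text are illustrative):

.. code:: python

    from transformers import AutoTokenizer

    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.huggingface import stage_for_transformers

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    elements = [NarrativeText(text="A long passage that may not fit in the attention window ...")]

    # Splits the elements into chunks sized for the model's attention window
    chunks = stage_for_transformers(elements, tokenizer)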


``Integration with LabelBox``
-----------------------------
You can format your outputs for use with `LabelBox <https://labelbox.com/>`_ using the `stage_for_label_box <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-label-box>`_ staging brick. LabelBox accepts cloud-hosted data and does not support importing text directly. With this integration you can stage the data files in the ``output_directory`` to be uploaded to a cloud storage service (such as an S3 bucket) and get a config of type ``List[Dict[str, Any]]`` that can be written to a ``.json`` file and imported into LabelBox. Follow the link to see how to generate the ``config.json`` file that can be used with LabelBox, how to upload the staged data files to an S3 bucket, and for more details on usage.
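
A minimal sketch, assuming you will upload the staged files to an S3 bucket (the directory, bucket URL, and IDs below are illustrative):

.. code:: python

    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_box import stage_for_label_box

    elements = [NarrativeText(text="A sentence to annotate.")]

    # Stages data files in output_directory and returns the LabelBox config
    config = stage_for_label_box(
        elements,
        output_directory="labelbox_staging",
        url_prefix="https://my-bucket.s3.amazonaws.com/labelbox-staging",
        external_ids=["example-1"],
    )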


``Integration with Label Studio``
---------------------------------
You can format your outputs for upload to `Label Studio <https://labelstud.io/>`_ using the `stage_for_label_studio <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-label-studio>`_ staging brick. After running ``stage_for_label_studio``, you can write the results
to a JSON folder that is ready to be included in a new Label Studio project. You can also include pre-annotations and predictions
as part of your upload.

Check our example notebook `here <https://unstructured-io.github.io/unstructured/examples.html#sentiment-analysis-labeling-in-labelstudio>`_, which formats and uploads the risk section of an SEC filing to Label Studio for a sentiment analysis labeling task. Follow the link for more details on usage, and check the `Label Studio docs <https://labelstud.io/tags/labels.html>`_ for a full list of options for labels and annotations.
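
As a minimal sketch (the sample text and output filename are illustrative):

.. code:: python

    import json

    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_studio import stage_for_label_studio

    elements = [NarrativeText(text="A sentence to label for sentiment.")]
    label_studio_data = stage_for_label_studio(elements)

    # Write the results to a JSON file that can be imported into a Label Studio project
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)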


``Integration with LangChain``
------------------------------
Our integration with `LangChain <https://github.com/hwchase17/langchain>`_ makes it incredibly easy to combine language models with your data, no matter what form it is in. The `Unstructured.io File Loader <https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/unstructured_file.html>`_ extracts the text from a variety of unstructured text files using our ``unstructured`` library. It is designed to be used as a way to load data into `LlamaIndex <https://github.com/jerryjliu/llama_index>`_ and/or subsequently used as a Tool in a LangChain Agent. See `here <https://github.com/emptycrown/llama-hub/tree/main>`_ for more `LlamaHub <https://llamahub.ai/>`_ examples.

To use the ``Unstructured.io File Loader`` you will need to have LlamaIndex 🦙 (GPT Index) installed in your environment. Just ``pip install llama-index`` and then pass in a ``Path`` to a local file. Optionally, you may specify ``split_documents`` if you want each element generated by ``unstructured`` to be placed in a separate document. Here is a simple example of how to use it:

.. code:: python

    from pathlib import Path
    from llama_index import download_loader


    UnstructuredReader = download_loader("UnstructuredReader")

    loader = UnstructuredReader()
    documents = loader.load_data(file=Path('./10k_filing.html'))


``Integration with Pandas``
---------------------------
You can convert a list of ``Element`` objects to a Pandas dataframe with columns for
the text from each element and their types, such as ``NarrativeText`` or ``Title``, using the `convert_to_dataframe <https://unstructured-io.github.io/unstructured/bricks.html#convert-to-dataframe>`_ staging brick. Follow the link for more details on usage.
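
For instance (the sample elements are illustrative):

.. code:: python

    from unstructured.documents.elements import NarrativeText, Title
    from unstructured.staging.base import convert_to_dataframe

    elements = [Title(text="Risk Factors"), NarrativeText(text="Our business could be affected by ...")]

    # One row per element, with columns for the element type and text
    df = convert_to_dataframe(elements)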


``Integration with Prodigy``
----------------------------
You can format your JSON or CSV outputs for use with `Prodigy <https://prodi.gy/docs/api-loaders>`_ using the `stage_for_prodigy <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-prodigy>`_ and `stage_csv_for_prodigy <https://unstructured-io.github.io/unstructured/bricks.html#stage-csv-for-prodigy>`_ staging bricks. After running ``stage_for_prodigy`` or
``stage_csv_for_prodigy``, you can write the results to a ``.json``/``.jsonl`` or a ``.csv`` file that is ready to be used with Prodigy. Follow the links for more details on usage.
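
A minimal sketch for the JSON case (the sample text, metadata, and filename are illustrative):

.. code:: python

    import json

    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.prodigy import stage_for_prodigy

    elements = [NarrativeText(text="A sentence to annotate in Prodigy.")]
    metadata = [{"type": "narrative"}]

    # List of dictionaries with "text" and "meta" keys, ready for Prodigy
    prodigy_data = stage_for_prodigy(elements, metadata)

    with open("prodigy.json", "w") as f:
        json.dump(prodigy_data, f, indent=4)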
