api-reference/workflow/overview.mdx (1 addition, 1 deletion)
@@ -222,7 +222,7 @@ The following Unstructured SDKs, tools, and libraries do _not_ work with the Uns
 - The [Unstructured JavaScript/TypeScript SDK](/api-reference/partition/sdk-jsts)
 - [Local single-file POST requests](/api-reference/partition/sdk-jsts) to the Unstructured Partition Endpoint
 - The [Unstructured open source Python library](/open-source/introduction/overview)
-- The [Unstructued Ingest CLI](/ingestion/ingest-cli)
+- The [Unstructured Ingest CLI](/ingestion/ingest-cli)
 - The [Unstructured Ingest Python library](/ingestion/python-ingest)
 
 The following Unstructured API URL is also _not_ supported: `https://api.unstructuredapp.io/general/v0/general` (the Unstructured Partition Endpoint URL).
ingestion/ingest-cli.mdx (1 addition, 1 deletion)
@@ -19,7 +19,7 @@ You can use the Unstructured Ingest CLI to process files locally, or you can use
 
 Local processing does not use an Unstructured API key or API URL.
 
-Using the Ingest CLI to send files in batches to Unstructured for processing is more robust but requires an Unstructured API key and API URL, as follows:
+Using the Ingest CLI to send files in batches to Unstructured for processing is more robust, and usage is billed to you on a pay-as-you-go basis. Usage requires an Unstructured API key and API URL, as follows:
ingestion/overview.mdx (2 additions, 2 deletions)
@@ -3,10 +3,10 @@ title: Overview
 ---
 
 <Note>
-Unstructured recommends that you use the [Unstructured API](/api-reference/overview) instead of the
+Unstructured recommends that you use the [Unstructured user interface (UI)](/ui/overview) or the [Unstructured API](/api-reference/overview) instead of the
 Unstructured Ingest CLI or the Unstructured Ingest Python library.
 
-The Unstructured API provides a full range of partitioning, chunking, embedding, and enrichment options for your files and data.
+The Unstructured UI and API provide a full range of partitioning, chunking, embedding, and enrichment options for your files and data.
 It also uses the latest and highest-performing models on the market today, and it has built-in logic to deliver the highest quality results
ingestion/python-ingest.mdx (1 addition, 1 deletion)
@@ -31,7 +31,7 @@ You can use the Unstructured Ingest Python library to process files locally, or
 
 Local processing does not use an Unstructured API key or API URL.
 
-Using the Ingest Python library to send files in batches to Unstructured for processing is more robust but requires an Unstructured API key and API URL, as follows:
+Using the Ingest Python library to send files in batches to Unstructured for processing is more robust, and usage is billed to you on a pay-as-you-go basis. Usage requires an Unstructured API key and API URL, as follows:
open-source/core-functionality/staging.mdx (1 addition, 4 deletions)
@@ -3,10 +3,7 @@ title: Staging
 ---
 
 <Warning>
-
-The `Staging` brick is being deprecated in favor of the new and more comprehensive `Destination Connectors`. To explore the complete list and usage, please refer to [Destination Connectors documentation](/ingestion/destination-connectors/overview).
-
-Note: We are constantly expanding our collection of destination connectors. If you wish to request a specific Destination Connector, you’re encouraged to submit a Feature Request on the [Unstructured GitHub repository](https://github.com/Unstructured-IO/unstructured/issues/new/choose).
+Staging functions in the Unstructured open source library are being deprecated in favor of [destination connectors](/ingestion/destination-connectors/overview) in the [Unstructured Ingest CLI and Unstructured Ingest Python library](/ingestion/overview).
 </Warning>
 
 Staging functions in the `unstructured` package help prepare your data for ingestion into downstream systems. A staging function accepts a list of document elements as input and return an appropriately formatted dictionary as output. In the example below, we get our narrative text samples prepared for ingestion into LabelStudio using `the stage_for_label_studio` function. We can take this data and directly upload it into LabelStudio to quickly get started with an NLP labeling task.
open-source/introduction/overview.mdx (12 additions, 13 deletions)
@@ -3,48 +3,47 @@ title: Unstructured Open Source
 sidebarTitle: Overview
 ---
 
-<Note>The `unstructured` open source library is designed as a starting point for quick prototyping and has [limits](#limits). For production scenarios, see the [Unstructured API](/api-reference/overview) instead.</Note>
+<Note>The Unstructured open source library is designed as a starting point for quick prototyping and has [limits](#limits). For production scenarios, use the [Unstructured user interface (UI)](/ui/overview) or the [Unstructured API](/api-reference/overview) instead.</Note>
 
-The `unstructured` [library](https://github.com/Unstructured-IO/unstructured) offers an open-source toolkit
+<Tip>To start using the Unstructured open source library right away, skip ahead to the [quickstart](/open-source/introduction/quick-start).</Tip>
+
+The Unstructured open source library ([GitHub](https://github.com/Unstructured-IO/unstructured), [PyPI](https://pypi.org/project/unstructured/)) offers an open-source toolkit
 designed to simplify the ingestion and pre-processing of diverse data formats, including images and text-based documents
 such as PDFs, HTML files, Word documents, and more. With a focus on optimizing data workflows for Large Language Models (LLMs),
-`unstructured` provides modular functions and connectors that work seamlessly together. This cohesive system ensures
+the Unstructured open source library provides modular functions and connectors that work seamlessly together. This cohesive system ensures
 efficient transformation of unstructured data into structured formats, while also offering adaptability to various platforms
 and use cases.
 
 ## Key functionality
 
-* **Precise Document Extraction**: Unstructured offers advanced capabilities in extracting elements and metadata from documents. This includes a variety of document element types and metadata. Learn more about [Document elements and metadata](../concepts/document-elements).
+* **Precise document extraction**: Unstructured offers advanced capabilities in extracting elements and metadata from documents. This includes a variety of document element types and metadata. Learn more about [Document elements and metadata](../concepts/document-elements).
 
-* **Extensive File Support**: The platform supports a wide array of file types, ensuring versatility in handling different document formats from PDF, Images, HTML, and many more. Detailed information on supported file types can be found [here](/open-source/introduction/supported-file-types).
+* **Robust file support**: The platform supports a wide array of file types, ensuring versatility in handling different document formats from PDF, Images, HTML, and many more. Detailed information on supported file types can be found [here](/open-source/introduction/supported-file-types).
 
-* **Robust Core Functionality**: Unstructured provides a suite of core functionalities critical for efficient data processing. This includes:
+* **Robust core functionality**: Unstructured provides a suite of core functionalities critical for efficient data processing. This includes:
 
 * [Partitioning](/open-source/core-functionality/partitioning): The partitioning functions in Unstructured enable the extraction of structured content from raw, unstructured documents. This feature is crucial for transforming unorganized data into usable formats, aiding in efficient data processing and analysis.
 
 * [Cleaning](/open-source/core-functionality/cleaning): Data preparation for NLP models often requires cleaning to ensure quality. The Unstructured library includes cleaning functions that assist in sanitizing output, removing unwanted content, and improving the performance of NLP models. This step is essential for maintaining the integrity of data before it is passed to downstream applications.
 
 * [Extracting](/open-source/core-functionality/extracting): This functionality allows for the extraction of specific entities within documents. It is designed to identify and isolate relevant pieces of information, making it easier for users to focus on the most pertinent data in their documents.
 
-* [Staging](/open-source/core-functionality/staging): Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of `Destination Connectors`.
-
+* [Staging](/open-source/core-functionality/staging): Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of [destination connectors](/ingestion/destination-connectors/overview) in the [Unstructured Ingest CLI and Unstructured Ingest Python library](/ingestion/overview).
+
 * [Chunking](/open-source/core-functionality/chunking): The chunking process in Unstructured is distinct from conventional methods. Instead of relying solely on text-based features to form chunks, Unstructured uses a deep understanding of document formats to partition documents into semantic units (document elements).
-
-* **High-performant Connectors**: The platform includes optimized connectors for efficient data ingestion and output. These comprise [Source Connectors](/ingestion/source-connectors/overview) for data input and [Destination Connectors](/ingestion/destination-connectors/overview) for data export.
 
-
 ## Common use cases
 
 * Pretraining models
 * Fine-tuning models
 * Retrieval Augmented Generation (RAG)
 * Traditional ETL
 
-<Note>We do not support GPU usage with the open source library.</Note>
+<Note>GPU usage is not supported for the Unstructured open source library.</Note>
 
 ## Limits
 
-The open source library has the following limits as compared to the [Unstructured UI](/ui/overview) and the [Unstructured API](/api-reference/overview):
+The Unstructured open source library has the following limits as compared to the [Unstructured UI](/ui/overview) and the [Unstructured API](/api-reference/overview):
 
 * Not designed for production scenarios.
 * Significantly decreased performance on document and table extraction.