Commit 969bf87

Open source library: update quickstart (#637)
1 parent 4cf062c commit 969bf87

7 files changed, +224 -91 lines changed

api-reference/workflow/overview.mdx

Lines changed: 1 addition & 1 deletion
@@ -222,7 +222,7 @@ The following Unstructured SDKs, tools, and libraries do _not_ work with the Uns
 - The [Unstructured JavaScript/TypeScript SDK](/api-reference/partition/sdk-jsts)
 - [Local single-file POST requests](/api-reference/partition/sdk-jsts) to the Unstructured Partition Endpoint
 - The [Unstructured open source Python library](/open-source/introduction/overview)
-- The [Unstructued Ingest CLI](/ingestion/ingest-cli)
+- The [Unstructured Ingest CLI](/ingestion/ingest-cli)
 - The [Unstructured Ingest Python library](/ingestion/python-ingest)
 
 The following Unstructured API URL is also _not_ supported: `https://api.unstructuredapp.io/general/v0/general` (the Unstructured Partition Endpoint URL).

ingestion/ingest-cli.mdx

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ You can use the Unstructured Ingest CLI to process files locally, or you can use
 
 Local processing does not use an Unstructured API key or API URL.
 
-Using the Ingest CLI to send files in batches to Unstructured for processing is more robust but requires an Unstructured API key and API URL, as follows:
+Using the Ingest CLI to send files in batches to Unstructured for processing is more robust, and usage is billed to you on a pay-as-you-go basis. Usage requires an Unstructured API key and API URL, as follows:
 
 <GetStartedSimpleAPIOnly />
 

ingestion/overview.mdx

Lines changed: 2 additions & 2 deletions
@@ -3,10 +3,10 @@ title: Overview
 ---
 
 <Note>
-Unstructured recommends that you use the [Unstructured API](/api-reference/overview) instead of the
+Unstructured recommends that you use the [Unstructured user interface (UI)](/ui/overview) or the [Unstructured API](/api-reference/overview) instead of the
 Unstructured Ingest CLI or the Unstructured Ingest Python library.
 
-The Unstructured API provides a full range of partitioning, chunking, embedding, and enrichment options for your files and data.
+The Unstructured UI and API provide a full range of partitioning, chunking, embedding, and enrichment options for your files and data.
 It also uses the latest and highest-performing models on the market today, and it has built-in logic to deliver the highest quality results
 at the lowest cost.
 

ingestion/python-ingest.mdx

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ You can use the Unstructured Ingest Python library to process files locally, or
 
 Local processing does not use an Unstructured API key or API URL.
 
-Using the Ingest Python library to send files in batches to Unstructured for processing is more robust but requires an Unstructured API key and API URL, as follows:
+Using the Ingest Python library to send files in batches to Unstructured for processing is more robust, and usage is billed to you on a pay-as-you-go basis. Usage requires an Unstructured API key and API URL, as follows:
 
 <GetStartedSimpleAPIOnly />
 

open-source/core-functionality/staging.mdx

Lines changed: 1 addition & 4 deletions
@@ -3,10 +3,7 @@ title: Staging
 ---
 
 <Warning>
-
-The `Staging` brick is being deprecated in favor of the new and more comprehensive `Destination Connectors`. To explore the complete list and usage, please refer to [Destination Connectors documentation](/ingestion/destination-connectors/overview).
-
-Note: We are constantly expanding our collection of destination connectors. If you wish to request a specific Destination Connector, you’re encouraged to submit a Feature Request on the [Unstructured GitHub repository](https://github.com/Unstructured-IO/unstructured/issues/new/choose).
+Staging functions in the Unstructured open source library are being deprecated in favor of [destination connectors](/ingestion/destination-connectors/overview) in the [Unstructured Ingest CLI and Unstructured Ingest Python library](/ingestion/overview).
 </Warning>
 
 Staging functions in the `unstructured` package help prepare your data for ingestion into downstream systems. A staging function accepts a list of document elements as input and return an appropriately formatted dictionary as output. In the example below, we get our narrative text samples prepared for ingestion into LabelStudio using `the stage_for_label_studio` function. We can take this data and directly upload it into LabelStudio to quickly get started with an NLP labeling task.

open-source/introduction/overview.mdx

Lines changed: 12 additions & 13 deletions
@@ -3,48 +3,47 @@ title: Unstructured Open Source
 sidebarTitle: Overview
 ---
 
-<Note>The `unstructured` open source library is designed as a starting point for quick prototyping and has [limits](#limits). For production scenarios, see the [Unstructured API](/api-reference/overview) instead.</Note>
+<Note>The Unstructured open source library is designed as a starting point for quick prototyping and has [limits](#limits). For production scenarios, use the [Unstructured user interface (UI)](/ui/overview) or the [Unstructured API](/api-reference/overview) instead.</Note>
 
-The `unstructured` [library](https://github.com/Unstructured-IO/unstructured) offers an open-source toolkit
+<Tip>To start using the Unstructured open source library right away, skip ahead to the [quickstart](/open-source/introduction/quick-start).</Tip>
+
+The Unstructured open source library ([GitHub](https://github.com/Unstructured-IO/unstructured), [PyPI](https://pypi.org/project/unstructured/)) offers an open-source toolkit
 designed to simplify the ingestion and pre-processing of diverse data formats, including images and text-based documents
 such as PDFs, HTML files, Word documents, and more. With a focus on optimizing data workflows for Large Language Models (LLMs),
-`unstructured` provides modular functions and connectors that work seamlessly together. This cohesive system ensures
+the Unstructured open source library provides modular functions and connectors that work seamlessly together. This cohesive system ensures
 efficient transformation of unstructured data into structured formats, while also offering adaptability to various platforms
 and use cases.
 
 ## Key functionality
 
-* **Precise Document Extraction**: Unstructured offers advanced capabilities in extracting elements and metadata from documents. This includes a variety of document element types and metadata. Learn more about [Document elements and metadata](../concepts/document-elements).
+* **Precise document extraction**: Unstructured offers advanced capabilities in extracting elements and metadata from documents. This includes a variety of document element types and metadata. Learn more about [Document elements and metadata](../concepts/document-elements).
 
-* **Extensive File Support**: The platform supports a wide array of file types, ensuring versatility in handling different document formats from PDF, Images, HTML, and many more. Detailed information on supported file types can be found [here](/open-source/introduction/supported-file-types).
+* **Robust file support**: The platform supports a wide array of file types, ensuring versatility in handling different document formats from PDF, Images, HTML, and many more. Detailed information on supported file types can be found [here](/open-source/introduction/supported-file-types).
 
-* **Robust Core Functionality**: Unstructured provides a suite of core functionalities critical for efficient data processing. This includes:
+* **Robust core functionality**: Unstructured provides a suite of core functionalities critical for efficient data processing. This includes:
 
 * [Partitioning](/open-source/core-functionality/partitioning): The partitioning functions in Unstructured enable the extraction of structured content from raw, unstructured documents. This feature is crucial for transforming unorganized data into usable formats, aiding in efficient data processing and analysis.
 
 * [Cleaning](/open-source/core-functionality/cleaning): Data preparation for NLP models often requires cleaning to ensure quality. The Unstructured library includes cleaning functions that assist in sanitizing output, removing unwanted content, and improving the performance of NLP models. This step is essential for maintaining the integrity of data before it is passed to downstream applications.
 
 * [Extracting](/open-source/core-functionality/extracting): This functionality allows for the extraction of specific entities within documents. It is designed to identify and isolate relevant pieces of information, making it easier for users to focus on the most pertinent data in their documents.
 
-* [Staging](/open-source/core-functionality/staging): Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of `Destination Connectors`.
-
+* [Staging](/open-source/core-functionality/staging): Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of [destination connectors](/ingestion/destination-connectors/overview) in the [Unstructured Ingest CLI and Unstructured Ingest Python library](/ingestion/overview).
+
 * [Chunking](/open-source/core-functionality/chunking): The chunking process in Unstructured is distinct from conventional methods. Instead of relying solely on text-based features to form chunks, Unstructured uses a deep understanding of document formats to partition documents into semantic units (document elements).
-
-* **High-performant Connectors**: The platform includes optimized connectors for efficient data ingestion and output. These comprise [Source Connectors](/ingestion/source-connectors/overview) for data input and [Destination Connectors](/ingestion/destination-connectors/overview) for data export.
 
-
 ## Common use cases
 
 * Pretraining models
 * Fine-tuning models
 * Retrieval Augmented Generation (RAG)
 * Traditional ETL
 
-<Note>We do not support GPU usage with the open source library.</Note>
+<Note>GPU usage is not supported for the Unstructured open source library.</Note>
 
 ## Limits
 
-The open source library has the following limits as compared to the [Unstructured UI](/ui/overview) and the [Unstructured API](/api-reference/overview):
+The Unstructured open source library has the following limits as compared to the [Unstructured UI](/ui/overview) and the [Unstructured API](/api-reference/overview):
 
 * Not designed for production scenarios.
 * Significantly decreased performance on document and table extraction.
