Commit b3c60da

Platform UI: latest list of available connectors, workflow settings (#255)
1 parent 81a6723 commit b3c60da

8 files changed: +45 -91 lines changed

mint.json

Lines changed: 2 additions & 15 deletions
@@ -432,29 +432,16 @@
 "pages": [
 "platform/sources/overview",
 "platform/sources/azure-blob-storage",
-"platform/sources/elasticsearch",
-"platform/sources/google-drive",
-"platform/sources/onedrive-cloud-storage",
-"platform/sources/opensearch",
-"platform/sources/s3",
-"platform/sources/salesforce",
-"platform/sources/sftp-storage",
-"platform/sources/sharepoint"
+"platform/sources/s3"
 ]
 },
 {
 "group": "Destinations",
 "pages": [
 "platform/destinations/overview",
 "platform/destinations/azure-cognitive-search",
-"platform/destinations/chroma",
-"platform/destinations/databricks",
-"platform/destinations/elasticsearch",
-"platform/destinations/mongodb",
-"platform/destinations/opensearch",
 "platform/destinations/pinecone",
-"platform/destinations/s3",
-"platform/destinations/weaviate"
+"platform/destinations/s3"
 ]
 },
 "platform/workflows",

platform/connectors.mdx

Lines changed: 0 additions & 13 deletions
@@ -12,14 +12,7 @@ The Unstructured Platform supports connecting to the following source and destin
 ## Sources

 - [Azure](/platform/sources/azure-blob-storage)
-- [Elasticsearch](/platform/sources/elasticsearch)
-- [Google Drive](/platform/sources/google-drive)
-- [OneDrive](/platform/sources/onedrive-cloud-storage)
-- [OpenSearch](/platform/sources/opensearch)
 - [S3](/platform/sources/s3)
-- [Salesforce](/platform/sources/salesforce)
-- [SFTP](/platform/sources/sftp-storage)
-- [SharePoint](/platform/sources/sharepoint)

 If your source is not listed here, you might still be able to connect Unstructured to it through scripts or code by using the
 [Unstructured Ingest CLI](/ingestion/overview#unstructured-ingest-cli) or the
@@ -29,14 +22,8 @@ If your source is not listed here, you might still be able to connect Unstructur
 ## Destinations

 - [Azure Cognitive Search](/platform/destinations/azure-cognitive-search)
-- [Chroma](/platform/destinations/chroma)
-- [Databricks Volumes](/platform/destinations/databricks)
-- [Elasticsearch](/platform/destinations/elasticsearch)
-- [MongoDB](/platform/destinations/mongodb)
-- [OpenSearch](/platform/destinations/opensearch)
 - [Pinecone](/platform/destinations/pinecone)
 - [S3](/platform/destinations/s3)
-- [Weaviate](/platform/destinations/weaviate)

 If your destination is not listed here, you might still be able to connect Unstructured to it through scripts or code by using the
 [Unstructured Ingest CLI](/ingestion/overview#unstructured-ingest-cli) or the

platform/destinations/overview.mdx

Lines changed: 0 additions & 6 deletions
@@ -15,14 +15,8 @@ To create a destination connector:
 4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list:

 - [Azure Cognitive Search](/platform/destinations/azure-cognitive-search)
-- [Chroma](/platform/destinations/chroma)
-- [Databricks Volumes](/platform/destinations/databricks)
-- [Elasticsearch](/platform/destinations/elasticsearch)
-- [MongoDB](/platform/destinations/mongodb)
-- [OpenSearch](/platform/destinations/opensearch)
 - [Pinecone](/platform/destinations/pinecone)
 - [S3](/platform/destinations/s3)
-- [Weaviate](/platform/destinations/weaviate)

 5. Click **Save and Test**.
 6. Click **Close**.

platform/overview.mdx

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ To get your data RAG-ready, the Unstructured Platform moves it through the follo
 Routing determines which strategy Unstructured Platform uses to transforming your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation:

 - **Fast** is great for when there is extractable text available, like in HTML files or in the Microsoft Office Document format.
-- **Hi-Res** is best for PDFs and tables and where accurate classification of document elements is critical.
+- **Hi Res** is best for PDFs and tables and where accurate classification of document elements is critical.
 - If you're unsure which strategy to use, choose **Auto**, and the Unstructured Platform will handle the decision for you.

 </Step>

platform/partitioning.mdx

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ To choose one of these strategies, select one of the **Strategy** options in the
 - You have only PDF files, and you know that none of them have embedded images or tables in them, or
 - You have no PDF files or image files at all.

-- **Hi-Res**: This strategy uses an image-to-text model for inference. It is slower and costlier than **Fast** but can provide
+- **Hi Res**: This strategy uses an image-to-text model for inference. It is slower and costlier than **Fast** but can provide
 higher-quality resolution. You should choose this strategy if you know that:

 - All of the files are only image files, or

platform/sources/overview.mdx

Lines changed: 0 additions & 7 deletions
@@ -16,14 +16,7 @@ To create a source connector:
 4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list:

 - [Azure](/platform/sources/azure-blob-storage)
-- [Elasticsearch](/platform/sources/elasticsearch)
-- [Google Drive](/platform/sources/google-drive)
-- [OneDrive](/platform/sources/onedrive-cloud-storage)
-- [OpenSearch](/platform/sources/opensearch)
 - [S3](/platform/sources/s3)
-- [Salesforce](/platform/sources/salesforce)
-- [SFTP](/platform/sources/sftp-storage)
-- [SharePoint](/platform/sources/sharepoint)

 5. Click **Save and Test**.
 6. Click **Close**.

platform/workflows.mdx

Lines changed: 39 additions & 46 deletions
@@ -14,8 +14,6 @@ Workflows are crucial for establishing a systematic approach to managing data fl

 ## Create a workflow

-![Create workflow](/img/platform/Create-Workflow.png)
-
 <Warning>
 You must first have an existing source connector and destination connector to add to the workflow.

@@ -36,8 +34,8 @@ To create a workflow:

 6. In the **Workflow Settings** section, choose one of these predefined workflow settings groups:

-- **Basic** is a good choice if you have documents that have no images or tables in them.
-- **Advanced** is a good choice if you have documents that have images or tables or both in them.
+- **Basic** is a good choice if you have text-only documents that have no images or tables in them.
+- **Advanced** is a good choice if you have complex documents that have images or tables or both in them.

 Learn about the predefined settings for [Basic](#basic-workflow-settings) and [Advanced](#advanced-workflow-settings).

@@ -63,26 +61,28 @@ To learn more about these settings, see the descriptions for [Custom workflow se
 - **Transform** section:

 - [Strategy](/platform/partitioning): **Fast**
-- [Image Summarization](/platform/summarizing): **None**
-- [Table Summarization](/platform/summarizing): **None**
+- [Image summarization](/platform/summarizing): **None**
+- [Table summarization](/platform/summarizing): **None**
 - **Connector Settings**:

 - **Include Page Breaks**: No (unchecked)
-- **Infer Table Structure**: No (unchecked)
+- **Infer Table Structure**: Yes (checked)
+
+- **Elements to Exclude**: None (nothing selected)

 - [Chunk](/platform/chunking) section:

 - **Chunker Type**: **Basic**
 - **Include Original Elements**: No (unchecked)
-- **Max Characters**: **500**
-- **New After N Characters**: **1000**
-- **Overlap**: **100**
+- **Max Characters**: **2048**
+- **New After N Characters**: **1500**
+- **Overlap**: **160**
 - **Overlap All**: No (unchecked)

 - [Embed](/platform/embedding) section:

 - **Vendor**: **OpenAI**
-- **Embedding Model**: [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings)
+- **Embedding Model**: [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings) (1536 dimensions)

 ## Advanced workflow settings

@@ -92,31 +92,31 @@ To learn more about these settings, see the descriptions for [Custom workflow se

 - **Transform** section:

-- [Strategy](/platform/partitioning): **Hi-Res**
-- [Image Summarization](/platform/summarizing): **Claude 3.5 Sonnet**
-- [Table Summarization](/platform/summarizing): **GPT-4o**
+- [Strategy](/platform/partitioning): **Hi Res**
+- [Image summarization](/platform/summarizing): **Claude 3.5 Sonnet**
+- [Table summarization](/platform/summarizing): **GPT-4o**
 - **Connector Settings**:

-- **Include Page Breaks**: Yes (checked)
-- **Infer Table Structure**: Yes (checked)
+- **Include Page Breaks**: No (unchecked)
+- **Infer Table Structure**: No (unchecked)

-- **Elements to Exclude**: None
+- **Elements to Exclude**: None (nothing selected)

 - [Chunk](/platform/chunking) section:

-- **Chunker Type**: **By Title**
-- **Combine Text Under N Characters**: **1000**
-- **Include Original Elements**: Yes (checked)
-- **Max Characters**: **500**
+- **Chunker Type**: **Chunk By Title**
+- **Combine Text Under N Characters**: **0**
+- **Include Original Elements**: No (unchecked)
+- **Max Characters**: **2048**
 - **Multipage Sections**: Yes (checked)
-- **New After N Characters**: **1000**
-- **Overlap**: **100**
-- **Overlap All**: Yes (checked)
+- **New After N Characters**: **1500**
+- **Overlap**: **160**
+- **Overlap All**: No (unchecked)

 - [Embed](/platform/embedding) section:

 - **Vendor**: **OpenAI**
-- **Embedding Model**: [text-embedding-3-large](https://platform.openai.com/docs/guides/embeddings) (3072 dimensions)
+- **Embedding Model**: [text-embedding-3-large](https://platform.openai.com/docs/guides/embeddings) (3072 dimensions)

 ## Custom workflow settings

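Reviewer note: the new chunk defaults in the two hunks above (**Max Characters** 2048, **Overlap** 160) can be pictured with a minimal sliding-window sketch. This is an illustration only, not Unstructured's chunker: the function name `chunk_with_overlap` is hypothetical, it splits on raw character offsets rather than on document elements, and it ignores the soft limit set by **New After N Characters**.

```python
def chunk_with_overlap(text: str, max_characters: int = 2048, overlap: int = 160) -> list[str]:
    """Split text into windows of at most max_characters.

    Each window after the first starts `overlap` characters before the
    end of the previous window, so neighbouring chunks share that text.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    chunks = []
    step = max_characters - overlap  # how far each window advances
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Hypothetical input: 5000 characters cycling through the lowercase alphabet.
sample = "".join(chr(97 + i % 26) for i in range(5000))
chunks = chunk_with_overlap(sample)
```

With these defaults each window advances by 2048 - 160 = 1888 characters, so the 5000-character sample yields three chunks and no chunk exceeds the hard limit.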
@@ -130,7 +130,7 @@ The following workflow settings can be customized:
 1. For **Strategy**, choose one of the following:

 - **Fast**: This strategy uses traditional NLP extraction techniques to quickly pull in all text elements. This strategy is not good for image-based file types. [Learn more](/platform/partitioning).
-- **Hi-Res**: This strategy uses document layout to gain additional information about document elements. Unstructured recommends using this strategy if your use case is highly sensitive to correct classifications for document elements. [Learn more](/platform/partitioning).
+- **Hi Res**: This strategy uses document layout to gain additional information about document elements. Unstructured recommends using this strategy if your use case is highly sensitive to correct classifications for document elements. [Learn more](/platform/partitioning).
 - **Auto**: This strategy chooses the partitioning strategy based on detected document characteristics. [Learn more](/platform/partitioning).

 2. For **Image summarization**, choose one of the following:
@@ -152,7 +152,7 @@ The following workflow settings can be customized:
 4. For **Connector Settings**, check one or more of the following boxes:

 - **Include Page Breaks**: Include page breaks in the output, if the file type supports it.
-- **Infer Table Structure**: If you also set **Strategy** to **Hi-Res**, any table elements extracted from a PDF will include an additional metadata field, `text_as_html`, that contains a transformation of the data into an HTML `<table>`.
+- **Infer Table Structure**: If you also set **Strategy** to **Hi Res**, any table elements extracted from a PDF will include an additional metadata field, `text_as_html`, that contains a transformation of the data into an HTML `<table>`.

 5. For **Elements to Exclude**, select one or more standard Unstructured element types to not include in the output. [Learn more](/platform/document-elements).
 </Accordion>
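Reviewer note: as a rough illustration of the `text_as_html` field mentioned in this hunk, the sketch below collects the HTML fragments from a list of output elements. It assumes each element is a JSON object with `type`, `text`, and `metadata` keys; the helper name `table_html_fragments` and the sample records are hypothetical.

```python
from typing import Any


def table_html_fragments(elements: list[dict[str, Any]]) -> list[str]:
    """Return the text_as_html value of every Table element that has one."""
    fragments = []
    for element in elements:
        metadata = element.get("metadata", {})
        if element.get("type") == "Table" and "text_as_html" in metadata:
            fragments.append(metadata["text_as_html"])
    return fragments


# Hand-written sample records, not real platform output.
sample_elements = [
    {"type": "Title", "text": "Quarterly results", "metadata": {}},
    {
        "type": "Table",
        "text": "Region Revenue",
        "metadata": {
            "text_as_html": "<table><tr><td>Region</td><td>Revenue</td></tr></table>"
        },
    },
    # A table without inferred structure carries no text_as_html field.
    {"type": "Table", "text": "Region Revenue", "metadata": {}},
]

html_tables = table_html_fragments(sample_elements)
```

Only the second record contributes a fragment; tables partitioned without **Infer Table Structure** are skipped rather than raising an error.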
@@ -190,7 +190,7 @@ The following workflow settings can be customized:

 - **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk.
 - **Max Characters** (_required_): Cut off new sections after reaching a length of this many characters. This is a strict limit.
-- **Similarity Threshold** (_required_): Specify a threshold between 0 and 1, where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consider the trade-offs between precision (a higher threshold) and recall (a lower threshold). [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061).
+- **Similarity Threshold** (_required_): Specify a threshold between 0 and 1 exclusive (0.01 to 0.99 inclusive), where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consider the trade-offs between precision (a higher threshold) and recall (a lower threshold). [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061).

 [Learn more](https://unstructured.io/blog/chunking-for-rag-best-practices).
 </Accordion>
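Reviewer note: the tightened range for **Similarity Threshold** (0.01 to 0.99 inclusive) could be checked as in the sketch below. The function names are hypothetical, and using cosine similarity as the measure is an assumption; the diff does not say how the platform compares vectors.

```python
import math


def validate_similarity_threshold(value: float) -> float:
    """Return value if it lies in the documented 0.01 to 0.99 range."""
    if not 0.01 <= value <= 0.99:
        raise ValueError(f"similarity threshold must be between 0.01 and 0.99, got {value}")
    return value


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


def is_similar_enough(a: list[float], b: list[float], threshold: float) -> bool:
    """True when the two vectors meet or exceed the validated threshold."""
    return cosine_similarity(a, b) >= validate_similarity_threshold(threshold)
```

A higher threshold accepts fewer pairs (favoring precision), while a lower one accepts more (favoring recall), which matches the trade-off described in the changed line.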
@@ -200,33 +200,26 @@ The following workflow settings can be customized:
 - **Off**: Do not generate embeddings.
 - **OpenAI**: Use OpenAI to generate embeddings. Also choose the embedding model to use, from one of the following:

-- **text-embedding-3-small**: [Learn more](https://platform.openai.com/docs/guides/embeddings).
-- **text-embedding-3-large**: [Learn more](https://platform.openai.com/docs/guides/embeddings).
-- **Ada 002 (Text)**: [Learn more](https://platform.openai.com/docs/guides/embeddings).
+- **text-embedding-3-small** (1536 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings).
+- **text-embedding-3-large** (3072 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings).
+- **Ada 002 (Text)** (1536 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings).

-- **Anthropic**: Use Anthropic to generate embeddings. Also choose the embedding model to use, from one of the following:
-
-- **voyage-2**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
-- **voyage-large-2**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
-- **voyage-code-2**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
-- **voyage-lite-02-instruct**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
-
 - **Hugging Face**: Use Hugging Face to generate embeddings. Also choose the embedding model to use, from one of the following:

-- **nvidia/NV-Embed-v1**: [Learn more](https://huggingface.co/nvidia/NV-Embed-v1).
-- **voyage-large-2-instruct**: [Learn more](https://huggingface.co/voyageai/voyage-large-2-instruct).
-- **stella_en_400M_v5**: [Learn more](https://huggingface.co/dunzhang/stella_en_400M_v5).
-- **stella_en_1.5B_v5**: [Learn more](https://huggingface.co/dunzhang/stella_en_1.5B_v5).
-- **Alibaba-NLP/gte-Qwen2-7B-instruct**: [Learn more](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct).
+- **nvidia/NV-Embed-v1** (4096 dimensions): [Learn more](https://huggingface.co/nvidia/NV-Embed-v1).
+- **voyage-large-2-instruct** (1024 dimensions): [Learn more](https://huggingface.co/voyageai/voyage-large-2-instruct).
+- **stella_en_400M_v5** (1024 dimensions): [Learn more](https://huggingface.co/dunzhang/stella_en_400M_v5).
+- **stella_en_1.5B_v5** (1024 dimensions): [Learn more](https://huggingface.co/dunzhang/stella_en_1.5B_v5).
+- **Alibaba-NLP/gte-Qwen2-7B-instruct** (3584 dimensions): [Learn more](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct).

 - **OctoAI**: Use OctoAI to generate embeddings. Also choose the embedding model to use, from one of the following:

-- **GTE Large**: [Learn more](https://octo.ai/blog/introducing-octoais-embedding-api-to-power-your-rag-needs/).
+- **GTE Large** (1024 dimensions): [Learn more](https://octo.ai/blog/introducing-octoais-embedding-api-to-power-your-rag-needs/).

 - **Vertex AI**: Use Vertex AI to generate embeddings. Also choose the embedding model to use, from one of the following:

-- **textembedding-gecko@003**: [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions).
-- **text-embedding-004**: [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions).
+- **textembedding-gecko@003** (768 dimensions): [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions).
+- **text-embedding-004** (768 dimensions): [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions).

 Learn more:

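Reviewer note: the dimension counts added in this hunk are presumably there so that readers can size a destination vector index to match the embedding model. A minimal lookup sketch follows; the dictionary values are copied from the added lines, while the names `EMBEDDING_DIMENSIONS`, `expected_dimension`, and `index_matches_model` are hypothetical helpers, not part of any Unstructured API.

```python
# Model names and dimension counts as listed in the added lines of the diff.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "Ada 002 (Text)": 1536,
    "nvidia/NV-Embed-v1": 4096,
    "voyage-large-2-instruct": 1024,
    "stella_en_400M_v5": 1024,
    "stella_en_1.5B_v5": 1024,
    "Alibaba-NLP/gte-Qwen2-7B-instruct": 3584,
    "GTE Large": 1024,
    "textembedding-gecko@003": 768,
    "text-embedding-004": 768,
}


def expected_dimension(model: str) -> int:
    """Look up the documented vector length for an embedding model."""
    try:
        return EMBEDDING_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"no documented dimension for model {model!r}") from None


def index_matches_model(model: str, index_dimension: int) -> bool:
    """True when a vector index of the given size can store this model's vectors."""
    return expected_dimension(model) == index_dimension
```

For example, an index created with 1536 dimensions would fit the Basic default (text-embedding-3-small) but not the Advanced default (text-embedding-3-large, 3072 dimensions).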
snippets/quickstarts/platform.mdx

Lines changed: 2 additions & 2 deletions
@@ -47,8 +47,8 @@ You will need:

 6. In the **Workflow Settings** section, choose one of these predefined workflow settings groups:

-- **Basic** is a good choice if you have documents that have no images or tables in them.
-- **Advanced** is a good choice if you have documents that have images or tables or both in them.
+- **Basic** is a good choice if you have text-only documents that have no images or tables in them.
+- **Advanced** is a good choice if you have complex documents that have images or tables or both in them.

 Learn about the predefined settings for [Basic](/platform/workflows#basic-workflow-settings) and [Advanced](/platform/workflows#advanced-workflow-settings).
