platform/workflows.mdx
Lines changed: 2 additions & 165 deletions
@@ -21,10 +21,6 @@ The Unstructured Platform provides two types of workflow builders:
21
21
-[Automatic](#create-an-automatic-workflow) workflows, which use sensible default workflow settings to enable you to get good-quality results faster.
22
22
-[Custom](#create-a-custom-worklow) workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results.
23
23
24
-
All Unstructured accounts can create automatic workflows.
25
-
26
-
To create custom workflows, you must request Unstructured to enable your account first. [Learn how](#create-a-custom-worklow).
27
-
28
24
### Create an automatic workflow
29
25
30
26
<Warning>
@@ -151,7 +147,7 @@ To create an automatic workflow:
151
147
</Accordion>
152
148
</AccordionGroup>
153
149
154
-
9. The **Reprocess all** box applies only to the Amazon S3 and Azure Blob Storage source connectors:
150
+
9. The **Reprocess all** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
155
151
156
152
- Checking this box reprocesses all documents in the source location on every workflow run.
157
153
- Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents' contents change.
@@ -172,168 +168,9 @@ To create an automatic workflow:
172
168
To see your existing connectors, on the sidebar, click **Connectors**, and then click **Sources** or **Destinations**.
173
169
</Warning>
174
170
175
-
There are two ways to create a custom workflow:
176
-
177
-
- Through [Build it with me > Custom](#build-it-with-me-custom). This option enables you to fine-tune the kinds of settings that are in **Basic**, **Advanced**, and **Platinum**.
178
-
- Through [Build it myself](#build-it-myself). This option offers a visual workflow designer with even more fine-tuning than the **Custom** option.
179
-
180
-
#### Build it with me - Custom
181
-
182
-
1. On the sidebar, click **Workflows**.
183
-
2. Click **New Workflow**.
184
-
3. Next to **Build it with me**, click **Create Workflow**.
185
-
186
-
<Note>If a radio button appears instead of **Build it with me**, select it, and then click **Continue**.</Note>
187
-
188
-
4. For **Workflow Name**, enter some unique name for this workflow.
189
-
5. In the **Sources** dropdown list, select your source location.
190
-
6. In the **Destinations** dropdown list, select your destination location.
191
-
192
-
<Note>You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.</Note>
193
-
194
-
7. Click **Continue**.
195
-
8. In the **Optimize for** section, click the **Custom** option, and then click **Continue**.
196
-
197
-
<Note>
198
-
If the **Custom** option is disabled, inside the **Custom** option click **Notify me**, and follow the on-screen directions to complete the request.
199
-
Unstructured will notify you when your account has been enabled with the **Custom** option. After you receive this notification, click the
200
-
**Custom** option, and then click **Continue**.
201
-
</Note>
202
-
203
-
9. In the **Strategy** area, choose one of the following:
204
-
205
-
-**Fast**: Ideal for simple, text-only documents.
206
-
-**High Res**: Best for PDFs, images, and complex file types.
207
-
208
-
<Note>
209
-
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
210
-
</Note>
211
-
212
-
-**VLM**: For your most challenging documents, including scanned and handwritten content.
213
-
214
-
You must also choose a VLM provider and model. Available choices include:
215
-
216
-
-**Anthropic**: **Claude 3.5 Sonnet**
217
-
-**OpenAI**: **GPT-4o**
218
-
219
-
<Note>
220
-
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
221
-
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
222
-
223
-
When you use the **VLM** strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when
224
-
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
225
-
</Note>
226
-
227
-
[Learn more](/platform/partitioning).
228
-
229
-
10. In the **Image Summarizer** drop-down list, choose one of the following:
230
-
231
-
-**None**: Do not provide summaries for any detected images in any of the files.
232
-
-**OpenAI GPT-4o Image Description**: Use GPT-4o to provide summaries for any detected images in any of the files. [Learn more](https://openai.com/index/hello-gpt-4o/).
233
-
-**Claude 3.5 Sonnet Image Description**: Use Claude 3.5 Sonnet to provide summaries for any detected images in any of the files. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet).
234
-
235
-
[Learn more](/platform/summarizing).
236
-
237
-
11. In the **Table Summarizer** drop-down list, choose one of the following:
238
-
239
-
-**None**: Do not provide summaries for any detected tables in any of the files.
240
-
-**OpenAI GPT-4o Table Description**: Use GPT-4o to provide summaries for any detected tables in any of the files. [Learn more](https://openai.com/index/hello-gpt-4o/).
241
-
-**OpenAI GPT-4o Table to HTML**: Use GPT-4o to convert any detected tables to HTML format. [Learn more](https://openai.com/index/hello-gpt-4o/).
242
-
-**Claude 3.5 Sonnet Table Description**: Use Claude 3.5 Sonnet to provide summaries for any detected tables in any of the files. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet).
243
-
244
-
[Learn more](/platform/summarizing).
245
-
246
-
12. Check the **Include Page Breaks** box to include page breaks in the output, if the file type supports it.
247
-
13. Check the **Infer Table Structure** box to extract any detected table elements in PDF files as HTML format into a `metadata` output field named `text_as_html`.
248
-
249
-
14. In the **Elements to Exclude** drop-down list, select any element types to exclude from the output.
250
-
15. In the **Chunk** area, for **Chunker Type**, select one of the following:
251
-
252
-
-**None**: Do not apply special chunking rules to the output.
253
-
-**Chunk by Character** (also known as _basic_ chunking): Combine sequential elements to maximally fill each chunk. Also, specify the following:
254
-
255
-
-**Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
256
-
-**Max Characters**: Cut off new sections after reaching a length of this many characters. The default is **2048**.
257
-
-**New After n chars**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **1500**.
258
-
-**Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **160**.
259
-
-**Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked.
260
-
261
-
-**Chunk by Page**: Preserve page boundaries. When a new page is detected, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
262
-
263
-
-**Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
264
-
-**Max Characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **500**.
265
-
-**New After n Characters**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **50**.
266
-
-**Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **30**.
267
-
-**Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked.
268
-
269
-
-**Chunk by Title**: Preserve section boundaries and optionally page boundaries as well. A single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
270
-
271
-
-**Combine Text Under n Chars**: Combine elements until a section reaches a length of this many characters. The default is **0**.
272
-
-**Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
273
-
-**Max Characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **2048**.
274
-
-**Multipage Sections**: Check this box to allow sections to span multiple pages. By default, this box is unchecked.
275
-
-**New After n Characters**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **1500**.
276
-
-**Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **160**.
277
-
-**Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked.
278
-
279
-
-**Chunk by Similarity**: Use the [sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model to identify topically similar sequential elements and combine them into chunks. Also, specify the following:
280
-
281
-
-**Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
282
-
-**Max Characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **500**.
283
-
-**Similarity Threshold**: Specify a threshold between 0 and 1 exclusive (0.01 to 0.99 inclusive), where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consideration the trade-offs between precision (a higher threshold) and recall (a lower threshold). The default is **0.5**. [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061).
284
-
285
-
Learn more:
286
-
287
-
-[Chunking overview](/platform/chunking)
288
-
-[Chunking for RAG: best practices](https://unstructured.io/blog/chunking-for-rag-best-practices)
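
The interaction between **Max Characters**, **New After n Characters**, and **Overlap** described above can be easier to follow in code. The following is a minimal Python sketch of character-based chunking under simplifying assumptions; it is not Unstructured's implementation. The function name `chunk_by_character`, its parameter names, and the `"\n\n"` separator are hypothetical and chosen only for illustration.

```python
def chunk_by_character(elements, max_characters=2048,
                       new_after_n_chars=1500, overlap=160):
    """Illustrative sketch only (hypothetical helper, not the platform's code).

    Packs sequential element texts into chunks:
    - max_characters is a hard limit on chunk length.
    - new_after_n_chars is a soft limit: a chunk is closed once it reaches it.
    - overlap applies only to pieces produced by splitting an oversized element.
    """
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")

    chunks = []
    current = ""
    for text in elements:
        if len(text) > max_characters:
            # Oversized element: close the open chunk, then text-split
            # the element with a sliding window.
            if current:
                chunks.append(current)
                current = ""
            start = 0
            step = max_characters - overlap
            while True:
                chunks.append(text[start:start + max_characters])
                if start + max_characters >= len(text):
                    break
                # The next piece begins with the last `overlap` characters
                # of the piece just emitted.
                start += step
            continue

        candidate = text if not current else current + "\n\n" + text
        if current and len(candidate) > max_characters:
            # Adding this element would break the hard limit.
            chunks.append(current)
            current = text
        else:
            current = candidate

        if len(current) >= new_after_n_chars:
            # Soft limit reached: close the chunk early.
            chunks.append(current)
            current = ""

    if current:
        chunks.append(current)
    return chunks
```

In this sketch, a smaller `new_after_n_chars` produces more, shorter chunks, while `max_characters` is never exceeded. The real chunker may differ in how it joins elements and where it splits oversized text.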
289
-
290
-
16. In the **Embed** area, for **Provider**, choose one of the following:
291
-
292
-
-**None**: Do not generate embeddings.
293
-
-**Azure OpenAI**: Use Azure OpenAI to generate embeddings. Also, choose the model to use:
294
-
295
-
-**text-embedding-3-small**, with 1536 dimensions.
296
-
-**text-embedding-3-large**, with 3072 dimensions.
297
-
-**Ada 002 (Text)** (`text-embedding-ada-002`), with 1536 dimensions.
-[Understanding embedding models: make an informed choice for your RAG](https://unstructured.io/blog/understanding-embedding-models-make-an-informed-choice-for-your-rag).
313
-
314
-
17. The **Reprocess all** box applies only to the Amazon S3 and Azure Blob Storage source connectors:
315
-
316
-
- Checking this box reprocesses all documents in the source location on every workflow run.
317
-
- Unchecking this box causes new documents that have been added to the source location, as well as existing documents in the source location that have had their contents or titles changed, since the last workflow run to be processed on future runs. Other previously processed documents are not processed again.
318
-
319
-
18. Check the **Retry Failed Documents** box if you want to retry processing any documents that failed to process.
320
-
19. Click **Continue**.
321
-
20. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
322
-
21. Click **Complete**.
323
-
22. If you did not set the workflow to run on a schedule, you can [run the workflow](#edit-delete-or-run-a-workflow) now.
324
-
325
-
#### Build it myself
326
-
327
171
1. On the sidebar, click **Workflows**.
328
172
2. Click **New Workflow**.
329
173
3. Click the **Build it myself** option, and then click **Continue**.
330
-
331
-
<Note>
332
-
If the **Build it myself** option is disabled, inside the **Build it myself** option click **Notify me**, and follow the on-screen directions to complete the request.
333
-
Unstructured will notify you when your account has been enabled with the **Build it myself** option. After you receive this notification, click the
334
-
**Build it myself** option, and then click **Continue**.
335
-
</Note>
336
-
337
174
4. In the **This workflow** pane, click the **Details** button.
@@ -342,7 +179,7 @@ There are two ways to create a custom workflow:
342
179
6. If you want this workflow to run on a schedule, click the **Schedule** button. In the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings.
343
180
7. To overwrite any previously processed files, or to retry any documents that fail to process, click the **Settings** button, and check either or both of the boxes.
344
181
345
-
The **Reprocess all** box applies only to the Amazon S3 and Azure Blob Storage source connectors:
182
+
The **Reprocess all** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
346
183
347
184
- Checking this box reprocesses all documents in the source location on every workflow run.
348
185
- Unchecking this box causes new documents that have been added to the source location, as well as existing documents in the source location that have had their contents or titles changed, since the last workflow run to be processed on future runs. Other previously processed documents are not processed again.
snippets/quickstarts/platform.mdx
Lines changed: 1 addition & 1 deletion
@@ -104,7 +104,7 @@ allowfullscreen
104
104
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
105
105
</Note>
106
106
107
-
9. The **Reprocess all** box applies only to the Amazon S3 and Azure Blob Storage source connectors:
107
+
9. The **Reprocess all** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
108
108
109
109
- Checking this box reprocesses all documents in the source location on every workflow run.
110
110
- Unchecking this box causes new documents that have been added to the source location, as well as existing documents in the source location that have had their contents or titles changed, since the last workflow run to be processed on future runs. Other previously processed documents are not processed again.
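
The **Reprocess all** behavior described above amounts to a selection rule over the documents in the source location. The following Python sketch shows one way such a rule could be expressed. It is an assumption-laden illustration, not how the platform tracks documents: the `SourceDoc` type, the `select_for_run` function, and the idea of remembering a title and content hash per path are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDoc:
    """Hypothetical record for a document in a blob storage source."""
    path: str
    title: str
    content_hash: str


def select_for_run(source_docs, last_seen, reprocess_all):
    """Return the documents a workflow run would process (sketch only).

    last_seen maps a document path to the (title, content_hash) pair
    recorded on the previous run; this bookkeeping is an assumption.
    """
    if reprocess_all:
        # Box checked: every document is processed on every run.
        return list(source_docs)

    selected = []
    for doc in source_docs:
        previous = last_seen.get(doc.path)
        is_new = previous is None
        is_changed = previous != (doc.title, doc.content_hash)
        if is_new or is_changed:
            # Box unchecked: only new documents, or documents whose
            # contents or titles changed, are processed again.
            selected.append(doc)
    return selected
```

With the box unchecked, a document that is unchanged since the last run is skipped, which is the case this sketch models with the `last_seen` lookup.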