Commit 257b8bf

[Hold] Platform: Remove Custom option from 'build it with me' workflow type selector (#449)
1 parent 7532a09 commit 257b8bf

2 files changed, with 3 additions and 166 deletions

platform/workflows.mdx

Lines changed: 2 additions & 165 deletions
@@ -21,10 +21,6 @@ The Unstructured Platform provides two types of workflow builders:
 - [Automatic](#create-an-automatic-workflow) workflows, which use sensible default workflow settings to enable you to get good-quality results faster.
 - [Custom](#create-a-custom-worklow) workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results.
 
-All Unstructured accounts can create automatic worklows.
-
-To create custom workflows, you must request Unstructured to enable your account first. [Learn how](#create-a-custom-worklow).
-
 ### Create an automatic workflow
 
 <Warning>
@@ -151,7 +147,7 @@ To create an automatic workflow:
 </Accordion>
 </AccordionGroup>
 
-9. The **Reprocess all** box applies only to the Amazon S3 and Azure Blob Storage source connectors:
+9. The **Reprocess all** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
 
 - Checking this box reprocesses all documents in the source location on every workflow run.
 - Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents' contents change.
@@ -172,168 +168,9 @@ To create an automatic workflow:
 To see your existing connectors, on the sidebar, click **Connectors**, and then click **Sources** or **Destinations**.
 </Warning>
 
-There are two ways to create a custom workflow:
-
-- Through [Build it with me > Custom](#build-it-with-me-custom). This option enables you to fine-tune the kinds of settings that are in **Basic**, **Advanced**, and **Platinum**.
-- Through [Build it myself](#build-it-myself). This option offers a visual workflow designer with even more fine-tuning than the **Custom** option.
-
-#### Build it with me - Custom
-
-1. On the sidebar, click **Workflows**.
-2. Click **New Workflow**.
-3. Next to **Build it with me**, click **Create Workflow**.
-
-<Note>If a radio button appears instead of **Build it with me**, select it, and then click **Continue**.</Note>
-
-4. For **Workflow Name**, enter some unique name for this workflow.
-5. In the **Sources** dropdown list, select your source location.
-6. In the **Destinations** dropdown list, select your destination location.
-
-<Note>You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.</Note>
-
-7. Click **Continue**.
-8. In the **Optimize for** section, click the **Custom** option, and then click **Continue**.
-
-<Note>
-If the **Custom** option is disabled, inside the **Custom** option click **Notify me**, and follow the on-screen directions to complete the request.
-Unstructured will notify you when your account has been enabled with the **Custom** option. After you receive this notification, click the
-**Custom** option, and then click **Continue**.
-</Note>
-
-9. In the **Strategy** area, choose one of the following:
-
-- **Fast**: Ideal for simple, text-only documents.
-- **High Res**: Best for PDFs, images, and complex file types.
-
-<Note>
-During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
-</Note>
-
-- **VLM**: For your most challenging documents, including scanned and handwritten content.
-
-You must also choose a VLM provider and model. Available choices include:
-
-- **Anthropic**: **Claude 3.5 Sonnet**
-- **OpenAI**: **GPT-4o**
-
-<Note>
-During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
-Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
-
-When you use the **VLM** strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when
-these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
-</Note>
-
-[Learn more](/platform/partitioning).
-
-10. In the **Image Summzarizer** drop-down list, choose one of the following:
-
-- **None**: Do not provide summaries for any detected images in any of the files.
-- **OpenAI GPT-4o Image Description**: Use GPT-4o to provide summaries for any detected images in any of the files. [Learn more](https://openai.com/index/hello-gpt-4o/).
-- **Claude 3.5 Sonnet Image Description**: Use Claude 3.5 Sonnet to provide summaries for any detected images in any of the files. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet).
-
-[Learn more](/platform/summarizing).
-
-11. In the **Table Summzarizer** drop-down list, choose one of the following:
-
-- **None**: Do not provide summaries for any detected tables in any of the files.
-- **OpenAI GPT-4o Table Description**: Use GPT-4o to provide summaries for any detected tables in any of the files. [Learn more](https://openai.com/index/hello-gpt-4o/).
-- **OpenAI GPT-4o Table to HTML**: Use GPT-4o to convert any detected tables to HTML format. [Learn more](https://openai.com/index/hello-gpt-4o/).
-- **Claude 3.5 Sonnet Table Description**: Use Claude 3.5 Sonnet to provide summaries for any detected tables in any of the files. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet).
-
-[Learn more](/platform/summarizing).
-
-12. Check the **Include Page Breaks** box to include page breaks in the output, if the file type support it.
-13. Check the **Infer Table Structure** box to extract any detected table elements in PDF files as HTML format into a `metadata` output field named `text_as_html`.
-
-14. In the **Elements to Exclude** drop-down list, select any element types to exclude from the output.
-15. In the **Chunk** area, for **Chunker Type**, select one of the following:
-
-- **None**: Do not apply special chunking rules to the output.
-- **Chunk by Character** (also known as _basic_ chunking): Combine sequential elements to maximally fill each chunk. Also, specify the following:
-
-- **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
-- **Max Characters**: Cut off new sections after reaching a length of this many characters. The default is **2048**.
-- **New After n chars**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **1500**.
-- **Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **160**.
-- **Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked.
-
-- **Chunk by Page**: Preserve page boundaries. When a new page is detected, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
-
-- **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
-- **Max Characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **500**.
-- **New After n Characters**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **50**.
-- **Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **30**.
-- **Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked.
-
-- **Chunk by Title**: Preserve section boundaries and optionally page boundaries as well. A single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
-
-- **Combine Text Under n Chars**: Combine elements until a section reaches a length of this many characters. The default is **0**.
-- **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
-- **Max Characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **2048**.
-- **Multipage Sections**: Check this box to allow sections to span multiple pages. By default, this box is unchecked.
-- **New After n Characters**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **1500*.
-- **Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **160**.
-- **Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked.
-
-- **Chunk by Similarity**: Use the [sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model to identify topically similar sequential elements and combine them into chunks. Also, specify the following:
-
-- **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked.
-- **Max Characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **500**.
-- **Similarity Threshold**: Specify a threshold between 0 and 1 exclusive (0.01 to 0.99 inclusive), where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consider the trade-offs between precision (a higher threshold) and recall (a lower threshold). The default is **0.5**. [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061).
-
-Learn more:
-
-- [Chunking overview](/platform/chunking)
-- [Chunking for RAG: best practices](https://unstructured.io/blog/chunking-for-rag-best-practices)
-
-16. In the **Embed** area, for **Provider**, choose one of the following:
-
-- **None**: Do not generate embeddings.
-- **Azure OpenAI**: Use Azure OpenAI to generate embeddings. Also, choose the model to use:
-
-- **text-embedding-3-small**, with 1536 dimensions.
-- **text-embedding-3-large**, with 3072 dimensions.
-- **Ada 002 (Text)** (`text-embedding-ada-002`), with 1536 dimensions.
-
-[Learn more](https://learn.microsoft.com/azure/ai-services/openai/concepts/models#embeddings).
-
-- **TogetherAI**: Use TogetherAI to generate embeddings. Also, choose the model to use:
-
-- **M2-BERT-80M-2K-Retrieval**, with 768 dimensions.
-- **M2-BERT-80M-8K-Retrieval**, with 768 dimensions.
-- **M2-BERT-80M-32K-Retrieval**, with 768 dimensions.
-
-[Learn more](https://docs.together.ai/docs/serverless-models#embedding-models).
-
-Learn more:
-
-- [Embedding overview](/platform/embedding)
-- [Understanding embedding models: make an informed choice for your RAG](https://unstructured.io/blog/understanding-embedding-models-make-an-informed-choice-for-your-rag).
-
-17. The **Reprocess all** box applies only to the Amazon S3 and Azure Blob Storage source connectors:
-
-- Checking this box reprocesses all documents in the source location on every workflow run.
-- Unchecking this box causes new documents that have been added to the source location, as well as existing documents in the source location that have had their contents or titles changed, since the last workflow run to be processed on future runs. Other previously processed documents are not processed again.
-
-18. Check the **Retry Failed Documents** box if you want to retry processing any documents that failed to process.
-19. Click **Continue**.
-20. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
-21. Click **Complete**.
-22. If you did not set the workflow to run on a schedule, you can [run the worklow](#edit-delete-or-run-a-workflow) now.
-
-#### Build it myself
-
 1. On the sidebar, click **Workflows**.
 2. Click **New Workflow**.
 3. Click the **Build it myself** option, and then click **Continue**.
-
-<Note>
-If the **Build it myself** option is disabled, inside the **Build it myself** option click **Notify me**, and follow the on-screen directions to complete the request.
-Unstructured will notify you when your account has been enabled with the **Build it myself** option. After you receive this notification, click the
-**Build it myself** option, and then click **Continue**.
-</Note>
-
 4. In the **This workflow** pane, click the **Details** button.
 
 ![Workflow details](/img/platform/Workflow-Details.png)
@@ -342,7 +179,7 @@ There are two ways to create a custom workflow:
 6. If you want this workflow to run on a schedule, click the **Schedule** button. In the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings.
 7. To overwrite any previously processed files, or to retry any documents that fail to process, click the **Settings** button, and check either or both of the boxes.
 
-The **Reprocess all** box applies only to the Amazon S3 and Azure Blob Storage source connectors:
+The **Reprocess all** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
 
 - Checking this box reprocesses all documents in the source location on every workflow run.
 - Unchecking this box causes new documents that have been added to the source location, as well as existing documents in the source location that have had their contents or titles changed, since the last workflow run to be processed on future runs. Other previously processed documents are not processed again.
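
Reviewer note on the chunking settings removed from platform/workflows.mdx in this commit: the **Max Characters** and **Overlap** descriptions can be illustrated with a minimal sketch of text-splitting one oversized element. This is only an illustration of the documented idea (a hard size limit, with each later piece prefixed by trailing characters of the prior piece). The function name `split_with_overlap` and its parameters are hypothetical, and this is not Unstructured's implementation, which may split on different boundaries.

```python
def split_with_overlap(text: str, max_characters: int, overlap: int) -> list[str]:
    """Split one oversized text into pieces of at most max_characters.

    Hypothetical helper for illustration only. Every piece after the first
    starts with the last `overlap` characters of the previous piece.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    if len(text) <= max_characters:
        # Nothing to split: the element already fits in one chunk.
        return [text]

    pieces = []
    start = 0
    step = max_characters - overlap  # how far the window advances each time
    while start < len(text):
        pieces.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this piece reached the end of the text
        start += step
    return pieces


if __name__ == "__main__":
    for piece in split_with_overlap("abcdefghij", max_characters=4, overlap=1):
        print(piece)
```

With a small overlap, consecutive pieces share a short run of characters, which is the trade-off the removed **Overlap all** text warns about when it is applied to chunks that were already clean units.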

snippets/quickstarts/platform.mdx

Lines changed: 1 addition & 1 deletion
@@ -104,7 +104,7 @@ allowfullscreen
 these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
 </Note>
 
-9. The **Reprocess all** box applies only to the Amazon S3 and Azure Blob Storage source connectors:
+9. The **Reprocess all** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
 
 - Checking this box reprocesses all documents in the source location on every workflow run.
 - Unchecking this box causes new documents that have been added to the source location, as well as existing documents in the source location that have had their contents or titles changed, since the last workflow run to be processed on future runs. Other previously processed documents are not processed again.
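
Reviewer note: the **Reprocess all** behavior described in the two bullets above can be summarized with a small selection sketch. It assumes each document is tracked by a fingerprint that changes when its contents or title change. The function `select_documents`, the fingerprint values, and the dictionaries are hypothetical names for illustration and are not part of any Unstructured API.

```python
def select_documents(
    source: dict[str, str],
    processed: dict[str, str],
    reprocess_all: bool,
) -> list[str]:
    """Return the names of documents to process on the next run.

    Hypothetical model: `source` maps each document name in the source
    location to its current fingerprint, and `processed` maps each name to
    the fingerprint recorded on the last workflow run.
    """
    if reprocess_all:
        # Box checked: every document in the source location is processed.
        return sorted(source)
    # Box unchecked: only new documents (no recorded fingerprint) and
    # changed documents (fingerprint differs) are processed.
    return sorted(
        name
        for name, fingerprint in source.items()
        if processed.get(name) != fingerprint
    )


if __name__ == "__main__":
    source = {"a.pdf": "v1", "b.pdf": "v2", "c.pdf": "v1"}
    processed = {"a.pdf": "v1", "b.pdf": "v1"}
    print(select_documents(source, processed, reprocess_all=False))
    print(select_documents(source, processed, reprocess_all=True))
```

Note that the automatic workflow hunk in platform/workflows.mdx words the unchecked case differently (only new documents are processed, even if existing contents change), so this sketch models the wording used in this file and in the **Build it myself** steps only.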
