Commit c14621b

Platform: Dynamic pipelines v1 (#475)
1 parent 65ec592 commit c14621b

File tree

5 files changed

+84
-120
lines changed


img/platform/Job-Complete.png

-15.9 KB

platform/overview.mdx

Lines changed: 12 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -36,20 +36,26 @@ flowchart LR
3636
<Step title="Route">
3737
Routing determines which strategy Unstructured Platform uses to transform your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation:
3838

39-
- **Basic** / **Fast** is ideal for simple, text-only documents.
40-
- **Advanced** / **High Res** is best for PDFs, images, and complex file types.
39+
- **Fast** is ideal for simple, text-only documents.
40+
- **High Res** is best for PDFs, images, and complex file types.
4141

4242
<Note>
43-
During **Advanced** / **High Res** processing, any detected text-based files are processed and billed at the **Basic** / **Fast** rate instead.
43+
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
4444
</Note>
4545

46-
- **Platinum** / **VLM** is for challenging documents, including scanned and handwritten content.
46+
- **VLM** is for challenging documents, including scanned and handwritten content.
4747

4848
<Note>
49-
During **Platinum** / **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** / **High Res** or **Basic** / **Fast** rate instead.
50-
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** / **Fast** rate instead. The other files are processed and billed at the **Advanced** / **High Res** rate instead.
49+
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
50+
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
5151
</Note>
5252

53+
- **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
54+
55+
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
56+
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
57+
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
58+
5359
</Step>
5460
<Step title="Transform">
5561
Your source document is transformed into Unstructured's canonical JSON schema. Regardless of the input document, this JSON schema gives you a [standardized output](/platform/document-elements). It contains more than 20 elements, such as `Header`, `Footer`, `Title`, `NarrativeText`, `Table`, `Image`, and many more. Each document is wrapped in extensive metadata so you can understand languages, file types, sources, hierarchies, and much more.

platform/partitioning.mdx

Lines changed: 6 additions & 0 deletions
@@ -36,3 +36,9 @@ To choose one of these strategies, select one of the **Partition Strategy** opti
3636
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
3737
</Note>
3838

39+
- **Auto**: Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
40+
41+
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
42+
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
43+
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
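The **Auto** routing rules above can be sketched as a small decision function. This is only an illustration of the three documented cases, not Unstructured's implementation: the `PageProfile` type, the `choose_strategy` function, and the `FEW` threshold are hypothetical, because the documentation says "a few" without giving a number. Sending non-standard layouts to **VLM** is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    FAST = "Fast"
    HIGH_RES = "High Res"
    VLM = "VLM"


@dataclass
class PageProfile:
    """Hypothetical summary of one PDF page, or of one non-PDF document."""
    image_count: int
    table_count: int
    standard_layout: bool = True  # standard layouts and languages


# Assumption: the docs do not define "a few"; 3 is a placeholder.
FEW = 3


def choose_strategy(profile: PageProfile) -> Strategy:
    """Pick the partitioning strategy (and billing rate) for one page or document."""
    complex_items = profile.image_count + profile.table_count
    if complex_items == 0:
        # No images and likely no tables.
        return Strategy.FAST
    if complex_items <= FEW and profile.standard_layout:
        # Only a few tables or images, with standard layouts and languages.
        return Strategy.HIGH_RES
    # More than a few tables or images (or, by assumption, a non-standard layout).
    return Strategy.VLM


print(choose_strategy(PageProfile(image_count=0, table_count=0)).value)
print(choose_strategy(PageProfile(image_count=1, table_count=1)).value)
print(choose_strategy(PageProfile(image_count=4, table_count=6)).value)
```

Because the decision is made per page for PDF files, a single PDF can be billed at a mix of rates under this scheme.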
44+

platform/workflows.mdx

Lines changed: 59 additions & 95 deletions
@@ -14,12 +14,10 @@ Workflows are crucial for establishing a systematic approach to managing data fl
1414

1515
## Create a workflow
1616

17-
![Choose a workflow type](/img/platform/Choose-Workflow-Type.png)
18-
1917
The Unstructured Platform provides two types of workflow builders:
2018

21-
- [Automatic](#create-an-automatic-workflow) workflows, which use sensible default workflow settings to enable you to get good-quality results faster.
22-
- [Custom](#create-a-custom-worklow) workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results.
19+
- [Automatic](#create-an-automatic-workflow) or **Build it For Me** workflows, which use sensible default workflow settings to enable you to get good-quality results faster.
20+
- [Custom](#create-a-custom-workflow) or **Build it Myself** workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results.
2321

2422
### Create an automatic workflow
2523

@@ -35,9 +33,9 @@ To create an automatic workflow:
3533

3634
1. On the sidebar, click **Workflows**.
3735
2. Click **New Workflow**.
38-
3. Next to **Build it with Me**, click **Create Workflow**.
36+
3. Next to **Build it for Me**, click **Create Workflow**.
3937

40-
<Note>If a radio button appears instead of **Build it with Me**, select it, and then click **Continue**.</Note>
38+
<Note>If a radio button appears instead of **Build it for Me**, select it, and then click **Continue**.</Note>
4139

4240
4. For **Workflow Name**, enter some unique name for this workflow.
4341
5. In the **Sources** dropdown list, select your source location.
@@ -46,118 +44,78 @@ To create an automatic workflow:
4644
<Note>You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.</Note>
4745

4846
7. Click **Continue**.
49-
8. In the **Optimize for** section, select the option to choose one of these preconfigured workflow settings groups. Expand any or all
50-
of the following options to learn more about these preconfigured settings:
51-
52-
<AccordionGroup>
53-
<Accordion title="Basic">
54-
This option is ideal for simple, text-only documents.
55-
56-
The **Basic** option uses the following preconfigured workflow settings:
57-
58-
- **Strategy**: Fast
59-
- **Image Summarizer**: None
60-
- **Table Summarizer**: None
61-
- **Include Page Breaks**: No
62-
- **Infer Table Structure**: No
63-
- **Elements to Exclude**: None
64-
- **Chunk**:
47+
8. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
6548

66-
- **Chunker Type**: Chunk By Character
67-
- **Chunk Options**:
68-
69-
- **Include Original Elements**: No
70-
- **Max Characters**: 2048
71-
- **New After N Characters**: 1500
72-
- **Overlap**: 160
73-
- **Overlap All**: No
74-
75-
- **Embed**:
49+
- Checking this box reprocesses all documents in the source location on every workflow run.
50+
- Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents' contents change.
7651

77-
- **Provider**: Azure OpenAI
78-
- **Model**: text-embedding-3-large (3072 dimensions)
52+
9. Click **Continue**.
53+
10. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
54+
11. Click **Complete**.
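The **Reprocess All** behavior described in step 8 can be sketched as a simple selection rule. This is a minimal illustration that follows the wording in `platform/workflows.mdx`; the `select_documents` function and the idea of tracking processed documents by name are assumptions, not the platform's actual mechanism.

```python
from typing import Iterable, List, Set


def select_documents(
    source_docs: Iterable[str],
    already_processed: Set[str],
    reprocess_all: bool,
) -> List[str]:
    """Return the documents that a workflow run would process."""
    if reprocess_all:
        # Box checked: every document in the source location is processed on every run.
        return list(source_docs)
    # Box unchecked: only documents not seen in an earlier run are processed.
    # Documents are tracked by name here (an assumption), so a changed file
    # that keeps its name is not processed again.
    return [doc for doc in source_docs if doc not in already_processed]


docs = ["report.pdf", "notes.txt", "scan.png"]
seen = {"report.pdf"}
print(select_documents(docs, seen, reprocess_all=False))
print(select_documents(docs, seen, reprocess_all=True))
```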
7955

80-
</Accordion>
81-
<Accordion title="Advanced">
82-
This option is best for PDFs, images, and complex file types.
56+
By default, this workflow partitions, chunks, and generates embeddings as follows:
8357

84-
<Note>
85-
During **Advanced** processing, any detected text-based files are processed and billed at the **Basic** rate instead.
86-
</Note>
58+
- **Partitioner**: **Auto** strategy
8759

88-
The **Advanced** option uses the following preconfigured workflow settings:
60+
Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
8961

90-
- **Strategy**: High-Res
91-
- **Image Summarizer**: GPT-4o
92-
- **Table Summarizer**: Claude 3.5 Sonnet
93-
- **Include Page Breaks**: No
94-
- **Infer Table Structure**: No
95-
- **Elements to Exclude**: None
96-
- **Chunk**:
62+
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
63+
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
64+
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
9765

98-
- **Chunker Type**: Chunk By Title
99-
- **Chunk Options**:
66+
[Learn about partitioning strategies](/platform/partitioning).
10067

101-
- **Combine Text Under N Characters**: 0
102-
- **Include Original Elements**: No
103-
- **Max Characters**: 2048
104-
- **New After N Characters**: 1500
105-
- **Overlap**: 160
106-
- **Overlap All**: No
68+
- **Chunker**: **Chunk by Title** strategy
10769

108-
- **Embed**:
70+
- **Contextual Chunking**: No (unchecked)
71+
- **Combine Text Under N Characters**: 3000
72+
- **Include Original Elements**: Yes (checked)
73+
- **Max Characters**: 5500
74+
- **Multipage Sections**: Yes (checked)
75+
- **New After N Characters**: 3500
76+
- **Overlap**: 350
77+
- **Overlap All**: Yes (checked)
10978

110-
- **Provider**: Azure OpenAI
111-
- **Model**: text-embedding-3-large (3072 dimensions)
79+
[Learn about chunking strategies](/platform/chunking).
11280

113-
</Accordion>
114-
<Accordion title="Platinum">
115-
This option is for your most challenging documents, including scanned and handwritten content.
81+
- **Embedder**:
11682

117-
<Note>
118-
During **Platinum** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** or **Basic** rate instead.
119-
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** rate instead. The other files are processed and billed at the **Advanced** rate instead.
120-
121-
When you use the **Platinum** strategy for PDF files of 200 or more pages, you might notice some errors when
122-
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
123-
</Note>
83+
- **Provider**: Azure OpenAI
84+
- **Model**: text-embedding-3-large, with 3072 dimensions
12485

125-
The **Platinum** option uses the following preconfigured workflow settings:
86+
[Learn about embedding providers and models](/platform/embedding).
12687

127-
- **Strategy**: VLM
128-
- **VLM Provider, Model**: Anthropic, Anthropic Claude 3.5 Sonnet
129-
- **Chunk**:
88+
- **Enrichments**:
13089

131-
- **Chunker Type**: Chunk By Title
132-
- **Chunk Options**:
90+
This workflow contains no enrichments.
13391

134-
- **Combine Text Under N Characters**: 0
135-
- **Include Original Elements**: No
136-
- **Max Characters**: 2048
137-
= **Multipage Sections**: No
138-
- **New After N Characters**: 1500
139-
- **Overlap**: 160
140-
- **Overlap All**: No
92+
[Learn about available enrichments](/platform/enriching/overview).
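For reference, the default **Chunk by Title** values listed above can be captured in a small data structure. This is a sketch for readers who want to compare these defaults against their own settings; the class and field names are hypothetical and are not the platform's API, and the consistency check reflects an assumed ordering rather than a documented rule.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkByTitleDefaults:
    """Illustrative container; values come from the defaults listed above."""
    contextual_chunking: bool = False
    combine_text_under_n_characters: int = 3000
    include_original_elements: bool = True
    max_characters: int = 5500
    multipage_sections: bool = True
    new_after_n_characters: int = 3500
    overlap: int = 350
    overlap_all: bool = True


def is_consistent(cfg: ChunkByTitleDefaults) -> bool:
    """Assumed sanity check: soft limits should not exceed the hard maximum."""
    return (
        cfg.combine_text_under_n_characters
        <= cfg.new_after_n_characters
        <= cfg.max_characters
        and 0 <= cfg.overlap < cfg.max_characters
    )


defaults = ChunkByTitleDefaults()
print(asdict(defaults))
print(is_consistent(defaults))
```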
14193

142-
- **Embed**:
94+
After this workflow is created, you can change any or all of its settings if you want to. This includes the workflow's
95+
source connector, destination connector, partitioning, chunking, and embedding settings. You can also add enrichments
96+
to the workflow if you want to.
14397

144-
- **Provider**: Azure OpenAI
145-
- **Model**: text-embedding-3-large (3072 dimensions)
98+
To change the workflow's default settings or to add enrichments:
14699

147-
</Accordion>
148-
</AccordionGroup>
100+
1. On the sidebar, click **Workflows**.
101+
2. In the list of available workflows, click the workflow that was just created. This opens a visual designer that shows
102+
your workflow as a directed acyclic graph (DAG). This DAG contains a node representing each step in the workflow.
103+
There is one node for the partitioning step, another node for the chunking step, and so on.
104+
3. To learn how to change a node's settings or to add enrichment nodes, click the **FAQ** button in the flyout pane in
105+
the workflow DAG designer.
149106

150-
9. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
107+
If you did not previously set the workflow to run on a schedule, you can [run the workflow](#edit-delete-or-run-a-workflow) now.
151108

152-
- Checking this box reprocesses all documents in the source location on every workflow run.
153-
- Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents' contents change.
109+
### Create a custom workflow
154110

155-
10. Click **Continue**.
156-
11. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
157-
12. Click **Complete**.
158-
13. If you did not set the workflow to run on a schedule, you can [run the worklow](#edit-delete-or-run-a-workflow) now.
111+
<Tip>
112+
If you already have an existing workflow that you want to change, do the following:
113+
114+
1. On the sidebar, click **Workflows**.
115+
2. Click the name of the workflow that you want to change.
116+
3. Skip ahead to Step 11 in the following procedure.
159117

160-
### Create a custom workflow
118+
</Tip>
161119

162120
<Warning>
163121
You must first have an existing source connector and destination connector to add to the workflow.
@@ -281,6 +239,12 @@ To create an automatic workflow:
281239
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
282240
</Note>
283241

242+
- **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
243+
244+
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
245+
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
246+
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
247+
284248
[Learn more](/platform/partitioning).
285249
</Accordion>
286250
<Accordion title="Chunker node">

snippets/quickstarts/platform.mdx

Lines changed: 7 additions & 19 deletions
@@ -94,9 +94,9 @@ allowfullscreen
9494
![Workflows in the sidebar](/img/platform/Workflows-Sidebar.png)
9595
1. In the sidebar, click **Workflows**.
9696
2. Click **New Workflow**.
97-
3. Next to **Build it with Me**, click **Create Workflow**.
97+
3. Next to **Build it for Me**, click **Create Workflow**.
9898

99-
<Note>If a radio button appears instead of **Build it with Me**, select it, and then click **Continue**.</Note>
99+
<Note>If a radio button appears instead of **Build it for Me**, select it, and then click **Continue**.</Note>
100100

101101
4. For **Workflow Name**, enter some unique name for this workflow.
102102
5. In the **Sources** dropdown list, select your source location from Step 3.
@@ -105,26 +105,14 @@ allowfullscreen
105105
<Note>You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.</Note>
106106

107107
7. Click **Continue**.
108-
8. In the **Optimize for** section, select the option to choose one of these preconfigured workflow settings groups:
109-
110-
- **Basic**: Ideal for simple, text-only documents.
111-
- **Advanced**: Best for PDFs, images, and complex file types.
112-
- **Platinum**: For your most challenging documents, including scanned and handwritten content. It uses vision language models (VLMs).
113-
During processing, files that are not PDFs or images are processed by using the **Advanced** strategy and are charged at the **Advanced** rate instead.
114-
115-
<Note>
116-
When you use the **Platinum** strategy for PDF files of 200 or more pages, you might notice some errors when
117-
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
118-
</Note>
119-
120-
9. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
108+
8. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
121109

122110
- Checking this box reprocesses all documents in the source location on every workflow run.
123111
- Unchecking this box causes new documents that have been added to the source location, as well as existing documents in the source location that have had their contents or titles changed, since the last workflow run to be processed on future runs. Other previously processed documents are not processed again.
124-
125-
10. Click **Continue**.
126-
11. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
127-
12. Click **Complete**.
112+
113+
9. Click **Continue**.
114+
10. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
115+
11. Click **Complete**.
128116
</Step>
129117
<Step title="Process the documents">
130118
![Workflows in the sidebar](/img/platform/Workflows-Sidebar.png)
