platform/overview.mdx
Lines changed: 12 additions & 6 deletions
@@ -36,20 +36,26 @@ flowchart LR
  <Step title="Route">
  Routing determines which strategy the Unstructured Platform uses to transform your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation:

- - **Basic** / **Fast** is ideal for simple, text-only documents.
- - **Advanced** / **High Res** is best for PDFs, images, and complex file types.
+ - **Fast** is ideal for simple, text-only documents.
+ - **High Res** is best for PDFs, images, and complex file types.

  <Note>
- During **Advanced** / **High Res** processing, any detected text-based files are processed and billed at the **Basic** / **Fast** rate instead.
+ During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
  </Note>

- - **Platinum** / **VLM** is for challenging documents, including scanned and handwritten content.
+ - **VLM** is for challenging documents, including scanned and handwritten content.

  <Note>
- During **Platinum** / **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** / **High Res** or **Basic** / **Fast** rate instead.
- Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** / **Fast** rate instead. The other files are processed and billed at the **Advanced** / **High Res** rate instead.
+ During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
+ Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
  </Note>

+ - **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
+
+ - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
+ - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
+ - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
+
  </Step>
  <Step title="Transform">
  Your source document is transformed into Unstructured's canonical JSON schema. Regardless of the input document, this JSON schema gives you a [standardized output](/platform/document-elements). It contains more than 20 elements, such as `Header`, `Footer`, `Title`, `NarrativeText`, `Table`, `Image`, and many more. Each document is wrapped in extensive metadata so you can understand languages, file types, sources, hierarchies, and much more.
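
The billing fallbacks in the **Route** notes of this diff can be read as a small decision table. The Python sketch below is illustrative only: `FileKind` and `billed_rate` are hypothetical names invented here, not part of any Unstructured API, and the sketch assumes that each detected file falls into exactly one of three kinds (text-based, PDF or image, or other).

```python
from enum import Enum


class FileKind(Enum):
    """Hypothetical classification of a detected file (an assumption of this sketch)."""
    TEXT_BASED = "text-based"
    PDF_OR_IMAGE = "PDF or image"
    OTHER = "other"


def billed_rate(selected_strategy: str, kind: FileKind) -> str:
    """Return the rate a file is billed at, following the notes in the Route step."""
    if selected_strategy == "Fast":
        # No fallback is described for Fast, so it is billed as selected.
        return "Fast"
    if selected_strategy == "High Res":
        # Text-based files fall back to the Fast rate.
        return "Fast" if kind is FileKind.TEXT_BASED else "High Res"
    if selected_strategy == "VLM":
        if kind is FileKind.PDF_OR_IMAGE:
            return "VLM"
        # Non-PDF, non-image files: text-based -> Fast, all others -> High Res.
        return "Fast" if kind is FileKind.TEXT_BASED else "High Res"
    raise ValueError(f"Unknown strategy: {selected_strategy}")
```

For example, `billed_rate("VLM", FileKind.OTHER)` returns `"High Res"` under these assumptions, because a file that is neither a PDF, an image, nor text-based falls back one level.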
platform/partitioning.mdx
Lines changed: 6 additions & 0 deletions
@@ -36,3 +36,9 @@ To choose one of these strategies, select one of the **Partition Strategy** opti
  these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
  </Note>

+ - **Auto**: Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
+
+ - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
+ - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
+ - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
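
The three **Auto** rules in this diff can be sketched as a per-page (for PDFs) or per-document decision. This is a minimal Python sketch under stated assumptions, not Unstructured's actual implementation: `PageStats`, `choose_strategy`, and the `few` threshold are hypothetical, the docs do not define how many tables or images count as "a few", and routing non-standard layouts or languages to **VLM** is a guess made here to keep the function total.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical summary of one PDF page, or of one whole non-PDF document."""
    image_count: int
    table_count: int
    standard_layout_and_language: bool = True


def choose_strategy(stats: PageStats, few: int = 3) -> str:
    """Pick a partitioning strategy (and billing rate) for one page or document.

    `few` is an assumed cutoff for "a few tables or images"; the real
    threshold is not documented.
    """
    visual_elements = stats.image_count + stats.table_count
    if visual_elements == 0:
        # No images and (likely) no tables: Fast partitioning, Fast rate.
        return "Fast"
    if visual_elements <= few and stats.standard_layout_and_language:
        # Only a few tables or images with standard layouts and languages.
        return "High Res"
    # More than a few tables or images (or, by assumption, unusual layouts).
    return "VLM"
```

Because the decision is made page by page for PDFs, a single PDF could in principle be billed at a mix of rates, one per page.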
platform/workflows.mdx
Lines changed: 59 additions & 95 deletions
@@ -14,12 +14,10 @@ Workflows are crucial for establishing a systematic approach to managing data fl

  ## Create a workflow

-
-
  The Unstructured Platform provides two types of workflow builders:

- - [Automatic](#create-an-automatic-workflow) workflows, which use sensible default workflow settings to enable you to get good-quality results faster.
- - [Custom](#create-a-custom-worklow) workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results.
+ - [Automatic](#create-an-automatic-workflow) or **Build it For Me** workflows, which use sensible default workflow settings to enable you to get good-quality results faster.
+ - [Custom](#create-a-custom-worklow) or **Build it Myself** workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results.

  ### Create an automatic workflow

@@ -35,9 +33,9 @@ To create an automatic workflow:

  1. On the sidebar, click **Workflows**.
  2. Click **New Workflow**.
- 3. Next to **Build it with Me**, click **Create Workflow**.
+ 3. Next to **Build it for Me**, click **Create Workflow**.

- <Note>If a radio button appears instead of **Build it with Me**, select it, and then click **Continue**.</Note>
+ <Note>If a radio button appears instead of **Build it for Me**, select it, and then click **Continue**.</Note>

  4. For **Workflow Name**, enter some unique name for this workflow.
  5. In the **Sources** dropdown list, select your source location.
@@ -46,118 +44,78 @@ To create an automatic workflow:
  <Note>You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.</Note>

  7. Click **Continue**.
- 8. In the **Optimize for** section, select the option to choose one of these preconfigured workflow settings groups. Expand any or all
- of the following options to learn more about these preconfigured settings:
-
- <AccordionGroup>
- <Accordion title="Basic">
- This option is ideal for simple, text-only documents.
-
- The **Basic** option uses the following preconfigured workflow settings:
-
- - **Strategy**: Fast
- - **Image Summarizer**: None
- - **Table Summarizer**: None
- - **Include Page Breaks**: No
- - **Infer Table Structure**: No
- - **Elements to Exclude**: None
- - **Chunk**:
+ 8. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:

- - **Chunker Type**: Chunk By Character
- - **Chunk Options**:
-
- - **Include Original Elements**: No
- - **Max Characters**: 2048
- - **New After N Characters**: 1500
- - **Overlap**: 160
- - **Overlap All**: No
-
- - **Embed**:
+ - Checking this box reprocesses all documents in the source location on every workflow run.
+ - Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents' contents change.
+ 10. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
+ 11. Click **Complete**.

- </Accordion>
- <Accordion title="Advanced">
- This option is best for PDFs, images, and complex file types.
+ By default, this workflow partitions, chunks, and generates embeddings as follows:

- <Note>
- During **Advanced** processing, any detected text-based files are processed and billed at the **Basic** rate instead.
- </Note>
+ - **Partitioner**: **Auto** strategy

- The **Advanced** option uses the following preconfigured workflow settings:
+ Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:

- - **Strategy**: High-Res
- - **Image Summarizer**: GPT-4o
- - **Table Summarizer**: Claude 3.5 Sonnet
- - **Include Page Breaks**: No
- - **Infer Table Structure**: No
- - **Elements to Exclude**: None
- - **Chunk**:
+ - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
+ - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
+ - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.

- - **Chunker Type**: Chunk By Title
- - **Chunk Options**:
+ [Learn about partitioning strategies](/platform/partitioning).
+ [Learn about chunking strategies](/platform/chunking).

- </Accordion>
- <Accordion title="Platinum">
- This option is for your most challenging documents, including scanned and handwritten content.
+ - **Embedder**:

- <Note>
- During **Platinum** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** or **Basic** rate instead.
- Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** rate instead. The other files are processed and billed at the **Advanced** rate instead.
-
- When you use the **Platinum** strategy for PDF files of 200 or more pages, you might notice some errors when
- these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
- </Note>
+ - **Provider**: Azure OpenAI
+ - **Model**: text-embedding-3-large, with 3072 dimensions

- The **Platinum** option uses the following preconfigured workflow settings:
+ [Learn about embedding providers and models](/platform/embedding).

- - **Strategy**: VLM
- - **VLM Provider, Model**: Anthropic, Anthropic Claude 3.5 Sonnet
- - **Chunk**:
+ - **Enrichments**:

- - **Chunker Type**: Chunk By Title
- - **Chunk Options**:
+ This workflow contains no enrichments.

- - **Combine Text Under N Characters**: 0
- - **Include Original Elements**: No
- - **Max Characters**: 2048
- - **Multipage Sections**: No
- - **New After N Characters**: 1500
- - **Overlap**: 160
- - **Overlap All**: No
+ [Learn about available enrichments](/platform/enriching/overview).

- - **Embed**:
+ After this workflow is created, you can change any or all of its settings if you want to. This includes the workflow's
+ source connector, destination connector, partitioning, chunking, and embedding settings. You can also add enrichments
+ To change the workflow's default settings or to add enrichments:

- </Accordion>
- </AccordionGroup>
+ 1. On the sidebar, click **Workflows**.
+ 2. In the list of available workflows, click the workflow that was just created. This opens a visual designer that shows
+ your workflow as a directed acyclic graph (DAG). This DAG contains a node representing each step in the workflow.
+ There is one node for the partitioning step, another node for the chunking step, and so on.
+ 3. To learn how to change a node's settings or to add enrichment nodes, click the **FAQ** button in the flyout pane in
+ the workflow DAG designer.

- 9. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
+ If you did not previously set the workflow to run on a schedule, you can [run the workflow](#edit-delete-or-run-a-workflow) now.

- - Checking this box reprocesses all documents in the source location on every workflow run.
- - Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents' contents change.
+ ### Create a custom workflow

- 10. Click **Continue**.
- 11. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
- 12. Click **Complete**.
- 13. If you did not set the workflow to run on a schedule, you can [run the workflow](#edit-delete-or-run-a-workflow) now.
+ <Tip>
+ If you already have an existing workflow that you want to change, do the following:
+
+ 1. On the sidebar, click **Workflows**.
+ 2. Click the name of the workflow that you want to change.
+ 3. Skip ahead to Step 11 in the following procedure.

- ### Create a custom workflow
+ </Tip>

  <Warning>
  You must first have an existing source connector and destination connector to add to the workflow.
@@ -281,6 +239,12 @@ To create an automatic workflow:
  these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
  </Note>

+ - **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
+
+ - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
+ - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
+ - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
snippets/quickstarts/platform.mdx
Lines changed: 7 additions & 19 deletions
@@ -94,9 +94,9 @@ allowfullscreen

  1. In the sidebar, click **Workflows**.
  2. Click **New Workflow**.
- 3. Next to **Build it with Me**, click **Create Workflow**.
+ 3. Next to **Build it for Me**, click **Create Workflow**.

- <Note>If a radio button appears instead of **Build it with Me**, select it, and then click **Continue**.</Note>
+ <Note>If a radio button appears instead of **Build it for Me**, select it, and then click **Continue**.</Note>

  4. For **Workflow Name**, enter some unique name for this workflow.
  5. In the **Sources** dropdown list, select your source location from Step 3.
@@ -105,26 +105,14 @@ allowfullscreen
  <Note>You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.</Note>

  7. Click **Continue**.
- 8. In the **Optimize for** section, select the option to choose one of these preconfigured workflow settings groups:
-
- - **Basic**: Ideal for simple, text-only documents.
- - **Advanced**: Best for PDFs, images, and complex file types.
- - **Platinum**: For your most challenging documents, including scanned and handwritten content. It uses vision language models (VLMs).
- During processing, files that are not PDFs or images are processed by using the **Advanced** strategy and are charged at the **Advanced** rate instead.
-
- <Note>
- When you use the **Platinum** strategy for PDF files of 200 or more pages, you might notice some errors when
- these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
- </Note>
-
- 9. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
+ 8. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:

  - Checking this box reprocesses all documents in the source location on every workflow run.
  - Unchecking this box causes new documents that have been added to the source location, as well as existing documents in the source location that have had their contents or titles changed, since the last workflow run to be processed on future runs. Other previously processed documents are not processed again.
-
- 10. Click **Continue**.
- 11. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
- 12. Click **Complete**.
+
+ 9. Click **Continue**.
+ 10. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
+ 11. Click **Complete**.
  </Step>
  <Step title="Process the documents">

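
The **Reprocess All** behavior described in the quickstart's step 8 can be sketched as a filter over the documents in a blob storage source. This is only an illustration under stated assumptions: `SourceDoc`, its `modified_at` timestamp, and `docs_to_process` are hypothetical names, and treating "new or changed since the last run" as a single timestamp comparison is a simplification made here, not a description of how Unstructured detects changes. Processing everything on a first run (no previous run) is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class SourceDoc:
    """Hypothetical record for one document in a blob storage source."""
    name: str
    modified_at: datetime  # assumed: time the document was added or last changed


def docs_to_process(
    docs: List[SourceDoc],
    last_run: Optional[datetime],
    reprocess_all: bool,
) -> List[SourceDoc]:
    """Select the documents that a workflow run would process."""
    if reprocess_all or last_run is None:
        # Box checked (or, by assumption, no earlier run): process everything.
        return list(docs)
    # Box unchecked: only documents added or changed since the last run.
    return [doc for doc in docs if doc.modified_at > last_run]
```

With the box unchecked, a document last modified before the previous run is skipped, which matches the statement that other previously processed documents are not processed again.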