api-reference/how-to/choose-partitioning-strategy.mdx
Lines changed: 2 additions & 8 deletions
@@ -42,12 +42,6 @@ See [Changing partition strategy for a PDF](/api-reference/api-services/examples
42
42
43
43
## Auto partitioning strategy logic
44
44
45
-
Setting `--strategy` or `strategy` to `auto` leaves the decision up to Unstructured on a file-by-file basis about which partitioning strategy to use. Specifically:
45
+
Setting `--strategy` or `strategy` to `auto` leaves the decision up to Unstructured on a page-by-page basis about which partitioning strategy to use.
46
46
47
-
- If the file is an image, the `hi_res` strategy is used for that file. The `layout_v1.0.0` high-resolution object detection model is used.
48
-
- If the file is a PDF, the local processing logic or Unstructured tries to detect whether there are any embedded tables or images in that file.
49
-
50
-
- If no embedded tables or images are detected, the `fast` strategy is used for that file. No high-resolution object detection model is used.
51
-
- If at least one embedded table or image is found, the `hi_res` strategy is used for that file. The `layout_v1.0.0` high-resolution object detection model is used.
52
-
53
-
- If `--strategy` or `strategy` is not specified, the `auto` strategy is used by default.
47
+
If `--strategy` or `strategy` is not specified, the `auto` strategy is used by default.
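For review purposes, the file-by-file rules that this hunk removes can be sketched as follows. This is a minimal illustration of the removed wording only, not Unstructured's implementation: `FileInfo`, `effective_strategy`, and `resolve_auto_per_file` are hypothetical names, and the sketch returns `None` for file types the removed text does not cover.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    """Hypothetical summary of a file; not a type from any Unstructured library."""
    is_image: bool = False
    is_pdf: bool = False
    has_embedded_tables_or_images: bool = False


def effective_strategy(strategy: Optional[str] = None) -> str:
    """An unspecified strategy falls back to `auto`, as the hunk states."""
    return strategy if strategy is not None else "auto"


def resolve_auto_per_file(info: FileInfo) -> Optional[str]:
    """Apply the removed file-by-file rules; None means the text does not say."""
    if info.is_image:
        return "hi_res"
    if info.is_pdf:
        # At least one embedded table or image selects hi_res; otherwise fast.
        return "hi_res" if info.has_embedded_tables_or_images else "fast"
    return None


if __name__ == "__main__":
    print(effective_strategy())
    print(resolve_auto_per_file(FileInfo(is_pdf=True)))
```

The replacement line in this hunk says the decision is now made page by page, so a per-file function like this one reflects only the old description.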
The Unstructured Platform offers multiple [source connectors](/platform/sources/overview) to connect to your data in its existing location.
35
37
</Step>
36
38
<Step title="Route">
37
-
Routing determines which strategy Unstructured Platform uses to transforming your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation:
39
+
Routing determines which strategy Unstructured Platform uses to transform your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides four [partitioning](/platform/partitioning) strategies for document transformation, as follows.
38
40
39
-
- **Fast** is ideal for simple, text-only documents.
40
-
- **High Res** is best for PDFs, images, and complex file types.
41
-
42
-
<Note>
43
-
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
44
-
</Note>
45
-
46
-
- **VLM** is for challenging documents, including scanned and handwritten content.
47
-
48
-
<Note>
49
-
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
50
-
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
51
-
</Note>
52
-
53
-
- **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
54
-
55
-
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
56
-
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
57
-
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
58
-
41
+
<PlatformPartitioningStrategies />
59
42
</Step>
60
43
<Step title="Transform">
61
44
Your source document is transformed into Unstructured's canonical JSON schema. Regardless of the input document, this JSON schema gives you a [standardized output](/platform/document-elements). It contains more than 20 elements, such as `Header`, `Footer`, `Title`, `NarrativeText`, `Table`, `Image`, and many more. Each document is wrapped in extensive metadata so you can understand languages, file types, sources, hierarchies, and much more.
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
27
-
</Note>
28
-
29
-
- **VLM**: For your most challenging documents, including scanned and handwritten content.
30
-
31
-
<Note>
32
-
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
33
-
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
34
-
35
-
When you use the **VLM** strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when
36
-
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
37
-
</Note>
38
-
39
-
- **Auto**: Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
40
-
41
-
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
42
-
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
43
-
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
For **Partition Strategy**, choose one of the following:
203
-
204
-
- **Fast**: Ideal for simple, text-only documents.
205
-
- **High Res**: Best for PDFs, images, and complex file types.
206
-
207
-
<Note>
208
-
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
209
-
</Note>
204
+
Choose from one of four available partitioning strategies.
210
205
211
-
- **VLM**: For your most challenging documents, including scanned and handwritten content.
206
+
<PlatformPartitioningStrategies />
212
207
213
-
You must also choose a VLM provider and model. Available choices include:
208
+
For **VLM**, you must also choose a VLM provider and model. Available choices include:
214
209
215
210
- **Anthropic**:
216
211
@@ -232,19 +227,10 @@ If you did not previously set the workflow to run on a schedule, you can [run th
232
227
- **Meta Llama 3.2 11B Instruct**
233
228
234
229
<Note>
235
-
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
236
-
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
237
-
238
230
When you use the **VLM** strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when
239
231
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
240
232
</Note>
241
233
242
-
- **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
243
-
244
-
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
245
-
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
246
-
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
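The **Auto** routing rules and the **VLM** billing fallback described in the removed lines of this hunk can be sketched roughly as below. This is an illustration under stated assumptions, not the Platform's actual logic: the function names are hypothetical, the docs do not define "a few" (the `few` parameter and its default of 3 are invented for the sketch), and the "standard layouts and languages" condition is not modeled.

```python
def route_auto(image_count: int, likely_table_count: int, few: int = 3) -> str:
    """Pick a strategy (and billing rate) for one PDF page or one non-PDF document.

    `few` is a hypothetical cutoff; the documentation does not give a number.
    """
    if image_count < 0 or likely_table_count < 0 or few < 1:
        raise ValueError("counts must be non-negative and few must be >= 1")
    total = image_count + likely_table_count
    if total == 0:
        return "Fast"      # no images and likely no tables
    if total <= few:
        return "High Res"  # only a few tables or images
    return "VLM"           # more than a few tables or images


def vlm_billing_rate(is_pdf_or_image: bool, is_text_based: bool) -> str:
    """Rate applied when the VLM strategy is selected, per the removed note."""
    if is_pdf_or_image:
        return "VLM"
    # Non-PDF, non-image files fall back: text-based to Fast, the rest to High Res.
    return "Fast" if is_text_based else "High Res"


if __name__ == "__main__":
    # One entry per PDF page: (image_count, likely_table_count).
    pages = [(0, 0), (1, 1), (4, 3)]
    print([route_auto(images, tables) for images, tables in pages])
```

Because routing is per page for PDF files, a single PDF can be billed at a mix of rates, one per page.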