Commit 85a8f91

Platform: Recommend using Auto partitioning strategy whenever possible (#492)
1 parent defd636 commit 85a8f91

5 files changed

+25
-70
lines changed

api-reference/how-to/choose-partitioning-strategy.mdx

Lines changed: 2 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -42,12 +42,6 @@ See [Changing partition strategy for a PDF](/api-reference/api-services/examples
4242

4343
## Auto partitioning strategy logic
4444

45-
Setting `--strategy` or `strategy` to `auto` leaves the decision up to Unstructured on a file-by-file basis about which partitioning strategy to use. Specifically:
45+
Setting `--strategy` or `strategy` to `auto` leaves the decision up to Unstructured on a page-by-page basis about which partitioning strategy to use.
4646

47-
- If the file is an image, the `hi_res` strategy is used for that file. The `layout_v1.0.0` high-resolution object detection model is used.
48-
- If the file is a PDF, the local processing logic or Unstructured tries to detect whether there are any embedded tables or images in that file.
49-
50-
- If no embedded tables or images are detected, the `fast` strategy is used for that file. No high-resolution object detection model is used.
51-
- If at least one embedded table or image is found, the `hi_res` strategy is used for that file. The `layout_v1.0.0` high-resolution object detection model is used.
52-
53-
- If `--strategy` or `strategy` is not specified, the `auto` strategy is used by default.
47+
If `--strategy` or `strategy` is not specified, the `auto` strategy is used by default.

platform/overview.mdx

Lines changed: 4 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -29,33 +29,16 @@ flowchart LR
2929
Connect-->Route-->Transform-->Chunk-->Enrich-->Embed-->Persist
3030
```
3131

32+
import PlatformPartitioningStrategies from '/snippets/general-shared-text/platform-partitioning-strategies.mdx';
33+
3234
<Steps>
3335
<Step title="Connect">
3436
The Unstructured Platform offers multiple [source connectors](/platform/sources/overview) to connect to your data in its existing location.
3537
</Step>
3638
<Step title="Route">
37-
Routing determines which strategy Unstructured Platform uses to transforming your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation:
39+
Routing determines which strategy Unstructured Platform uses to transform your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides four [partitioning](/platform/partitioning) strategies for document transformation, as follows.
3840

39-
- **Fast** is ideal for simple, text-only documents.
40-
- **High Res** is best for PDFs, images, and complex file types.
41-
42-
<Note>
43-
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
44-
</Note>
45-
46-
- **VLM** is for challenging documents, including scanned and handwritten content.
47-
48-
<Note>
49-
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
50-
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
51-
</Note>
52-
53-
- **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
54-
55-
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
56-
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
57-
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
58-
41+
<PlatformPartitioningStrategies />
5942
</Step>
6043
<Step title="Transform">
6144
Your source document is transformed into Unstructured's canonical JSON schema. Regardless of the input document, this JSON schema gives you a [standardized output](/platform/document-elements). It contains more than 20 elements, such as `Header`, `Footer`, `Title`, `NarrativeText`, `Table`, `Image`, and many more. Each document is wrapped in extensive metadata so you can understand languages, file types, sources, hierarchies, and much more.

platform/partitioning.mdx

Lines changed: 3 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -15,30 +15,11 @@ model-based workflows, which can be slower and costlier because they require a m
1515
When you choose a partitioning strategy for your files, you should be mindful of these speed, cost, and quality trade-offs.
1616
For example, the **Fast** strategy can be about 100 times faster than leading image-to-text models.
1717

18-
To choose one of these strategies, select one of the **Partition Strategy** options in the **Partitioner** node of a workflow:
18+
To choose one of these strategies, select one of the following four **Partition Strategy** options in the **Partitioner** node of a workflow.
1919

2020
<Note>You can change a workflow's preconfigured strategy only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.</Note>
2121

22-
- **Fast**: This strategy is ideal for simple, text-based documents.
23-
- **High Res**: This strategy is best for PDFs, images, and complex file types.
22+
import PlatformPartitioningStrategies from '/snippets/general-shared-text/platform-partitioning-strategies.mdx';
2423

25-
<Note>
26-
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
27-
</Note>
28-
29-
- **VLM**: For your most challenging documents, including scanned and handwritten content.
30-
31-
<Note>
32-
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
33-
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
34-
35-
When you use the **VLM** strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when
36-
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
37-
</Note>
38-
39-
- **Auto**: Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
40-
41-
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
42-
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
43-
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
24+
<PlatformPartitioningStrategies />
4425

platform/workflows.mdx

Lines changed: 5 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -197,20 +197,15 @@ If you did not previously set the workflow to run on a schedule, you can [run th
197197

198198
#### Custom workflow node types
199199

200+
import PlatformPartitioningStrategies from '/snippets/general-shared-text/platform-partitioning-strategies.mdx';
201+
200202
<AccordionGroup>
201203
<Accordion title="Partitioner node">
202-
For **Partition Strategy**, choose one of the following:
203-
204-
- **Fast**: Ideal for simple, text-only documents.
205-
- **High Res**: Best for PDFs, images, and complex file types.
206-
207-
<Note>
208-
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
209-
</Note>
204+
Choose one of the four available partitioning strategies.
210205

211-
- **VLM**: For your most challenging documents, including scanned and handwritten content.
206+
<PlatformPartitioningStrategies />
212207

213-
You must also choose a VLM provider and model. Available choices include:
208+
For **VLM**, you must also choose a VLM provider and model. Available choices include:
214209

215210
- **Anthropic**:
216211

@@ -232,19 +227,10 @@ If you did not previously set the workflow to run on a schedule, you can [run th
232227
- **Meta Llama 3.2 11B Instruct**
233228

234229
<Note>
235-
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
236-
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
237-
238230
When you use the **VLM** strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when
239231
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
240232
</Note>
241233

242-
- **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
243-
244-
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
245-
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
246-
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
247-
248234
[Learn more](/platform/partitioning).
249235
</Accordion>
250236
<Accordion title="Chunker node">
snippets/general-shared-text/platform-partitioning-strategies.mdx

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
Unstructured recommends that you choose the **Auto** partitioning strategy in most cases. With **Auto**, Unstructured does all
2+
the heavy lifting, optimizing at runtime for the highest quality at the lowest cost page-by-page.
3+
4+
You should consider the following additional strategies only if you are absolutely sure that your documents are of the same
5+
type. Each of the following strategies is best suited for specific situations. Choosing one of these
6+
strategies other than **Auto** for sets of documents of different types could produce undesirable results,
7+
including reduction in transformation quality.
8+
9+
- **VLM**: For the highest-quality transformation of these file types: `.bmp`, `.gif`, `.heic`, `.jpeg`, `.jpg`, `.pdf`, `.png`, `.tiff`, and `.webp`.
10+
- **High Res**: For all other [supported file types](/platform/supported-file-types), and for the generation of bounding box coordinates.
11+
- **Fast**: For text-only documents.
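
The guidance in this new snippet can be read as a small decision rule. The following is a minimal sketch of that rule, not part of the commit and not an Unstructured API: the helper name `suggest_strategy`, its flags, and the choice to treat "documents of the same type" as "files sharing one extension" are all assumptions made for illustration.

```python
from pathlib import Path

# File extensions that the snippet lists for the VLM strategy.
VLM_EXTENSIONS = {
    ".bmp", ".gif", ".heic", ".jpeg", ".jpg", ".pdf", ".png", ".tiff", ".webp",
}


def suggest_strategy(filenames, text_only=False, need_bounding_boxes=False):
    """Hypothetical helper: suggest a Partition Strategy name for a set of files.

    Assumption: "same type" is approximated as "all files share one extension".
    """
    extensions = {Path(name).suffix.lower() for name in filenames}

    # Mixed (or unknown) document types: the snippet recommends Auto.
    if len(extensions) != 1:
        return "Auto"

    # High Res is the strategy named for generating bounding box coordinates.
    if need_bounding_boxes:
        return "High Res"

    # Fast is for text-only documents (the caller asserts this).
    if text_only:
        return "Fast"

    # VLM for the listed image and PDF types; High Res for everything else.
    (extension,) = extensions
    return "VLM" if extension in VLM_EXTENSIONS else "High Res"


if __name__ == "__main__":
    print(suggest_strategy(["report.pdf", "notes.docx"]))
    print(suggest_strategy(["scan-01.png", "scan-02.png"]))
```

Returning **Auto** whenever the set is mixed mirrors the snippet's recommendation that the other strategies be considered only when the documents are known to be of the same type.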
