Commit 7370c28

Partitioning strategies: separate open source from Platform (#516)
1 parent 5f57324 commit 7370c28

4 files changed, with 20 additions and 3 deletions

open-source/concepts/partitioning-strategies.mdx

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 title: Partitioning strategies
 ---

-import PartitioningStrategies from '/snippets/concepts/partitioning-strategies.mdx';
+import PartitioningStrategies from '/snippets/concepts/partitioning-strategies-oss.mdx';

 <PartitioningStrategies/>

platform-api/partition-api/partitioning.mdx

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 title: Partitioning strategies
 ---

-import PartitioningStrategies from '/snippets/concepts/partitioning-strategies.mdx';
+import PartitioningStrategies from '/snippets/concepts/partitioning-strategies-platform.mdx';

 <PartitioningStrategies/>
88

snippets/concepts/partitioning-strategies.mdx renamed to snippets/concepts/partitioning-strategies-oss.mdx

Lines changed: 0 additions & 1 deletion
@@ -15,7 +15,6 @@ To give you an example, the `fast` strategy is roughly 100x faster than leading
 * `fast`: The "rule-based" strategy leverages traditional NLP extraction techniques to quickly pull all the text elements. "Fast" strategy is not recommended for image-based file types.
 * `hi_res`: The "model-based" strategy identifies the layout of the document. The advantage of "hi_res" is that it uses the document layout to gain additional information about document elements. We recommend using this strategy if your use case is highly sensitive to correct classifications for document elements.
 * `ocr_only`: Another "model-based" strategy that leverages Optical Character Recognition to extract text from the image-based files.
-* `vlm`: Uses a vision language model (VLM) to extract text from these file types: `.bmp`, `.gif`, `.heic`, `.jpeg`, `.jpg`, `.pdf`, `.png`, `.tiff`, and `.webp`.

 **These strategies are available on the following partition functions:**

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+
+For certain document types, such as images and PDFs, for example, Unstructured products offer a variety of different
+ways to preprocess them, controlled by the `strategy` parameter.
+
+PDF documents, for example, vary in quality and complexity. In simple cases, traditional NLP extraction techniques may
+be enough to extract all the text out of a document. In other cases, advanced image-to-text models are required
+to process a PDF. You can think of the strategies as being "rule-based" workflows (thus they are "fast"), or
+"model-based" workflows (slower workflow because it requires model inference, but you get "higher resolution", thus "hi_res").
+When choosing a partitioning strategy for your files, you have to be mindful of the quality/speed trade-off.
+To give you an example, the `fast` strategy is roughly 100x faster than leading image-to-text models.
+
+**Available options:**
+
+* `auto` (default strategy): The "auto" strategy will choose the partitioning strategy based on document characteristics and the function kwargs.
+* `fast`: The "rule-based" strategy leverages traditional NLP extraction techniques to quickly pull all the text elements. "Fast" strategy is not recommended for image-based file types.
+* `hi_res`: The "model-based" strategy identifies the layout of the document. The advantage of "hi_res" is that it uses the document layout to gain additional information about document elements. We recommend using this strategy if your use case is highly sensitive to correct classifications for document elements.
+* `ocr_only`: Another "model-based" strategy that leverages Optical Character Recognition to extract text from the image-based files.
+* `vlm`: Uses a vision language model (VLM) to extract text from these file types: `.bmp`, `.gif`, `.heic`, `.jpeg`, `.jpg`, `.pdf`, `.png`, `.tiff`, and `.webp`.
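
Reviewer note: the option list added in this file can be read as a rough decision rule for the `strategy` value. The sketch below is only an illustration of that reading, not code from this commit or from any Unstructured library. The helper `choose_strategy`, its keyword arguments, and the decision to treat every listed `vlm` file type except `.pdf` as "image-based" are all assumptions. Since this commit removes `vlm` from the open source snippet, the `use_vlm` flag presumably only makes sense for the Platform variant.

```python
from pathlib import Path

# File types this diff lists for the `vlm` strategy.
VLM_FILE_TYPES = {".bmp", ".gif", ".heic", ".jpeg", ".jpg", ".pdf", ".png", ".tiff", ".webp"}

# Assumption: everything in that list except PDF counts as "image-based".
IMAGE_FILE_TYPES = VLM_FILE_TYPES - {".pdf"}


def choose_strategy(
    filename: str,
    *,
    classification_sensitive: bool = False,
    prefer_speed: bool = False,
    use_vlm: bool = False,
) -> str:
    """Hypothetical helper: map a file and caller priorities to a strategy name."""
    suffix = Path(filename).suffix.lower()

    # `vlm` applies only to the file types listed for it.
    if use_vlm and suffix in VLM_FILE_TYPES:
        return "vlm"

    # `hi_res` is recommended when element classification must be accurate.
    if classification_sensitive:
        return "hi_res"

    # `fast` is not recommended for image-based files, so fall back to OCR.
    if suffix in IMAGE_FILE_TYPES:
        return "ocr_only"

    # `fast` trades quality for speed on files with extractable text.
    if prefer_speed:
        return "fast"

    # Otherwise keep the documented default.
    return "auto"
```

For example, `choose_strategy("scan.png", prefer_speed=True)` still returns `"ocr_only"`, because the list advises against `fast` for image-based files.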
