Commit 02ab04b

Chunker maximum character size per embedding model (#769)
1 parent afdcb6b commit 02ab04b

4 files changed: 53 additions & 1 deletion

api-reference/workflow/workflows.mdx

Lines changed: 6 additions & 1 deletion
@@ -1830,7 +1830,12 @@ Fields for `settings` include:
 
 ### Embedder node
 
-An **Embedder** node has a `type` of `embed`.
+An **Embedder** node has a `type` of `embed`.
+
+<Warning>
+If you add an embedder node, you must set the workflow's chunker node's `max_characters` setting to a value at or below Unstructured's recommended
+maximum chunk size for your specified embedding model. [Learn more](/ui/embedding#chunk-sizing-and-embedding-models).
+</Warning>
 
 [Learn about the available embedding providers and models](/ui/embedding).
 
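
The constraint in the warning above can be checked before a workflow is submitted. The following is only a sketch under stated assumptions: the function `check_chunk_size`, the node dictionary shape, and the `chunk` type value are hypothetical and are not taken from the Unstructured API. Only the `embed` type, the `max_characters` setting name, and the limits (copied from the table added by this commit, keyed by display name) come from the diff.

```python
# Recommended limits copied from the table added in this commit.
# Keys are the display names from that table; real model identifiers may differ.
RECOMMENDED_MAX_CHARACTERS = {
    "Cohere Embed English": 1792,
    "Titan Multimodal Embeddings G1": 896,
    "Text Embedding 3 Large": 28672,
    "Voyage Code 2": 56000,
}


def check_chunk_size(nodes: list[dict], model: str) -> None:
    """Raise ValueError if a chunker's max_characters exceeds the model's limit.

    Assumed (hypothetical) node shape: {"type": str, "settings": dict}.
    """
    # The warning applies only to workflows that contain an embedder node.
    if not any(node.get("type") == "embed" for node in nodes):
        return
    limit = RECOMMENDED_MAX_CHARACTERS[model]
    for node in nodes:
        if node.get("type") != "chunk":  # "chunk" is an assumed type value
            continue
        max_chars = node.get("settings", {}).get("max_characters")
        if max_chars is not None and max_chars > limit:
            raise ValueError(
                f"max_characters={max_chars} exceeds the recommended "
                f"limit of {limit} for {model}"
            )


# Usage: a chunker set to 2048 characters paired with a 1792-character limit.
nodes = [
    {"type": "chunk", "settings": {"max_characters": 2048}},
    {"type": "embed", "settings": {}},
]
try:
    check_chunk_size(nodes, "Cohere Embed English")
except ValueError as err:
    print(err)
```

A check like this fails early on the client side, before the workflow runs and hits the embedding model's limit.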

snippets/general-shared-text/chunk-limits-embedding-models.mdx

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+If your workflow has an [Embedder](/ui/embedding) node, your workflow's [Chunker](/ui/chunking) node settings must stay within the selected embedding model's token limits.
+Exceeding these limits will cause workflow failures.
+
+Set your **Chunker** node's **Max Characters** to a value at or below Unstructured's recommended maximum chunk size for your selected embedding model,
+as listed in the following table's last column.
+
+| Embedding model | Dimensions | Tokens | Chunker Max Characters<sup>*</sup> |
+|---|---|---|---|
+| _Amazon Bedrock_ | | | |
+| Cohere Embed English | 1024 | 512 | 1792 |
+| Cohere Embed Multilingual | 1024 | 512 | 1792 |
+| Titan Embeddings G1 - Text | 1536 | 8192 | 28672 |
+| Titan Multimodal Embeddings G1 | 1024 | 256 | 896 |
+| Titan Text Embeddings V2 | 1024 | 8192 | 28672 |
+| _Azure OpenAI_ | | | |
+| Text Embedding 3 Large | 3072 | 8192 | 28672 |
+| Text Embedding 3 Small | 1536 | 8192 | 28672 |
+| Text Embedding Ada 002 | 1536 | 8192 | 28672 |
+| _Together AI_ | | | |
+| M2-Bert 80M 32K Retrieval | 768 | 8192 | 28672 |
+| _Voyage AI_ | | | |
+| Voyage 3 | 1024 | 32000 | 112000 |
+| Voyage 3 Large | 1024 | 32000 | 112000 |
+| Voyage 3 Lite | 512 | 32000 | 112000 |
+| Voyage Code 2 | 1536 | 16000 | 56000 |
+| Voyage Code 3 | 1024 | 32000 | 112000 |
+| Voyage Finance 2 | 1024 | 32000 | 112000 |
+| Voyage Law 2 | 1024 | 16000 | 56000 |
+| Voyage Multimodal 3 | 1024 | 32000 | 112000 |
+
+<sup>*</sup> This is an approximate value, determined by multiplying the embedding model's token limit by 3.5.
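
The footnote's arithmetic can be reproduced in a few lines of Python. This is a minimal sketch: the helper name `recommended_max_characters` and the constant name are hypothetical, and only the 3.5 multiplier and the token limits come from the table above.

```python
# Approximate characters-per-token multiplier stated in the table's footnote.
CHARS_PER_TOKEN = 3.5


def recommended_max_characters(token_limit: int) -> int:
    """Return the approximate Chunker Max Characters value for a token limit.

    Hypothetical helper: it only mirrors the footnote's arithmetic
    (token limit x 3.5) and is not part of any Unstructured API.
    """
    return int(token_limit * CHARS_PER_TOKEN)


# Token limits that appear in the table above.
for tokens in (256, 512, 8192, 16000, 32000):
    print(tokens, recommended_max_characters(tokens))
```

Each printed pair should match the Tokens and Chunker Max Characters columns for the corresponding rows, for example 512 tokens and 1792 characters.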

ui/embedding.mdx

Lines changed: 6 additions & 0 deletions
@@ -75,3 +75,9 @@ When choosing an embedding model, be sure to pay attention to the number of dime
 embeddings field of your destination connector's table, collection, or index.
 
 <Note>You can change a workflow's preconfigured provider only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.</Note>
+
+## Chunk sizing and embedding models
+
+import ChunkLimitsEmbeddingModels from '/snippets/general-shared-text/chunk-limits-embedding-models.mdx';
+
+<ChunkLimitsEmbeddingModels />

ui/workflows.mdx

Lines changed: 10 additions & 0 deletions
@@ -218,6 +218,11 @@ If you did not previously set the workflow to run on a schedule, you can [run th
 
 - Click **Transform** to add a **Partitioner** or **Embedder** node. [Learn more](#custom-workflow-node-types).
 
+<Warning>
+If you add an **Embedder** node, you must set the **Chunker** node's **Max Characters** setting to a value at or below Unstructured's recommended
+maximum chunk size for your selected embedding model. [Learn more](/ui/embedding#chunk-sizing-and-embedding-models).
+</Warning>
+
 <Tip>
 Make sure to add nodes in the correct order. If you are unsure, see the usage hints in the blue note that appears
 in the node's settings pane.
@@ -335,6 +340,11 @@ import DeprecatedModelsUI from '/snippets/general-shared-text/deprecated-models-
 <Accordion title="Embedder node">
 For **Select Embedding Model**, select one of the available models that are shown.
 
+<Warning>
+If you add an **Embedder** node, you must set the **Chunker** node's **Max Characters** setting to a value at or below Unstructured's recommended
+maximum chunk size for your selected embedding model. [Learn more](/ui/embedding#chunk-sizing-and-embedding-models).
+</Warning>
+
 Learn more:
 
 - [Embedding overview](/ui/embedding)