ui/chunking.mdx (+64 -11: 64 additions, 11 deletions)
@@ -7,20 +7,20 @@ the limits of an embedding model and to improve retrieval precision. The goal is
7
7
that contain only the information that is relevant to a user's query. You can specify if and how Unstructured chunks
8
8
those elements, based on your intended end use.
9
9
10
-
During chunking, Unstructured uses a basic chunking strategy that attempts to combine two or more consecutive text elements
11
-
into each chunk that fits together within **Max characters**. To determine the best **Max characters** length, see the documentation
10
+
During chunking, Unstructured uses a [basic](#basic-chunking-strategy) chunking strategy that attempts to combine two or more consecutive text elements
11
+
into each chunk that fits together within the [max characters](#max-characters-setting) setting. To determine the best max characters setting, see the documentation
12
12
for the embedding model that you want to use.
13
13
14
-
You can further control this behavior with by-title, by-page, or by-similarity chunking strategies.
15
-
In all cases, Unstructured will only split individual elements if they exceed the specified **Max characters** length.
14
+
You can further control this behavior with [by title](#chunk-by-title-strategy), [by page](#chunk-by-page-strategy), and [by similarity](#chunk-by-similarity-strategy) chunking strategies.
15
+
In all cases, Unstructured will only split individual elements if they exceed the specified max characters length.
16
16
After chunking, you will have document elements of only the following types:
17
17
18
18
- `CompositeElement`: Any text element will become a `CompositeElement` after chunking. A composite element can be a
19
-
combination of two or more original text elements that together fit within the **Max characters** length. It can also be a single
19
+
combination of two or more original text elements that together fit within the max characters setting. It can also be a single
20
20
element that doesn't leave room in the chunk for any others but fits by itself. Or it can be a fragment of an original
21
21
text element that was too big to fit in one chunk and required splitting.
22
-
- `Table`: A table element is not combined with other elements, and if it fits within **Max characters** it will remain as is.
23
-
- `TableChunk`: Large tables that exceed **Max characters** are split into special `TableChunk` elements.
22
+
- `Table`: A table element is not combined with other elements, and if it fits within the max characters setting it will remain as is.
23
+
- `TableChunk`: Large tables that exceed the max characters setting are split into special `TableChunk` elements.
24
24
25
25
Here are a few examples:
26
26
@@ -65,17 +65,70 @@ The following sections provide information about the available chunking strategi
65
65
66
66
## Basic chunking strategy
67
67
68
-
The basic chunking strategy uses only **Max characters** and **New after n characters** to combine sequential elements to maximally fill each chunk.
69
-
This strategy does not use section boundaries, page boundaries, or content similarities to determine the chunks' contents.
68
+
The basic chunking strategy uses only the [max characters](#max-characters-setting) setting (an absolute or "hard" limit) and
69
+
[new after n characters](#new-after-n-characters-setting) setting (an approximate or "soft" limit) to combine sequential elements to maximally
70
+
fill each chunk.
71
+
72
+
This strategy adds elements to a chunk until the new after n characters limit is reached. A new chunk is then started.
73
+
No chunk will exceed the max characters limit. For elements larger than the max characters limit, the text is split into
74
+
multiple chunks at spaces or new lines to avoid cutting words.
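
The soft and hard limits described above can be illustrated with a minimal Python sketch. This is a conceptual approximation only, not Unstructured's implementation: the function names `split_oversized` and `chunk_basic`, the `separator` parameter, and the use of plain strings in place of document elements are all assumptions made for illustration. Table handling is omitted.

```python
from __future__ import annotations


def split_oversized(text: str, max_characters: int) -> list[str]:
    """Split text longer than max_characters, preferring spaces or new lines."""
    pieces = []
    remaining = text.strip()
    while len(remaining) > max_characters:
        # Find the last space or new line at or before the hard limit.
        window = remaining[: max_characters + 1]
        cut = max(window.rfind(" "), window.rfind("\n"))
        if cut <= 0:
            cut = max_characters  # no whitespace found: cut mid-word as a fallback
        pieces.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        pieces.append(remaining)
    return pieces


def chunk_basic(
    elements: list[str],
    max_characters: int = 500,
    new_after_n_chars: int | None = None,
    separator: str = "\n\n",  # hypothetical joiner between combined elements
) -> list[str]:
    """Combine sequential elements using a soft and a hard size limit."""
    soft = max_characters if new_after_n_chars is None else min(new_after_n_chars, max_characters)
    chunks: list[str] = []
    current = ""
    for text in elements:
        if len(text) > max_characters:
            # Oversized element: close the open chunk, then split the element.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_oversized(text, max_characters))
            continue
        candidate = text if not current else current + separator + text
        if current and len(candidate) > max_characters:
            # Adding this element would break the hard limit: start a new chunk.
            chunks.append(current)
            current = text
        else:
            current = candidate
        if len(current) >= soft:
            # Soft limit reached: close the chunk.
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks


elements = ["a" * 40, "b" * 40, "c" * 40, "word " * 60]
chunks = chunk_basic(elements, max_characters=100, new_after_n_chars=70)
print([len(c) for c in chunks])
```

In this sketch, the first two elements are combined and the chunk is closed once it passes the soft limit of 70 characters, while the 300-character element is split at spaces so that no piece exceeds the hard limit of 100 characters.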
75
+
76
+
Table elements are always treated as standalone chunks. If a table is too large, the table is chunked by rows.
77
+
78
+
This strategy does not use section boundaries, page boundaries, or content similarities to determine
79
+
the chunks' contents.
80
+
81
+
The following diagram illustrates conceptually how a candidate element is chunked to fit within the max characters (hard) and
82
+
new after n characters (soft) limits:
83
+
84
+

85
+
86
+
Context between chunks can be maintained by using the [overlap](#overlap-setting) and [overlap all](#overlap-all-setting) settings.
87
+
The overlap setting repeats the specified number of characters from the end of the previous chunk at the beginning of the next chunk.
88
+
By default, overlap is applied only to elements that are too large to fit in one chunk and must be split. If overlap all is set to true, the overlap is applied between all chunks.
89
+
90
+
The overlap setting is based on the number of characters, so words might be split.
91
+
The overlap setting's character count is included in the chunk size; nonetheless, the chunk's total size must not exceed the max characters setting.
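
The overlap behavior described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Unstructured's implementation: `split_with_overlap` and `chunk_with_overlap` are hypothetical names, each element is treated as its own chunk (no combining), and a single space is assumed as the joiner between the carried-over text and the next element.

```python
from __future__ import annotations


def split_with_overlap(text: str, max_characters: int, overlap: int) -> list[str]:
    """Split text into pieces of at most max_characters, where each piece
    after the first starts with the last `overlap` characters of the previous one."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    step = max_characters - overlap  # overlap counts toward the chunk size
    pieces = []
    start = 0
    while True:
        pieces.append(text[start : start + max_characters])
        if start + max_characters >= len(text):
            break
        start += step
    return pieces


def chunk_with_overlap(
    elements: list[str],
    max_characters: int,
    overlap: int,
    overlap_all: bool = False,
) -> list[str]:
    """One chunk per element; oversized elements are split with overlap.
    With overlap_all=True, every chunk also starts with the tail of the previous chunk."""
    chunks: list[str] = []
    tail = ""
    for text in elements:
        body = f"{tail} {text}" if (overlap_all and tail) else text
        if len(body) > max_characters:
            pieces = split_with_overlap(body, max_characters, overlap)
        else:
            pieces = [body]
        chunks.extend(pieces)
        tail = pieces[-1][-overlap:] if overlap > 0 else ""
    return chunks


elements = ["alpha beta", "gamma delta", "abcdefghijklmnopqrstuvwxyz"]
default_chunks = chunk_with_overlap(elements, max_characters=16, overlap=3)
all_chunks = chunk_with_overlap(elements, max_characters=16, overlap=3, overlap_all=True)
print(default_chunks)
print(all_chunks)
```

Because the split is purely character-based, the carried-over text can begin or end in the middle of a word, and every piece, including its overlap prefix, stays within `max_characters`.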
92
+
93
+
The following diagram illustrates conceptually how chunks are calculated by setting overlap all to true or false. In this diagram,
94
+
setting overlap all to true results in a portion at the end of each chunk always being copied over to the beginning of the next chunk.
95
+
The default (or setting overlap all to false) results in only a portion at the end of Element 6 Part 1 in Chunk 2 being copied over
96
+
to the beginning of Element 6 Part 2 in Chunk 3, because Element 6 is larger than the max characters setting:
97
+
98
+

70
99
71
100
To use this chunking strategy, choose **Chunk by character** in the **Chunkers** section of a **Chunker** node in a workflow.
72
101
73
102
## Chunk by title strategy
74
103
75
-
The by-title chunking strategy attempts to preserve section boundaries when determining the chunks' contents.
104
+
The by-title chunking strategy attempts to preserve section boundaries when determining the chunks' contents, primarily when
105
+
a **Title** element is encountered. The title is used as the section header for the chunk. The max characters and new after n
106
+
characters settings are still respected.
107
+
108
+
The following diagram illustrates conceptually how elements are chunked when **Title** elements are encountered (see
109
+
Chunks 1, 4, and 6), while still respecting the max characters and new after n characters settings (see Chunks 2 and 3):
110
+
111
+

112
+
76
113
A single chunk should not contain text that occurred in two different sections. When a new section starts, the existing
77
114
chunk is closed and a new one is started, even if the next element would fit in the prior chunk.
78
115
116
+
The following conceptual diagram illustrates this point, in that many **Title** elements can produce many relatively small chunks:
117
+
118
+

119
+
120
+
To reduce the number of chunks, you can use the [combine text under n characters](#combine-text-under-n-characters-setting) setting. This
121
+
setting attempts to combine elements into a single chunk until the combine text under n characters limit is reached, as shown in the
122
+
following conceptual diagram:
123
+
124
+

125
+
126
+
Setting combine text under n characters to a value equal to or greater than the new after n characters setting is not recommended, as it
127
+
can result in substantially longer chunks overall and can also push titles by themselves into previous chunks. The following conceptual
128
+
diagram illustrates this point:
129
+
130
+

131
+
79
132
To use this chunking strategy, choose **Chunk by title** in the **Chunkers** section of a **Chunker** node in a workflow.
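
The by-title behavior described in this section can be approximated with a short Python sketch. It is a conceptual illustration only, not Unstructured's implementation: the function name `chunk_by_title_sketch`, the `(kind, text)` tuples standing in for document elements, and the exact merge rule for small chunks are assumptions. Splitting of oversized elements and the new after n characters limit are omitted for brevity.

```python
from __future__ import annotations


def chunk_by_title_sketch(
    elements: list[tuple[str, str]],
    max_characters: int = 500,
    combine_text_under_n_chars: int = 0,
    separator: str = "\n\n",  # hypothetical joiner between elements
) -> list[str]:
    """Start a new chunk at every Title element, then optionally merge small chunks."""
    # Step 1: close the open chunk when a Title arrives or the hard limit would be exceeded.
    pre_chunks: list[str] = []
    current: list[str] = []
    for kind, text in elements:
        candidate_len = len(separator.join(current + [text]))
        if current and (kind == "Title" or candidate_len > max_characters):
            pre_chunks.append(separator.join(current))
            current = []
        current.append(text)
    if current:
        pre_chunks.append(separator.join(current))

    # Step 2: merge a chunk into the previous one while the previous one is still
    # under the combine threshold and the result fits within the hard limit.
    merged: list[str] = []
    for chunk in pre_chunks:
        if (
            merged
            and len(merged[-1]) < combine_text_under_n_chars
            and len(merged[-1]) + len(separator) + len(chunk) <= max_characters
        ):
            merged[-1] = merged[-1] + separator + chunk
        else:
            merged.append(chunk)
    return merged


elements = [
    ("Title", "Intro"),
    ("NarrativeText", "Short intro text."),
    ("Title", "Setup"),
    ("NarrativeText", "Install the tool."),
    ("Title", "Usage"),
    ("NarrativeText", "Run the tool on a file."),
]
no_combine = chunk_by_title_sketch(elements, max_characters=200)
some_combine = chunk_by_title_sketch(elements, max_characters=200, combine_text_under_n_chars=40)
full_combine = chunk_by_title_sketch(elements, max_characters=200, combine_text_under_n_chars=100)
print(len(no_combine), len(some_combine), len(full_combine))
```

With no combining, each title opens its own small chunk. Raising the hypothetical `combine_text_under_n_chars` value merges neighboring sections, which reduces the chunk count but lets one chunk span several sections.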
80
133
81
134
## Chunk by page strategy
@@ -86,7 +139,7 @@ chunk is closed and a new one is started, even if the next element would fit in
86
139
87
140
To use this chunking strategy, choose **Chunk by page** in the **Chunkers** section of a **Chunker** node in a workflow.
88
141
89
-
## Chunk By similarity strategy
142
+
## Chunk by similarity strategy
90
143
91
144
The by-similarity chunking strategy uses the
92
145
[sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model