Commit 6091146

Chunking: more conceptual information and illustrative diagrams (#605)
1 parent d064987 commit 6091146

7 files changed: +64 -11 lines changed

img/chunking/Chunking_By_Title.png

133 KB

Five more image files (names not shown in this view): 89.6 KB, 76.1 KB, 121 KB, 122 KB, 67.1 KB

ui/chunking.mdx

Lines changed: 64 additions & 11 deletions
@@ -7,20 +7,20 @@ the limits of an embedding model and to improve retrieval precision. The goal is
77
that contain only the information that is relevant to a user's query. You can specify if and how Unstructured chunks
88
those elements, based on your intended end use.
99

10-
During chunking, Unstructured uses a basic chunking strategy that attempts to combine two or more consecutive text elements
11-
into each chunk that fits together within **Max characters**. To determine the best **Max characters** length, see the documentation
10+
During chunking, Unstructured uses a [basic](#basic-chunking-strategy) chunking strategy that attempts to combine two or more consecutive text elements
11+
into each chunk that fits together within the [max characters](#max-characters-setting) setting. To determine the best max characters setting, see the documentation
1212
for the embedding model that you want to use.
1313

14-
You can further control this behavior with by-title, by-page, or by-similarity chunking strategies.
15-
In all cases, Unstructured will only split individual elements if they exceed the specified **Max characters** length.
14+
You can further control this behavior with [by title](#chunk-by-title-strategy), [by page](#chunk-by-page-strategy), and [by similarity](#chunk-by-similarity-strategy) chunking strategies.
15+
In all cases, Unstructured will only split individual elements if they exceed the specified max characters length.
1616
After chunking, you will have document elements of only the following types:
1717

1818
- `CompositeElement`: Any text element will become a `CompositeElement` after chunking. A composite element can be a
19-
combination of two or more original text elements that together fit within the **Max characters** length. It can also be a single
19+
combination of two or more original text elements that together fit within the max characters setting. It can also be a single
2020
element that doesn't leave room in the chunk for any others but fits by itself. Or it can be a fragment of an original
2121
text element that was too big to fit in one chunk and required splitting.
22-
- `Table`: A table element is not combined with other elements, and if it fits within **Max characters** it will remain as is.
23-
- `TableChunk`: Large tables that exceed **Max characters** are split into special `TableChunk` elements.
22+
- `Table`: A table element is not combined with other elements, and if it fits within the max characters setting it will remain as is.
23+
- `TableChunk`: Large tables that exceed the max characters setting are split into special `TableChunk` elements.
2424

2525
Here are a few examples:
2626

@@ -65,17 +65,70 @@ The following sections provide information about the available chunking strategi
6565

6666
## Basic chunking strategy
6767

68-
The basic chunking strategy uses only **Max characters** and **New after n characters** to combine sequential elements to maximally fill each chunk.
69-
This strategy does not use section boundaries, page boundaries, or content similarities to determine the chunks' contents.
68+
The basic chunking strategy uses only the [max characters](#max-characters-setting) setting (an absolute or "hard" limit) and
69+
the [new after n characters](#new-after-n-characters-setting) setting (an approximate or "soft" limit) to combine sequential elements to maximally
70+
fill each chunk.
71+
72+
This strategy adds elements to a chunk until the new after n characters limit is reached. A new chunk is then started.
73+
No chunk will exceed the max characters limit. For elements larger than the max characters limit, the text is split into
74+
multiple chunks at spaces or new lines to avoid cutting words.
75+
76+
Table elements are always treated as standalone chunks. If a table is too large, the table is chunked by rows.
77+
78+
This strategy does not use section boundaries, page boundaries, or content similarities to determine
79+
the chunks' contents.
80+
81+
The following diagram illustrates conceptually how a candidate element is chunked to fit within the max characters (hard) and
82+
new after n characters (soft) limits:
83+
84+
![Chunking with hard and soft limits](/img/chunking/Chunking_Soft_Hard_Limits.png)
85+
86+
Context between chunks can be maintained by using the [overlap](#overlap-setting) and [overlap all](#overlap-all-setting) settings.
87+
The overlap setting repeats the specified number of characters from the end of the previous chunk at the beginning of the next chunk.
88+
By default, overlap is applied only to relatively large elements (those that are split across chunks because they exceed the max characters setting). If overlap all is set to true, the overlap is applied to all chunks, regardless of element size.
89+
90+
The overlap setting is based on the number of characters, so words might be split.
91+
The overlap setting's character count is included in the chunk size, so the chunk's total size, overlap included, must not exceed the max characters setting.
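
As a rough illustration of how character-based overlap interacts with the max characters limit, the following Python sketch splits one oversized text into overlapping windows. The helper name `split_with_overlap` is hypothetical, the split is purely by character count (so words can be cut), and the overlap all case for ordinary chunks is not modeled here.

```python
def split_with_overlap(text: str, max_characters: int, overlap: int) -> list[str]:
    """Split text into windows of at most max_characters that share `overlap` characters."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and smaller than max_characters")
    pieces = []
    start = 0
    # Each new window starts `overlap` characters before the previous one ended,
    # so the repeated characters count toward the window's size.
    step = max_characters - overlap
    while start < len(text):
        pieces.append(text[start : start + max_characters])
        if start + max_characters >= len(text):
            break
        start += step
    return pieces


if __name__ == "__main__":
    for piece in split_with_overlap("abcdefghijklmnopqrstuvwxyz", max_characters=10, overlap=3):
        print(piece)
```

Because the overlap is taken out of each window's budget, a larger overlap means less new text per chunk and therefore more chunks for the same input.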
92+
93+
The following diagram illustrates conceptually how chunks are calculated by setting overlap all to true or false. In this diagram,
94+
setting overlap all to true results in a portion at the end of each chunk always being copied over to the beginning of the next chunk.
95+
The default (or setting overlap all to false) results in only a portion at the end of Element 6 Part 1 in Chunk 2 being copied over
96+
to the beginning of Element 6 Part 2 in Chunk 3, because Element 6 is larger than the max characters setting:
97+
98+
![Chunking with overlap all set to true or false](/img/chunking/Chunking_Overlap_All.png)
7099

71100
To use this chunking strategy, choose **Chunk by character** in the **Chunkers** section of a **Chunker** node in a workflow.
72101

73102
## Chunk by title strategy
74103

75-
The by-title chunking strategy attempts to preserve section boundaries when determining the chunks' contents.
104+
The by-title chunking strategy attempts to preserve section boundaries when determining the chunks' contents, primarily when
105+
a **Title** element is encountered. The title is used as the section header for the chunk. The max characters and new after n
106+
characters settings are still respected.
107+
108+
The following diagram illustrates conceptually how elements are chunked when **Title** elements are encountered (see
109+
Chunks 1, 4, and 6), while still respecting the max characters and new after n characters settings (see Chunks 2 and 3):
110+
111+
![Chunking by title](/img/chunking/Chunking_By_Title.png)
112+
76113
A single chunk should not contain text that occurred in two different sections. When a new section starts, the existing
77114
chunk is closed and a new one is started, even if the next element would fit in the prior chunk.
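
The section-boundary rule can be sketched in a few lines of Python. This is an illustrative simplification, not Unstructured's code: elements are modeled as hypothetical `(type, text)` tuples, the function name `sketch_chunk_by_title` and the `"\n\n"` separator are assumptions, and neither the soft limit nor the splitting of oversized elements is modeled.

```python
def sketch_chunk_by_title(
    elements: list[tuple[str, str]],
    max_characters: int = 500,
    separator: str = "\n\n",
) -> list[str]:
    """Group element texts into chunks, closing the current chunk at every Title."""
    chunks: list[str] = []
    current: list[str] = []
    for kind, text in elements:
        would_be = len(separator.join(current + [text]))
        # Close the chunk when a new section starts, even if the element would fit,
        # or when adding the element would exceed the hard limit.
        if current and (kind == "Title" or would_be > max_characters):
            chunks.append(separator.join(current))
            current = []
        current.append(text)
    if current:
        chunks.append(separator.join(current))
    return chunks


if __name__ == "__main__":
    doc = [
        ("Title", "Intro"),
        ("NarrativeText", "Hello."),
        ("Title", "Methods"),
        ("NarrativeText", "Step one."),
        ("NarrativeText", "Step two."),
    ]
    for chunk in sketch_chunk_by_title(doc, max_characters=20):
        print(repr(chunk))
```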
78115

116+
The following conceptual diagram illustrates this point, in that many **Title** elements can produce many relatively small chunks:
117+
118+
![Many titles can lead to many chunks by title](/img/chunking/Chunking_By_Title_Segmentation.png)
119+
120+
To reduce the number of chunks, you can use the [combine text under n characters](#combine-text-under-n-characters-setting) setting. This
121+
setting attempts to combine elements into a single chunk until the combine text under n characters limit is reached, as shown in the
122+
following conceptual diagram:
123+
124+
![Chunking with combine text under n characters](/img/chunking/Chunking_Combine_Text.png)
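
One way to picture the combining step is as a second pass over the per-section chunks. The sketch below is an assumption about the behavior described above, not Unstructured's implementation; the helper name `combine_small_chunks` and the `"\n\n"` separator are hypothetical.

```python
def combine_small_chunks(
    chunks: list[str],
    combine_text_under_n_chars: int,
    max_characters: int,
    separator: str = "\n\n",
) -> list[str]:
    """Merge each chunk into the previous one while that one is still under the threshold."""
    combined: list[str] = []
    for chunk in chunks:
        if combined:
            merged = combined[-1] + separator + chunk
            # Merge only while the previous chunk is below the combine threshold
            # and the merged result still respects the hard limit.
            if len(combined[-1]) < combine_text_under_n_chars and len(merged) <= max_characters:
                combined[-1] = merged
                continue
        combined.append(chunk)
    return combined


if __name__ == "__main__":
    sections = ["Title A\n\nshort", "Title B\n\nshort", "Title C\n\n" + "x" * 80]
    print(len(combine_small_chunks(sections, combine_text_under_n_chars=40, max_characters=100)))
```

With a threshold of 0 nothing is merged, which corresponds to one chunk per section.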
125+
126+
Setting combine text under n characters to a value equal to or greater than the new after n characters setting is not recommended, as it
127+
can result in substantially longer chunks overall and can also push titles by themselves into previous chunks. The following conceptual
128+
diagram illustrates this point:
129+
130+
![Chunking with combine text under n characters issue](/img/chunking/Chunking_Combine_Text_Limits.png)
131+
79132
To use this chunking strategy, choose **Chunk by title** in the **Chunkers** section of a **Chunker** node in a workflow.
80133

81134
## Chunk by page strategy
@@ -86,7 +139,7 @@ chunk is closed and a new one is started, even if the next element would fit in
86139

87140
To use this chunking strategy, choose **Chunk by page** in the **Chunkers** section of a **Chunker** node in a workflow.
88141

89-
## Chunk By similarity strategy
142+
## Chunk by similarity strategy
90143

91144
The by-similarity chunking strategy uses the
92145
[sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model
