Commit 4ff2439

Chunking overview: examples of chunking boundaries by strategy (#655)
1 parent fac5512 commit 4ff2439

7 files changed

+36 -1 lines changed

img/chunking/Chunk-By-Page-200.png


ui/chunking.mdx

Lines changed: 36 additions & 1 deletion
@@ -83,9 +83,21 @@ new after n characters (soft) limits:
8383

8484
![Chunking with hard and soft limits](/img/chunking/Chunking_Soft_Hard_Limits.png)
8585

86+
The following two diagrams show how a basic chunking strategy with a max characters setting of 200 would chunk the following text and table elements.
87+
88+
In this first diagram, each chunk of text gets as close as possible to the 200-character hard limit without going over, and lexical constructs such as sentence endings are not recognized:
89+
90+
![Basic chunking of text with a 200-character hard limit](/img/chunking/Chunk-By-Character-200-Paragraph.png)
91+
92+
In this second diagram, each chunk for the table also gets as close as possible to the 200-character hard limit without going over. However, for tables,
93+
row endings are also considered in determining chunk boundaries. For this table, the first chunk is close to the 200-character hard limit and also a row ending.
94+
The second chunk is well short of the 200-character hard limit because of a row (and, in this case, also the table) ending:
95+
96+
![Basic chunking of a table with a 200-character hard limit](/img/chunking/Chunk-By-Character-200-Table.png)
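As a rough illustration of the row-aware behavior described in the added text, the following Python sketch packs whole rows into a chunk until the next row would exceed the hard limit. It is not the product's implementation: the function name `chunk_table_rows`, the newline-joined row representation, and the fallback for oversized rows are all assumptions made for this example.

```python
def chunk_table_rows(rows: list[str], max_characters: int) -> list[str]:
    """Pack table rows into chunks, preferring row endings as boundaries.

    Hypothetical helper: rows are plain strings joined by newlines.
    """
    chunks: list[str] = []
    current = ""
    for row in rows:
        candidate = row if not current else f"{current}\n{row}"
        if len(candidate) <= max_characters:
            # The row still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            # Close the chunk at the last complete row.
            chunks.append(current)
        # Assumed fallback: a single row longer than the limit is sliced
        # so that the hard limit is never exceeded.
        while len(row) > max_characters:
            chunks.append(row[:max_characters])
            row = row[max_characters:]
        current = row
    if current:
        chunks.append(current)
    return chunks


rows = ["a" * 90, "b" * 90, "c" * 40]
print([len(chunk) for chunk in chunk_table_rows(rows, 200)])  # [181, 40]
```

In this toy input the first chunk ends at a row ending close to the limit, and the second chunk is well short of it because the table ends.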
97+
8698
Context between chunks can be maintained by using the [overlap](#overlap-setting) and [overlap all](#overlap-all-setting) settings.
8799
The overlap setting repeats the specified number of characters from the end of the previous chunk at the beginning of the next chunk.
88-
By default, overlap all is applied only to relatively large elements If overlap all is set to true, the overlap is applied to all chunks, regardless.
100+
By default, overlap all is applied only to relatively large elements. If overlap all is set to true, the overlap is applied to all chunks, regardless.
89101

90102
The overlap setting is based on the number of characters, so words might be split.
91103
The overlap setting's character count is included in the chunk size; nonetheless, the chunk's total size must not exceed the max characters setting.
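The interaction between the hard limit and the overlap can be sketched in a few lines of Python. This is a simplified model that works on one raw string, roughly the case of a single oversized element being split; the function name `chunk_by_character` and its parameters are assumptions for illustration, not the product's API.

```python
def chunk_by_character(text: str, max_characters: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_characters.

    The last `overlap` characters of each chunk are repeated at the start
    of the next chunk, and they count toward that chunk's size.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be at least 0 and less than max_characters")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + max_characters, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by the overlap so the next chunk begins with the tail
        # of this one; boundaries ignore words and sentences entirely.
        start = end - overlap
    return chunks


text = "".join(chr(97 + i % 26) for i in range(500))
chunks = chunk_by_character(text, max_characters=200, overlap=25)
print([len(chunk) for chunk in chunks])  # [200, 200, 150]
```

Because the overlap is counted inside the 200-character budget, each chunk after the first contributes only 175 new characters.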
@@ -97,6 +109,11 @@ to the beginning of Element 6 Part 2 in Chunk 3, because Element 6 is larger tha
97109

98110
![Chunking with overlap all set to true or false](/img/chunking/Chunking_Overlap_All.png)
99111

112+
The following diagram shows how a basic chunking strategy with a max characters setting of 200, an overlap of 25 characters, and
113+
overlap all set to true would chunk the following text. Note that some of the text is split in the middle of a word:
114+
115+
![Basic chunking of text with a 200-character hard limit, an overlap of 25 characters, and overlap all set to true](/img/chunking/Chunk-By-Character-200-Overlap-25.png)
116+
100117
To use this chunking strategy, choose **Chunk by character** in the **Chunkers** section of a **Chunker** node in a workflow.
101118

102119
## Chunk by title strategy
@@ -129,6 +146,12 @@ diagram illustrates this point:
129146

130147
![Chunking with combine text under n characters issue](/img/chunking/Chunking_Combine_Text_Limits.png)
131148

149+
The following diagram shows how a chunk by title strategy with a max characters setting of 200 would chunk the following text.
150+
Although the first chunk is close to the 200-character hard limit, the second chunk is well short of this limit due to encountering the
151+
title immediately after it, which starts a new chunk:
152+
153+
![Chunking by title with a 200-character hard limit](/img/chunking/Chunk-By-Title-200-Paragraph.png)
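The title-boundary rule can be sketched as follows. This is a minimal model under stated assumptions: elements are hypothetical `(kind, text)` tuples, a kind of `"Title"` marks a section start, every element already fits within the limit, and settings such as combine text under n characters are ignored.

```python
def chunk_by_title(elements: list[tuple[str, str]], max_characters: int) -> list[str]:
    """Group elements into chunks, starting a new chunk at every title.

    Assumes each element's text is no longer than max_characters.
    """
    chunks: list[str] = []
    current = ""
    for kind, text in elements:
        candidate = text if not current else f"{current}\n\n{text}"
        starts_section = kind == "Title"
        if current and (starts_section or len(candidate) > max_characters):
            # Close the open chunk: either a new section begins here or
            # adding this element would exceed the hard limit.
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


elements = [
    ("Title", "Intro"),
    ("NarrativeText", "a" * 150),
    ("NarrativeText", "b" * 30),
    ("Title", "Next"),
    ("NarrativeText", "c" * 50),
]
print([len(chunk) for chunk in chunk_by_title(elements, 200)])  # [189, 56]
```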
154+
132155
To use this chunking strategy, choose **Chunk by title** in the **Chunkers** section of a **Chunker** node in a workflow.
133156

134157
## Chunk by page strategy
@@ -137,6 +160,12 @@ The by-page chunking strategy attempts to preserve page boundaries when determin
137160
A single chunk should not contain text that occurred in two different pages. When a new page starts, the existing
138161
chunk is closed and a new one is started, even if the next element would fit in the prior chunk.
139162

163+
The following diagram shows how a chunk by page strategy with a max characters setting of 200 would chunk the following text.
164+
Notice that due to the page break, the second chunk is very small, as it could not fit into the first chunk's hard character limit.
165+
Nonetheless, the second chunk is still part of the same page as the first chunk:
166+
167+
![Chunking by page with a 200-character hard limit](/img/chunking/Chunk-By-Page-200.png)
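A small Python sketch of the page-boundary rule, assuming elements arrive as hypothetical `(page_number, text)` tuples and that each element fits within the limit. The name `chunk_by_page` is chosen for this example only.

```python
def chunk_by_page(
    elements: list[tuple[int, str]], max_characters: int
) -> list[tuple[int, str]]:
    """Group elements into (page_number, text) chunks that never span pages."""
    chunks: list[tuple[int, str]] = []
    current_page = None
    current = ""
    for page, text in elements:
        candidate = text if not current else f"{current}\n\n{text}"
        if current and (page != current_page or len(candidate) > max_characters):
            # A page change closes the chunk even when the element would
            # still fit under the hard limit.
            chunks.append((current_page, current))
            current = text
        else:
            current = candidate
        current_page = page
    if current:
        chunks.append((current_page, current))
    return chunks


elements = [(1, "a" * 120), (1, "b" * 60), (2, "c" * 10)]
print([(page, len(text)) for page, text in chunk_by_page(elements, 200)])
# [(1, 182), (2, 10)]
```

The 10-character element would fit in the first chunk (182 + 2 + 10 = 194), but it lands in its own chunk because it belongs to a different page.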
168+
140169
To use this chunking strategy, choose **Chunk by page** in the **Chunkers** section of a **Chunker** node in a workflow.
141170

142171
## Chunk by similarity strategy
@@ -153,6 +182,12 @@ To use this chunking strategy, choose **Chunk by similarity** in the **Chunkers*
153182

154183
You can control the level of topic similarity you require for elements to have by setting [Similarity threshold](#similarity-threshold).
155184

185+
The following diagram shows how a chunk by similarity strategy with a max characters setting of 1000 and similarity threshold of 0.5 would chunk the following text.
186+
Notice that the two chunks are well short of the 1000-character hard limit, as the paragraph break introduces a convenient lexical construct for
187+
helping determine the similarities of sentences to each other:
188+
189+
![Chunking by similarity with a 1000-character hard limit and 0.5 similarity threshold](/img/chunking/Chunk-By-Similarity-1000-50.png)
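One way to picture the role of the similarity threshold is the sketch below. It is heavily simplified and rests on assumptions not stated in the documentation: each element comes with a precomputed embedding vector (toy two-dimensional values here), similarity is cosine similarity, and each element is compared only with the element immediately before it.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_by_similarity(
    elements: list[tuple[str, list[float]]],
    max_characters: int,
    threshold: float,
) -> list[str]:
    """Merge consecutive elements while they stay similar and under the limit."""
    chunks: list[str] = []
    current = ""
    previous_vector = None
    for text, vector in elements:
        candidate = text if not current else f"{current}\n\n{text}"
        similar = (
            previous_vector is not None
            and cosine(previous_vector, vector) >= threshold
        )
        if current and similar and len(candidate) <= max_characters:
            current = candidate
        else:
            # Either the topic shifted or the hard limit would be exceeded.
            if current:
                chunks.append(current)
            current = text
        previous_vector = vector
    if current:
        chunks.append(current)
    return chunks


elements = [
    ("Cats purr.", [1.0, 0.0]),
    ("Cats nap often.", [0.9, 0.1]),
    ("Stocks fell.", [0.0, 1.0]),
    ("Bonds rallied.", [0.1, 0.9]),
]
print(len(chunk_by_similarity(elements, max_characters=1000, threshold=0.5)))  # 2
```

Both resulting chunks are far below the 1000-character limit; the boundary comes from the drop in similarity, not from the size budget.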
190+
156191
## Max characters setting
157192

158193
Specifies the absolute maximum number of characters in a chunk.
