ui/chunking.mdx
36 additions & 1 deletion
@@ -83,9 +83,21 @@ new after n characters (soft) limits:
 
 
 
+The following two diagrams show how a basic chunking strategy with a max characters setting of 200 would chunk the following text and table elements.
+
+In this first diagram, each chunk of text gets as close as possible to the 200-character hard limit without going over, and lexical constructs such as sentence endings are not recognized:
+
+
+
+In this second diagram, each chunk for the table also gets as close as possible to the 200-character hard limit without going over. However, for tables,
+row endings are also considered in determining chunk boundaries. For this table, the first chunk is close to the 200-character hard limit and also a row ending.
+The second chunk is well short of the 200-character hard limit because of a row (and, in this case, also the table) ending:
+
+
+
 Context between chunks can be maintained by using the [overlap](#overlap-setting) and [overlap all](#overlap-all-setting) settings.
 The overlap setting repeats the specified number of characters from the end of the previous chunk at the beginning of the next chunk.
-By default, overlap all is applied only to relatively large elements If overlap all is set to true, the overlap is applied to all chunks, regardless.
+By default, overlap all is applied only to relatively large elements. If overlap all is set to true, the overlap is applied to all chunks, regardless.
 
 The overlap setting is based on the number of characters, so words might be split.
 The overlap setting's character count is included in the chunk size; nonetheless, the chunk's total size must not exceed the max characters setting.
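The interaction between the hard limit and the overlap described in this hunk can be sketched in a few lines of Python. This is a simplified illustration only, not the product's implementation: the function name `chunk_by_character` and its parameters are hypothetical, it works on one plain string rather than on document elements, and it applies the overlap to every chunk (as if overlap all were set to true).

```python
def chunk_by_character(text: str, max_characters: int = 200, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_characters (hypothetical helper).

    The last `overlap` characters of each chunk are repeated at the start of
    the next one, and those repeated characters count toward the hard limit.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and smaller than max_characters")

    chunks: list[str] = []
    step = max_characters - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        # Cut purely by character count, so a word may be split in the middle.
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 5  # 50 characters
    for chunk in chunk_by_character(sample, max_characters=20, overlap=5):
        print(len(chunk), chunk)
```

Because the overlap is counted inside the limit, each chunk after the first contributes only `max_characters - overlap` new characters, which is why a larger overlap produces more chunks for the same text.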
@@ -97,6 +109,11 @@ to the beginning of Element 6 Part 2 in Chunk 3, because Element 6 is larger tha
 
 
 
+The following diagram shows how a basic chunking strategy with a max characters setting of 200, an overlap of 25 characters, and
+overlap all set to true would chunk the following text. Note that some of the text is split in the middle of a word:
+
+
+
 To use this chunking strategy, choose **Chunk by character** in the **Chunkers** section of a **Chunker** node in a workflow.
 
 ## Chunk by title strategy
@@ -129,6 +146,12 @@ diagram illustrates this point:
 
 
 
+The following diagram shows how a chunk by title strategy with a max characters setting of 200 would chunk the following text.
+Although the first chunk is close to the 200-character hard limit, the second chunk is well short of this limit due to encountering the
+title immediately after it, which starts a new chunk:
+
+
+
 To use this chunking strategy, choose **Chunk by title** in the **Chunkers** section of a **Chunker** node in a workflow.
 
 ## Chunk by page strategy
@@ -137,6 +160,12 @@ The by-page chunking strategy attempts to preserve page boundaries when determin
 A single chunk should not contain text that occurred on two different pages. When a new page starts, the existing
 chunk is closed and a new one is started, even if the next element would fit in the prior chunk.
 
+The following diagram shows how a chunk by page strategy with a max characters setting of 200 would chunk the following text.
+Notice that due to the page break, the second chunk is very small, as it could not fit into the first chunk's hard character limit.
+Nonetheless, the second chunk is still part of the same page as the first chunk:
+
+
+
 To use this chunking strategy, choose **Chunk by page** in the **Chunkers** section of a **Chunker** node in a workflow.
 
 ## Chunk by similarity strategy
@@ -153,6 +182,12 @@ To use this chunking strategy, choose **Chunk by similarity** in the **Chunkers*
 
 You can control the level of topic similarity you require for elements to have by setting [Similarity threshold](#similarity-threshold).
 
+The following diagram shows how a chunk by similarity strategy with a max characters setting of 1000 and similarity threshold of 0.5 would chunk the following text.
+Notice that the two chunks are well short of the 1000-character hard limit, as the paragraph break introduces a convenient lexical construct for
+helping determine the similarities of sentences to each other:
+
+
+
 ## Max characters setting
 
 Specifies the absolute maximum number of characters in a chunk.
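As a rough illustration of how this hard limit might combine with a boundary rule such as the by-page strategy described earlier, the following Python sketch closes the current chunk either when the page changes or when adding the next element would exceed the limit. Everything here is an assumption made for illustration: the `Element` class, the `chunk_by_page` function, and the single-space separator are hypothetical and are not taken from the product.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a document element with a page number."""
    text: str
    page_number: int


def chunk_by_page(elements: list[Element], max_characters: int = 200,
                  separator: str = " ") -> list[str]:
    """Group elements into chunks that never cross a page boundary
    and never exceed max_characters (illustrative sketch only)."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")

    chunks: list[str] = []
    current = ""
    current_page = None
    for element in elements:
        # An element longer than the hard limit is cut into pieces first.
        for start in range(0, len(element.text), max_characters):
            piece = element.text[start:start + max_characters]
            candidate = piece if not current else current + separator + piece
            new_page = current_page is not None and element.page_number != current_page
            if current and (new_page or len(candidate) > max_characters):
                # Close the chunk on a page change, even if the piece would fit.
                chunks.append(current)
                current = piece
            else:
                current = candidate
            current_page = element.page_number
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    elements = [
        Element("a" * 50, page_number=1),
        Element("b" * 30, page_number=1),
        Element("c" * 10, page_number=2),
    ]
    for chunk in chunk_by_page(elements, max_characters=100):
        print(len(chunk), chunk)
```

In the usage example, the third element would fit within the 100-character limit of the first chunk, but it starts a new chunk anyway because it belongs to a different page.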