Commit c9ccbcd
feat(chunking): Improve RAG chunking with embedded tokenizer and table support
v1.0.3 release with major chunking improvements:
Embedded Tokenizer:
- Bundle sentence-transformers/all-MiniLM-L6-v2 tokenizer in binary
- Add HuggingFaceTokenizer::default_embedded() for zero-config usage
- Fix truncation issue that was limiting token counts to 128
Chunking Defaults:
- Change default strategy from hierarchical to hybrid
- Reduce default max_tokens from 256 to 128 for better RAG granularity
- Set merge threshold to 75% of max_tokens
Table Chunking (BUG FIX):
- Fix CSV/XLSX producing 0 chunks
- Add rows_as_chunks() method to serialize table rows
- Format: "Name, Column = Value. Name, Column2 = Value2."
Markdown Output:
- Add structured TableData with to_markdown() formatting
- Improve table rendering with proper column alignment
CLI Updates:
- Rename --chunk-size to --chunk-max-tokens
- Add --tokenizer flag for custom HuggingFace models
- Update help text with new defaults
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>1 parent 0c30861 commit c9ccbcd
File tree
20 files changed
+459
-202
lines changed- crates
- docling-rs-cli
- src/cli
- tests
- docling-rs-core
- assets
- src
- chunking
- tokenizer
- datamodel
- docling-rs-formats
- csv/src
- xlsx/src
- docling-rs/src
- manual
20 files changed
+459
-202
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
14 | 14 | | |
15 | 15 | | |
16 | 16 | | |
17 | | - | |
| 17 | + | |
18 | 18 | | |
19 | 19 | | |
20 | 20 | | |
| |||
24 | 24 | | |
25 | 25 | | |
26 | 26 | | |
27 | | - | |
28 | | - | |
29 | | - | |
30 | | - | |
31 | | - | |
32 | | - | |
33 | | - | |
34 | | - | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
35 | 35 | | |
36 | 36 | | |
37 | 37 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
27 | 27 | | |
28 | 28 | | |
29 | 29 | | |
30 | | - | |
| 30 | + | |
31 | 31 | | |
32 | 32 | | |
33 | 33 | | |
| |||
90 | 90 | | |
91 | 91 | | |
92 | 92 | | |
93 | | - | |
94 | | - | |
| 93 | + | |
| 94 | + | |
95 | 95 | | |
96 | 96 | | |
97 | 97 | | |
| |||
148 | 148 | | |
149 | 149 | | |
150 | 150 | | |
151 | | - | |
| 151 | + | |
152 | 152 | | |
153 | 153 | | |
154 | | - | |
| 154 | + | |
155 | 155 | | |
156 | 156 | | |
157 | 157 | | |
158 | 158 | | |
159 | 159 | | |
160 | | - | |
161 | | - | |
162 | | - | |
163 | | - | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
| 166 | + | |
164 | 167 | | |
165 | 168 | | |
166 | | - | |
167 | | - | |
168 | | - | |
169 | | - | |
170 | | - | |
| 169 | + | |
| 170 | + | |
171 | 171 | | |
172 | 172 | | |
173 | 173 | | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
| 180 | + | |
174 | 181 | | |
175 | 182 | | |
176 | 183 | | |
| |||
184 | 191 | | |
185 | 192 | | |
186 | 193 | | |
187 | | - | |
188 | | - | |
189 | | - | |
| 194 | + | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
190 | 198 | | |
191 | 199 | | |
192 | 200 | | |
| |||
198 | 206 | | |
199 | 207 | | |
200 | 208 | | |
201 | | - | |
| 209 | + | |
202 | 210 | | |
203 | 211 | | |
204 | | - | |
205 | | - | |
| 212 | + | |
| 213 | + | |
| 214 | + | |
| 215 | + | |
| 216 | + | |
206 | 217 | | |
207 | | - | |
208 | | - | |
| 218 | + | |
| 219 | + | |
209 | 220 | | |
210 | 221 | | |
211 | 222 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
58 | 58 | | |
59 | 59 | | |
60 | 60 | | |
| 61 | + | |
61 | 62 | | |
62 | 63 | | |
63 | 64 | | |
64 | | - | |
| 65 | + | |
65 | 66 | | |
66 | 67 | | |
67 | 68 | | |
68 | | - | |
69 | | - | |
| 69 | + | |
| 70 | + | |
70 | 71 | | |
71 | 72 | | |
72 | 73 | | |
73 | 74 | | |
74 | 75 | | |
75 | 76 | | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
76 | 82 | | |
77 | 83 | | |
78 | 84 | | |
| |||
154 | 160 | | |
155 | 161 | | |
156 | 162 | | |
157 | | - | |
158 | 163 | | |
159 | | - | |
| 164 | + | |
| 165 | + | |
160 | 166 | | |
161 | 167 | | |
162 | 168 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
4 | 4 | | |
5 | 5 | | |
6 | 6 | | |
7 | | - | |
| 7 | + | |
8 | 8 | | |
9 | 9 | | |
10 | 10 | | |
| |||
359 | 359 | | |
360 | 360 | | |
361 | 361 | | |
362 | | - | |
| 362 | + | |
| 363 | + | |
| 364 | + | |
| 365 | + | |
| 366 | + | |
| 367 | + | |
| 368 | + | |
| 369 | + | |
| 370 | + | |
| 371 | + | |
| 372 | + | |
| 373 | + | |
| 374 | + | |
| 375 | + | |
| 376 | + | |
| 377 | + | |
| 378 | + | |
| 379 | + | |
| 380 | + | |
| 381 | + | |
| 382 | + | |
363 | 383 | | |
364 | | - | |
| 384 | + | |
365 | 385 | | |
366 | 386 | | |
367 | 387 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
| 2 | + | |
| 3 | + | |
2 | 4 | | |
3 | 5 | | |
4 | | - | |
| 6 | + | |
5 | 7 | | |
6 | 8 | | |
7 | 9 | | |
8 | | - | |
9 | | - | |
10 | | - | |
11 | | - | |
12 | | - | |
13 | | - | |
14 | | - | |
15 | | - | |
16 | | - | |
17 | | - | |
18 | | - | |
19 | | - | |
20 | | - | |
21 | | - | |
22 | | - | |
23 | | - | |
24 | | - | |
25 | | - | |
26 | | - | |
27 | | - | |
28 | | - | |
29 | | - | |
30 | | - | |
31 | | - | |
32 | | - | |
| 10 | + | |
33 | 11 | | |
34 | 12 | | |
35 | 13 | | |
36 | 14 | | |
37 | | - | |
| 15 | + | |
38 | 16 | | |
39 | 17 | | |
40 | 18 | | |
41 | 19 | | |
42 | | - | |
43 | | - | |
44 | | - | |
45 | | - | |
46 | | - | |
47 | | - | |
48 | | - | |
49 | | - | |
50 | | - | |
51 | | - | |
52 | | - | |
53 | | - | |
54 | | - | |
55 | | - | |
56 | | - | |
| 20 | + | |
57 | 21 | | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
192 | 192 | | |
193 | 193 | | |
194 | 194 | | |
195 | | - | |
| 195 | + | |
196 | 196 | | |
197 | 197 | | |
198 | 198 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
64 | 64 | | |
65 | 65 | | |
66 | 66 | | |
67 | | - | |
| 67 | + | |
68 | 68 | | |
69 | 69 | | |
70 | 70 | | |
71 | 71 | | |
72 | 72 | | |
73 | 73 | | |
74 | | - | |
| 74 | + | |
75 | 75 | | |
76 | 76 | | |
77 | 77 | | |
78 | 78 | | |
79 | | - | |
| 79 | + | |
80 | 80 | | |
81 | 81 | | |
82 | 82 | | |
| |||
Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.
0 commit comments