Commit b2d6de5

feat(chunking): implement holistic chunking mechanism (#575)
* feat(chunking): implement holistic chunking mechanism
* docs: update docs for `SplitRecursively`.
1 parent a0198c8 commit b2d6de5

3 files changed, +333 -105 lines changed

docs/docs/ops/functions.md

Lines changed: 11 additions & 0 deletions
@@ -26,6 +26,17 @@ Input data:
 
 * `text` (type: `str`, required): The text to split.
 * `chunk_size` (type: `int`, required): The maximum size of each chunk, in bytes.
+* `min_chunk_size` (type: `int`, optional): The minimum size of each chunk, in bytes. If not provided, defaults to `chunk_size / 2`.
+
+:::note
+
+`SplitRecursively` will do its best to keep the size of each output chunk between `min_chunk_size` and `chunk_size`.
+However, in rare cases some chunks may be smaller than `min_chunk_size` or larger than `chunk_size`, e.g. when the input text is too short or a large piece of text cannot be split.
+
+Avoid setting `min_chunk_size` to a value too close to `chunk_size`, so that the function has more room to plan the optimal chunking.
+
+:::
+
 * `chunk_overlap` (type: `int`, optional): The maximum overlap size between adjacent chunks, in bytes.
 * `language` (type: `str`, optional): The language of the document.
 Can be a language name (e.g. `Python`, `Javascript`, `Markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).
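
The size rules documented in this hunk can be sketched in a few lines of Python. This is only an illustration, not code from the commit: `resolve_min_chunk_size` and `out_of_range_chunks` are hypothetical helper names, rounding the `chunk_size / 2` default down with integer division is an assumption, and so is measuring "bytes" as the UTF-8 encoded length.

```python
from typing import List, Optional


def resolve_min_chunk_size(chunk_size: int, min_chunk_size: Optional[int] = None) -> int:
    """Hypothetical helper mirroring the documented default for `min_chunk_size`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if min_chunk_size is None:
        # The docs say `chunk_size / 2`; rounding down is an assumption.
        return chunk_size // 2
    return min_chunk_size


def out_of_range_chunks(
    chunks: List[str], chunk_size: int, min_chunk_size: Optional[int] = None
) -> List[str]:
    """Return the chunks whose byte length falls outside [min_chunk_size, chunk_size].

    Byte length is taken as the UTF-8 encoded length (an assumption).
    """
    lower = resolve_min_chunk_size(chunk_size, min_chunk_size)
    return [c for c in chunks if not lower <= len(c.encode("utf-8")) <= chunk_size]
```

Since the note says out-of-range chunks are possible in rare cases, a check like `out_of_range_chunks(chunks, chunk_size=1000, min_chunk_size=300)` is better suited to logging or inspection than to a hard assertion.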

examples/code_embedding/main.py

Lines changed: 2 additions & 1 deletion
@@ -27,7 +27,7 @@ def code_to_embedding(
 @cocoindex.flow_def(name="CodeEmbedding")
 def code_embedding_flow(
     flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
-):
+) -> None:
     """
     Define an example flow that embeds files into a vector database.
     """
@@ -46,6 +46,7 @@ def code_embedding_flow(
             cocoindex.functions.SplitRecursively(),
             language=file["extension"],
             chunk_size=1000,
+            min_chunk_size=300,
             chunk_overlap=300,
         )
         with file["chunks"].row() as chunk:
