Merged
11 changes: 11 additions & 0 deletions docs/docs/ops/functions.md
@@ -26,6 +26,17 @@ Input data:

* `text` (type: `str`, required): The text to split.
* `chunk_size` (type: `int`, required): The maximum size of each chunk, in bytes.
* `min_chunk_size` (type: `int`, optional): The minimum size of each chunk, in bytes. Defaults to `chunk_size / 2` if not provided.

:::note

`SplitRecursively` makes a best effort to keep each output chunk between `min_chunk_size` and `chunk_size`.
However, in rare cases some chunks may be smaller than `min_chunk_size` or larger than `chunk_size`, e.g. when the input text is too short or when a large piece of text cannot be split.

Avoid setting `min_chunk_size` too close to `chunk_size`, so that the function has more room to plan the optimal chunking.

:::

* `chunk_overlap` (type: `int`, optional): The maximum overlap size between adjacent chunks, in bytes.
* `language` (type: `str`, optional): The language of the document.
Can be a language name (e.g. `Python`, `Javascript`, `Markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).
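
To illustrate the size parameters described above, here is a minimal sketch of how the default for `min_chunk_size` and the "in bytes" measurement might be reasoned about. Both helpers, `effective_min_chunk_size` and `utf8_size`, are hypothetical and are not part of cocoindex; the sketch assumes that `chunk_size / 2` rounds down and that sizes are counted in UTF-8 bytes, neither of which is stated in the docs.

```python
from typing import Optional


def utf8_size(text: str) -> int:
    """Return the size of `text` in bytes, assuming UTF-8 encoding (assumption)."""
    return len(text.encode("utf-8"))


def effective_min_chunk_size(
    chunk_size: int, min_chunk_size: Optional[int] = None
) -> int:
    """Hypothetical helper: resolve the minimum chunk size.

    Mirrors the documented default of `chunk_size / 2` when `min_chunk_size`
    is omitted. Rounding down is an assumption of this sketch.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if min_chunk_size is None:
        return chunk_size // 2
    return min_chunk_size


if __name__ == "__main__":
    # Default: half of chunk_size.
    print(effective_min_chunk_size(1000))
    # Explicit value, as in the code_embedding example (chunk_size=1000, min_chunk_size=300).
    print(effective_min_chunk_size(1000, 300))
    # Byte size can differ from character count for non-ASCII text.
    print(len("héllo"), utf8_size("héllo"))
```

With `chunk_size=1000` and `min_chunk_size=300`, as in the example flow in this change, the splitter has a wide range to plan within, which is in line with the advice in the note above.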
3 changes: 2 additions & 1 deletion examples/code_embedding/main.py
@@ -27,7 +27,7 @@ def code_to_embedding(
@cocoindex.flow_def(name="CodeEmbedding")
def code_embedding_flow(
flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
):
) -> None:
"""
Define an example flow that embeds files into a vector database.
"""
@@ -46,6 +46,7 @@ def code_embedding_flow(
cocoindex.functions.SplitRecursively(),
language=file["extension"],
chunk_size=1000,
min_chunk_size=300,
chunk_overlap=300,
)
with file["chunks"].row() as chunk: