Commit 1702504

docs(chunking): document how to customize the way to split using regex (#585)
1 parent 1333d0d commit 1702504

1 file changed: +21 -3 lines changed


docs/docs/ops/functions.md

Lines changed: 21 additions & 3 deletions
@@ -39,9 +39,27 @@ Input data:
 
 * `chunk_overlap` (type: `int`, optional): The maximum overlap size between adjacent chunks, in bytes.
 * `language` (type: `str`, optional): The language of the document.
-Can be a langauge name (e.g. `Python`, `Javascript`, `Markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).
-To see all supported language names and extensions, see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
-If it's unspecified or the specified language is not supported, it will be treated as plain text.
+Can be a language name (e.g. `Python`, `Javascript`, `Markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).
+
+* `custom_languages` (type: `list[CustomLanguageSpec]`, optional): This allows you to customize the way to chunking specific languages using regular expressions. Each `CustomLanguageSpec` is a dict with the following fields:
+* `language_name` (type: `str`, required): Name of the language.
+* `aliases` (type: `list[str]`, optional): A list of aliases for the language.
+It's an error if any language name or alias is duplicated.
+
+* `separators_regex` (type: `list[str]`, required): A list of regex patterns to split the text.
+Higher-level boundaries should come first, and lower-level should be listed later. e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
+See [regex Syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
+
+:::note
+
+We use the `language` field to determine how to split the input text, following these rules:
+
+* We'll match the input `language` field against the `language_name` or `aliases` of each custom language specification, and use the matched one. If value of `language` is null, it'll be treated as empty string when matching `language_name` or `aliases`.
+* If no match is found, we'll match the `language` field against the builtin language configurations.
+For all supported builtin language names and aliases (extensions), see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
+* If no match is found, the input will be treated as plain text.
+
+:::
 
 Return type: [KTable](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:
 
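To make the documented `custom_languages` fields and the three-step lookup in the note concrete, here is a minimal Python sketch. It does not call the library itself. The `MyNotes` language, the `BUILTIN_LANGUAGES` table, and both helper functions are hypothetical, and exact (case-sensitive) string matching is an assumption, since the diff does not say how names are compared.

```python
from typing import Optional

# Hypothetical stand-in for the builtin table. The names and extensions are the
# examples given in the docs; the real list lives in the linked Rust source.
BUILTIN_LANGUAGES = {
    "Python": "Python", ".py": "Python",
    "Javascript": "Javascript", ".js": "Javascript",
    "Markdown": "Markdown", ".md": "Markdown",
}

# A custom language spec shaped like the documented dict fields.
custom_languages = [
    {
        "language_name": "MyNotes",  # hypothetical language
        "aliases": [".notes"],
        # Higher-level boundaries first, lower-level ones later.
        "separators_regex": [r"\n# ", r"\n## ", r"\n\n", r"\. "],
    },
]


def build_custom_index(specs: list[dict]) -> dict[str, dict]:
    """Map every language_name and alias to its spec, rejecting duplicates."""
    index: dict[str, dict] = {}
    for spec in specs:
        for name in [spec["language_name"], *spec.get("aliases", [])]:
            if name in index:
                raise ValueError(f"duplicate language name or alias: {name!r}")
            index[name] = spec
    return index


def resolve_language(
    language: Optional[str], custom_index: dict[str, dict]
) -> tuple[str, Optional[str]]:
    """Apply the documented order: custom specs, then builtins, then plain text."""
    # A null language is treated as an empty string when matching custom specs.
    key = "" if language is None else language
    if key in custom_index:
        return ("custom", custom_index[key]["language_name"])
    if language in BUILTIN_LANGUAGES:
        return ("builtin", BUILTIN_LANGUAGES[language])
    return ("plain_text", None)


index = build_custom_index(custom_languages)
print(resolve_language(".notes", index))
print(resolve_language(".py", index))
print(resolve_language(None, index))
```

Building the index once up front is where the "duplicated name or alias" error surfaces naturally, before any document is processed.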
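The ordering rule for `separators_regex` (higher-level boundaries first) can be illustrated with a simplified recursive splitter. This is a sketch under stated assumptions, not the library's algorithm: the function name `split_by_separators` and its `chunk_size` parameter are hypothetical, sizes are counted in characters rather than bytes, overlap and merging of small pieces are ignored, and Python's `re` module stands in for the Rust `regex` crate, whose syntax differs in places.

```python
import re


def split_by_separators(
    text: str, separators_regex: list[str], chunk_size: int
) -> list[str]:
    """Split text at the highest-level separator, recursing into oversized pieces."""
    # Stop when the piece fits or no separators remain (it may stay oversized).
    if len(text) <= chunk_size or not separators_regex:
        return [text] if text else []
    pattern, lower_levels = separators_regex[0], separators_regex[1:]
    # Cut at the start of each match so the separator stays with the piece after it.
    cuts = [m.start() for m in re.finditer(pattern, text) if m.start() > 0]
    bounds = [0, *cuts, len(text)]
    chunks: list[str] = []
    for start, end in zip(bounds, bounds[1:]):
        # Only the lower-level separators are tried inside each piece.
        chunks.extend(split_by_separators(text[start:end], lower_levels, chunk_size))
    return chunks


doc = "intro\n# A\nalpha beta.\n\ngamma delta.\n# B\nshort"
separators = [r"\n# ", r"\n## ", r"\n\n", r"\. "]
for chunk in split_by_separators(doc, separators, chunk_size=20):
    print(repr(chunk))
```

Because the coarse patterns are tried first, a short section stays whole, while a long one is broken at blank lines or sentence ends only when it exceeds the size limit.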