Merged
47 changes: 43 additions & 4 deletions docs/docs/ops/functions.md
@@ -31,7 +31,7 @@ The spec takes the following fields:

* `separators_regex` (`list[str]`): A list of regex patterns to split the text.
Higher-level boundaries should come first, and lower-level should be listed later. e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
-See [regex Syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
+See [regex syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.

Input data:

@@ -57,9 +57,12 @@ Input data:

We use the `language` field to determine how to split the input text, following these rules:

-* We'll match the input `language` field against the `language_name` or `aliases` of each element of `custom_languages`, and use the matched one. If value of `language` is null, it'll be treated as empty string when matching `language_name` or `aliases`.
-* If no match is found, we'll match the `language` field against the builtin language configurations.
-  For all supported builtin language names and aliases (extensions), see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
+* We match the input `language` field against the following registries in the following order:
+  * `custom_languages` in the spec, against the `language_name` or `aliases` field of each entry.
+  * Builtin languages (see [Supported Languages](#supported-languages) section below), against the language, aliases or file extensions of each entry.
+
+  All matches are in a case-insensitive manner. If the value of `language` is null, it'll be treated as empty string.
+
* If no match is found, the input will be treated as plain text.

:::
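
The lookup order described in the rules above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the actual Rust implementation: `CustomLanguage`, `BUILTIN`, and `resolve_language` are hypothetical names introduced here, and `BUILTIN` holds just a small excerpt of the builtin language table.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for one entry of `custom_languages`; only the two
# fields named in the docs (`language_name`, `aliases`) are modeled.
@dataclass
class CustomLanguage:
    language_name: str
    aliases: list[str] = field(default_factory=list)


# Illustrative excerpt of the builtin table: name -> (aliases, file extensions).
BUILTIN = {
    "C#": (["CSharp", "CS"], [".cs"]),
    "Python": ([], [".py"]),
    "Markdown": (["MD"], [".md", ".mdx"]),
}


def resolve_language(language, custom_languages):
    """Return ("custom", name), ("builtin", name), or ("plain_text", None)."""
    # A null `language` is treated as an empty string; matching ignores case.
    key = (language or "").lower()

    # 1. Custom languages from the spec are checked first.
    for lang in custom_languages:
        candidates = [lang.language_name, *lang.aliases]
        if key in (c.lower() for c in candidates):
            return ("custom", lang.language_name)

    # 2. Builtin languages: language name, aliases, or file extensions.
    for name, (aliases, extensions) in BUILTIN.items():
        candidates = [name, *aliases, *extensions]
        if key in (c.lower() for c in candidates):
            return ("builtin", name)

    # 3. No match: the input is treated as plain text.
    return ("plain_text", None)
```

Under these assumptions, a custom entry that declares `python` as an alias shadows the builtin Python configuration, because the custom registry is consulted first.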
@@ -73,6 +76,42 @@ Return: [*KTable*](/docs/core/data_types#ktable), each row represents a chunk, w
* `line` (*Int64*): The line number of the position. Starting from 1.
* `column` (*Int64*): The column number of the position. Starting from 1.

+### Supported Languages
+
+Currently, `SplitRecursively` supports the following languages:
+
+| Language | Aliases | File Extensions |
+|----------|---------|-----------------|
+| C | | `.c` |
+| C++ | CPP | `.cpp`, `.cc`, `.cxx`, `.h`, `.hpp` |
+| C# | CSharp, CS | `.cs` |
+| CSS | | `.css`, `.scss` |
+| DTD | | `.dtd` |
+| Fortran | F, F90, F95, F03 | `.f`, `.f90`, `.f95`, `.f03` |
+| Go | Golang | `.go` |
+| HTML | | `.html`, `.htm` |
+| Java | | `.java` |
+| JavaScript | JS | `.js` |
+| JSON | | `.json` |
+| Kotlin | | `.kt`, `.kts` |
+| Markdown | MD | `.md`, `.mdx` |
+| Pascal | PAS, DPR, Delphi | `.pas`, `.dpr` |
+| PHP | | `.php` |
+| Python | | `.py` |
+| R | | `.r` |
+| Ruby | | `.rb` |
+| Rust | RS | `.rs` |
+| Scala | | `.scala` |
+| SQL | | `.sql` |
+| Swift | | `.swift` |
+| TOML | | `.toml` |
+| TSX | | `.tsx` |
+| TypeScript | TS | `.ts` |
+| XML | | `.xml` |
+| YAML | | `.yaml`, `.yml` |
+
+
+
## SentenceTransformerEmbed

`SentenceTransformerEmbed` embeds a text into a vector space using the [SentenceTransformer](https://huggingface.co/sentence-transformers) library.
2 changes: 1 addition & 1 deletion src/ops/functions/split_recursively.rs
@@ -108,7 +108,7 @@ static TREE_SITTER_LANGUAGE_BY_LANG: LazyLock<
add_treesitter_language(
&mut map,
"C#",
-[".cs", "cs"],
+[".cs", "cs", "csharp"],
tree_sitter_c_sharp::LANGUAGE,
[],
);