Commit 76b1d9e

docs(split-recursively): explicitly document supported languages (#773)
* docs(split-recursively): explicitly document supported languages
* docs(language): use more human readable cases
1 parent 56607b0 commit 76b1d9e

2 files changed: +44 -5 lines changed

docs/docs/ops/functions.md

Lines changed: 43 additions & 4 deletions
@@ -31,7 +31,7 @@ The spec takes the following fields:
 
 * `separators_regex` (`list[str]`): A list of regex patterns to split the text.
   Higher-level boundaries should come first, and lower-level should be listed later. e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
-  See [regex Syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
+  See [regex syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
 
 Input data:
 
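The ordering rule for `separators_regex` (higher-level boundaries first) can be illustrated with a minimal sketch. This is not the `SplitRecursively` implementation: it uses Python's `re` module as a stand-in for the Rust regex engine, and `split_by_separators` and `max_chars` are hypothetical names introduced only for this example. It also drops the separators and does no chunk merging or overlap.

```python
import re


def split_by_separators(text: str, separators: list[str], max_chars: int) -> list[str]:
    """Split on the first (highest-level) separator, then recurse with the
    remaining separators only for pieces that are still longer than max_chars."""
    if len(text) <= max_chars or not separators:
        return [text]
    head, *rest = separators
    chunks: list[str] = []
    for piece in re.split(head, text):
        if piece:  # skip empty pieces produced by adjacent separators
            chunks.extend(split_by_separators(piece, rest, max_chars))
    return chunks


# The example pattern list from the documentation, highest level first.
separators = [r"\n# ", r"\n## ", r"\n\n", r"\. "]
doc = (
    "Intro text.\n# A\nShort.\n# B\nFirst paragraph is here."
    "\n\nSecond paragraph. It has two sentences"
)
chunks = split_by_separators(doc, separators, max_chars=30)
```

In this sketch the section under `# B` is too long, so it falls through to the paragraph separator and then to the sentence separator, while the shorter sections are kept whole.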
@@ -57,9 +57,12 @@ Input data:
 
 We use the `language` field to determine how to split the input text, following these rules:
 
-* We'll match the input `language` field against the `language_name` or `aliases` of each element of `custom_languages`, and use the matched one. If value of `language` is null, it'll be treated as empty string when matching `language_name` or `aliases`.
-* If no match is found, we'll match the `language` field against the builtin language configurations.
-  For all supported builtin language names and aliases (extensions), see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
+* We match the input `language` field against the following registries in the following order:
+  * `custom_languages` in the spec, against the `language_name` or `aliases` field of each entry.
+  * Builtin languages (see [Supported Languages](#supported-languages) section below), against the language, aliases or file extensions of each entry.
+
+  All matches are in a case-insensitive manner. If the value of `language` is null, it'll be treated as empty string.
+
 * If no match is found, the input will be treated as plain text.
 
 :::
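The lookup order described in this hunk can be sketched as follows. This is an illustration of the documented rules only, not the library's code: `LanguageEntry`, `BUILTIN`, and `resolve_language` are hypothetical names, and `BUILTIN` holds just three rows copied from the Supported Languages table.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LanguageEntry:
    """Hypothetical registry entry; custom entries leave `extensions` empty."""
    name: str
    aliases: list[str] = field(default_factory=list)
    extensions: list[str] = field(default_factory=list)

    def keys(self) -> set[str]:
        # Lowercase every key so that matching is case-insensitive.
        return {k.lower() for k in [self.name, *self.aliases, *self.extensions]}


# Illustrative subset of the builtin table, not the full list.
BUILTIN = [
    LanguageEntry("C#", ["CSharp", "CS"], [".cs"]),
    LanguageEntry("Python", [], [".py"]),
    LanguageEntry("Markdown", ["MD"], [".md", ".mdx"]),
]


def resolve_language(
    language: Optional[str], custom_languages: list[LanguageEntry]
) -> Optional[str]:
    """Return the matched language name, or None for plain text."""
    key = (language or "").lower()  # null is treated as an empty string
    for registry in (custom_languages, BUILTIN):  # custom entries take priority
        for entry in registry:
            if key in entry.keys():
                return entry.name
    return None  # no match: the input is treated as plain text


custom = [LanguageEntry("MyPython", aliases=["python"])]
```

With these assumptions, `resolve_language("Python", custom)` resolves to the custom entry because `custom_languages` is consulted before the builtin registry, while `resolve_language(".MDX", [])` falls through to the builtin Markdown entry.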
@@ -73,6 +76,42 @@ Return: [*KTable*](/docs/core/data_types#ktable), each row represents a chunk, w
 * `line` (*Int64*): The line number of the position. Starting from 1.
 * `column` (*Int64*): The column number of the position. Starting from 1.
 
+### Supported Languages
+
+Currently, `SplitRecursively` supports the following languages:
+
+| Language | Aliases | File Extensions |
+|----------|---------|-----------------|
+| C | | `.c` |
+| C++ | CPP | `.cpp`, `.cc`, `.cxx`, `.h`, `.hpp` |
+| C# | CSharp, CS | `.cs` |
+| CSS | | `.css`, `.scss` |
+| DTD | | `.dtd` |
+| Fortran | F, F90, F95, F03 | `.f`, `.f90`, `.f95`, `.f03` |
+| Go | Golang | `.go` |
+| HTML | | `.html`, `.htm` |
+| Java | | `.java` |
+| JavaScript | JS | `.js` |
+| JSON | | `.json` |
+| Kotlin | | `.kt`, `.kts` |
+| Markdown | MD | `.md`, `.mdx` |
+| Pascal | PAS, DPR, Delphi | `.pas`, `.dpr` |
+| PHP | | `.php` |
+| Python | | `.py` |
+| R | | `.r` |
+| Ruby | | `.rb` |
+| Rust | RS | `.rs` |
+| Scala | | `.scala` |
+| SQL | | `.sql` |
+| Swift | | `.swift` |
+| TOML | | `.toml` |
+| TSX | | `.tsx` |
+| TypeScript | TS | `.ts` |
+| XML | | `.xml` |
+| YAML | | `.yaml`, `.yml` |
+
+
+
 ## SentenceTransformerEmbed
 
 `SentenceTransformerEmbed` embeds a text into a vector space using the [SentenceTransformer](https://huggingface.co/sentence-transformers) library.
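Since the Supported Languages table added in this hunk lists file extensions with a leading dot, one plausible way to obtain a `language` value is to take the extension from a file name. The sketch below only derives that string with the standard library; `language_hint` is a hypothetical helper, and how the value is passed to `SplitRecursively` is outside this example.

```python
import os


def language_hint(filename: str) -> str:
    """Return the file extension including the leading dot, or "" if there is none."""
    return os.path.splitext(filename)[1]


hints = {name: language_hint(name) for name in ["src/main.rs", "README.MD", "Makefile"]}
```

No lowercasing is done here because the documented matching is case-insensitive. A file without an extension yields an empty string, which matches no builtin entry, so under the rules above it would be treated as plain text.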

src/ops/functions/split_recursively.rs

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@ static TREE_SITTER_LANGUAGE_BY_LANG: LazyLock<
     add_treesitter_language(
         &mut map,
         "C#",
-        [".cs", "cs"],
+        [".cs", "cs", "csharp"],
         tree_sitter_c_sharp::LANGUAGE,
         [],
     );