diff --git a/docs/docs/ops/functions.md b/docs/docs/ops/functions.md
index eebd4650a..2bc05f29f 100644
--- a/docs/docs/ops/functions.md
+++ b/docs/docs/ops/functions.md
@@ -31,7 +31,7 @@ The spec takes the following fields:
 * `separators_regex` (`list[str]`): A list of regex patterns to split the text. Higher-level boundaries should come first, and lower-level should be listed later. e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
-   See [regex Syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
+   See [regex syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
 
 Input data:
 
@@ -57,9 +57,12 @@ Input data:
 
 We use the `language` field to determine how to split the input text, following these rules:
 
- * We'll match the input `language` field against the `language_name` or `aliases` of each element of `custom_languages`, and use the matched one. If value of `language` is null, it'll be treated as empty string when matching `language_name` or `aliases`.
- * If no match is found, we'll match the `language` field against the builtin language configurations.
-   For all supported builtin language names and aliases (extensions), see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
+ * We match the input `language` field against the following registries in the following order:
+   * `custom_languages` in the spec, against the `language_name` or `aliases` field of each entry.
+   * Builtin languages (see [Supported Languages](#supported-languages) section below), against the language, aliases or file extensions of each entry.
+
+   All matches are in a case-insensitive manner. If the value of `language` is null, it'll be treated as empty string.
+
  * If no match is found, the input will be treated as plain text.
 
 :::
@@ -73,6 +76,42 @@ Return: [*KTable*](/docs/core/data_types#ktable), each row represents a chunk, w
    * `line` (*Int64*): The line number of the position. Starting from 1.
    * `column` (*Int64*): The column number of the position. Starting from 1.
 
+### Supported Languages
+
+Currently, `SplitRecursively` supports the following languages:
+
+| Language | Aliases | File Extensions |
+|----------|---------|-----------------|
+| C | | `.c` |
+| C++ | CPP | `.cpp`, `.cc`, `.cxx`, `.h`, `.hpp` |
+| C# | CSharp, CS | `.cs` |
+| CSS | | `.css`, `.scss` |
+| DTD | | `.dtd` |
+| Fortran | F, F90, F95, F03 | `.f`, `.f90`, `.f95`, `.f03` |
+| Go | Golang | `.go` |
+| HTML | | `.html`, `.htm` |
+| Java | | `.java` |
+| JavaScript | JS | `.js` |
+| JSON | | `.json` |
+| Kotlin | | `.kt`, `.kts` |
+| Markdown | MD | `.md`, `.mdx` |
+| Pascal | PAS, DPR, Delphi | `.pas`, `.dpr` |
+| PHP | | `.php` |
+| Python | | `.py` |
+| R | | `.r` |
+| Ruby | | `.rb` |
+| Rust | RS | `.rs` |
+| Scala | | `.scala` |
+| SQL | | `.sql` |
+| Swift | | `.swift` |
+| TOML | | `.toml` |
+| TSX | | `.tsx` |
+| TypeScript | TS | `.ts` |
+| XML | | `.xml` |
+| YAML | | `.yaml`, `.yml` |
+
+
+
 ## SentenceTransformerEmbed
 
 `SentenceTransformerEmbed` embeds a text into a vector space using the [SentenceTransformer](https://huggingface.co/sentence-transformers) library.
diff --git a/src/ops/functions/split_recursively.rs b/src/ops/functions/split_recursively.rs
index 30236ab9c..5204edcec 100644
--- a/src/ops/functions/split_recursively.rs
+++ b/src/ops/functions/split_recursively.rs
@@ -108,7 +108,7 @@ static TREE_SITTER_LANGUAGE_BY_LANG: LazyLock<
     add_treesitter_language(
         &mut map,
         "C#",
-        [".cs", "cs"],
+        [".cs", "cs", "csharp"],
         tree_sitter_c_sharp::LANGUAGE,
         [],
     );
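
Review note: the lookup order that the `SplitRecursively` docs change describes (custom languages first, then builtin languages, case-insensitive, null treated as an empty string, plain text as the fallback) can be sketched as below. This is only an illustration of the documented rules, not the actual Rust implementation; `LanguageEntry`, `BUILTIN`, and `resolve_language` are hypothetical names, and `BUILTIN` holds just a small excerpt of the documented table.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LanguageEntry:
    """Hypothetical registry entry; not CocoIndex's actual type."""

    name: str
    aliases: list[str] = field(default_factory=list)
    extensions: list[str] = field(default_factory=list)

    def matches(self, key: str) -> bool:
        # `key` is expected to be lowercased already by the caller.
        candidates = [self.name, *self.aliases, *self.extensions]
        return any(c.lower() == key for c in candidates)


# Excerpt of the builtin table from the docs (assumption: illustration only).
BUILTIN = [
    LanguageEntry("C#", ["CSharp", "CS"], [".cs"]),
    LanguageEntry("Python", [], [".py"]),
    LanguageEntry("Markdown", ["MD"], [".md", ".mdx"]),
]


def resolve_language(
    language: Optional[str], custom: list[LanguageEntry]
) -> Optional[LanguageEntry]:
    """Return the first matching entry, or None for the plain-text fallback."""
    key = (language or "").lower()  # null is treated as an empty string
    for registry in (custom, BUILTIN):  # custom entries take precedence
        for entry in registry:
            if entry.matches(key):
                return entry
    return None  # no match: the input would be treated as plain text
```

Under these assumptions, `resolve_language("csharp", [])` resolves to the `C#` entry, which is the case the `"csharp"` alias added in `split_recursively.rs` is meant to cover, and a custom entry that declares a `python` alias would shadow the builtin Python entry.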