Commit fdb0650

feat: add DetectProgrammingLanguage function (#1165)

1 parent c4f24b1 commit fdb0650

8 files changed: +233 −101 lines changed
docs/docs/ops/functions.md

Lines changed: 88 additions & 77 deletions
Input data:

* `text` (*Str*): The source text to parse.
* `language` (*Optional[Str]*, default: `"json"`): The language of the source text. Only `json` is supported now.

Return: *Json*, the parsed JSON object.
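For `language="json"`, the only supported option today, the behavior is conceptually that of a standard JSON parser. A plain-Python sketch (the sample input is ours, not from the docs):

```python
import json

# ParseJson conceptually maps a JSON source string to a parsed object,
# much like json.loads does in plain Python.
source_text = '{"name": "cocoindex", "functions": ["ParseJson", "SplitRecursively"]}'
parsed = json.loads(source_text)  # the *Json* return value is this parsed object
```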

## DetectProgrammingLanguage

`DetectProgrammingLanguage` detects the programming language of a file based on its filename extension.

Input data:

* `filename` (*Str*): The filename (with extension) to detect the language for.

Return: *Str* or *Null*. Returns the programming language name if the file extension is recognized, or *Null* if the extension is not supported.

The returned string values match the language names listed in [`tree-sitter-language-pack`](https://github.com/Goldziher/tree-sitter-language-pack?tab=readme-ov-file#available-languages).
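Conceptually, the function is an extension-to-name lookup. A minimal illustration (the mapping below covers only a few sample entries and is not CocoIndex's actual table):

```python
import os
from typing import Optional

# A few sample entries; the real function recognizes every language in
# tree-sitter-language-pack.
_EXT_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".rs": "rust",
    ".md": "markdown",
}

def detect_programming_language(filename: str) -> Optional[str]:
    """Return the language name for a filename, or None if unrecognized."""
    ext = os.path.splitext(filename)[1].lower()
    return _EXT_TO_LANGUAGE.get(ext)
```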
## SplitRecursively

`SplitRecursively` splits a document into chunks of a given size.
For example, for a Markdown file, it identifies boundaries in this order: level-…

The spec takes the following fields:

* `custom_languages` (`list[CustomLanguageSpec]`, optional): Allows you to customize how specific languages are chunked, using regular expressions. Each `CustomLanguageSpec` is a dict with the following fields:
    * `language_name` (`str`): Name of the language.
    * `aliases` (`list[str]`, optional): A list of aliases for the language.
      It's an error if any language name or alias is duplicated.
    * `separators_regex` (`list[str]`): A list of regex patterns to split the text.
      Higher-level boundaries should come first, and lower-level ones later, e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
      See [regex syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
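For instance, a custom entry for chunking changelog-style files at release headings first, then blank lines, then sentence boundaries might look like this (the language name, alias, and patterns are hypothetical, not built into CocoIndex):

```python
# Hypothetical CustomLanguageSpec entry, written as a plain dict:
changelog_language = {
    "language_name": "changelog",
    "aliases": ["chlog"],
    # Highest-level boundary first, lowest-level last.
    "separators_regex": [r"\n## ", r"\n\n", r"\. "],
}
```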

Input data:

* `text` (*Str*): The text to split.
* `chunk_size` (*Int64*): The maximum size of each chunk, in bytes.
* `min_chunk_size` (*Int64*, default: `chunk_size / 2`): The minimum size of each chunk, in bytes.

:::note

…

:::

* `chunk_overlap` (*Optional[Int64]*, default: *None*): The maximum overlap size between adjacent chunks, in bytes.
* `language` (*Str*, default: `""`): The language of the document.
  Can be a language name (e.g. `python`, `javascript`, `markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).

:::note

We use the `language` field to determine how to split the input text, following these rules:

* We match the input `language` field against the following registries, in order:
    * `custom_languages` in the spec, against the `language_name` or `aliases` field of each entry.
    * Built-in languages (see the [Supported Languages](#supported-languages) section below), against the language name, aliases, or file extensions of each entry.

  All matches are case-insensitive.

* If no match is found, the input is treated as plain text.

:::
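The matching order above can be sketched as a two-stage, case-insensitive lookup (function and registry names here are hypothetical, not CocoIndex internals):

```python
from typing import Optional

def resolve_language(
    language: str,
    custom: dict[str, str],    # lowercase name/alias -> custom language name
    builtin: dict[str, str],   # lowercase name/alias/extension -> built-in name
) -> Optional[str]:
    """Check custom languages first, then built-in ones; None means plain text."""
    key = language.lower()
    return custom.get(key) or builtin.get(key)
```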

Return: [*KTable*](/docs/core/data_types#ktable). Each row represents a chunk, with the following sub fields:

* `location` (*Range*): The location of the chunk.
* `text` (*Str*): The text of the chunk.
* `start` / `end` (*Struct*): Details about the start position (inclusive) and end position (exclusive) of the chunk. They have the following sub fields:
    * `offset` (*Int64*): The byte offset of the position.
    * `line` (*Int64*): The line number of the position, starting from 1.
    * `column` (*Int64*): The column number of the position, starting from 1.
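The splitting strategy itself can be sketched as: split at the highest-level separator, then re-split any piece still over `chunk_size` with the next, lower-level separator. A simplified sketch (it ignores `min_chunk_size` and `chunk_overlap`, and is not CocoIndex's implementation):

```python
import re

def split_recursively(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text at the first separator; recurse into oversized pieces."""
    if len(text.encode()) <= chunk_size or not separators:
        return [text]
    chunks: list[str] = []
    # A lookahead split keeps each boundary (e.g. "\n## ") attached to the
    # piece it introduces.
    for piece in re.split(f"(?={separators[0]})", text):
        if piece:
            chunks.extend(split_recursively(piece, separators[1:], chunk_size))
    return chunks
```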

### Supported Languages

Currently, `SplitRecursively` supports the following languages:

| Language | Aliases | File Extensions |
|----------|---------|-----------------|
| c | | `.c` |
| cpp | c++ | `.cpp`, `.cc`, `.cxx`, `.h`, `.hpp` |
| csharp | csharp, cs | `.cs` |
| css | | `.css`, `.scss` |
| dtd | | `.dtd` |
| fortran | f, f90, f95, f03 | `.f`, `.f90`, `.f95`, `.f03` |
| go | golang | `.go` |
| html | | `.html`, `.htm` |
| java | | `.java` |
| javascript | js | `.js` |
| json | | `.json` |
| kotlin | | `.kt`, `.kts` |
| markdown | md | `.md`, `.mdx` |
| pascal | pas, dpr, delphi | `.pas`, `.dpr` |
| php | | `.php` |
| python | | `.py` |
| r | | `.r` |
| ruby | | `.rb` |
| rust | rs | `.rs` |
| scala | | `.scala` |
| solidity | | `.sol` |
| sql | | `.sql` |
| swift | | `.swift` |
| toml | | `.toml` |
| tsx | | `.tsx` |
| typescript | ts | `.ts` |
| xml | | `.xml` |
| yaml | | `.yaml`, `.yml` |

## SentenceTransformerEmbed

This function requires the `sentence-transformers` library, which is an optional dependency:

```bash
pip install 'cocoindex[embeddings]'
```

:::

The spec takes the following fields:

* `model` (`str`): The name of the SentenceTransformer model to use.
* `args` (`dict[str, Any]`, optional): Additional arguments to pass to the SentenceTransformer constructor, e.g. `{"trust_remote_code": True}`.

Input data:

* `text` (*Str*): The text to embed.

Return: *Vector[Float32, N]*, where *N* is determined by the model.

## ExtractByLlm

`ExtractByLlm` extracts structured information from a text using a specified LLM. The spec takes the following fields:

* `llm_spec` (`cocoindex.LlmSpec`): The specification of the LLM to use. See [LLM Spec](/docs/ai/llm#llm-spec) for more details.
* `output_type` (`type`): The type of the output, e.g. a dataclass type name. See [Data Types](/docs/core/data_types) for all supported data types. The LLM will output values that match the schema of the type.
* `instruction` (`str`, optional): Additional instruction for the LLM.

:::tip Clear type definitions

The definition of the `output_type` is fed into the LLM as guidance to generate the output.
To improve the quality of the extracted information, giving clear definitions for your dataclasses is especially important, e.g.

* Provide readable field names for your dataclasses.
* Provide reasonable docstrings for your dataclasses.
* For any optional fields, clearly annotate that they are optional, by `SomeType | None` or `typing.Optional[SomeType]`.

:::

Input data:

* `text` (*Str*): The text to extract information from.

Return: As specified by the `output_type` field in the spec: the extracted information from the input text.
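Following the tip above, an `output_type` dataclass might look like this (the schema is a hypothetical example, not from the docs):

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass
class PersonInfo:
    """A person mentioned in the input text."""

    name: str                  # Full name, exactly as written in the text
    occupation: Optional[str]  # Occupation if stated, otherwise None
```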

## EmbedText

The spec takes the following fields:

* `api_type` ([`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types)): The type of LLM API to use for embedding.
* `model` (`str`): The name of the embedding model to use.
* `address` (`str`, optional): The address of the LLM API. If not specified, uses the default address for the API type.
* `output_dimension` (`int`, optional): The expected dimension of the output embedding vector. If not specified, uses the default dimension of the model.

  For most API types, the function internally keeps a registry of the default output dimensions of known models.
  You need to explicitly specify `output_dimension` if you want to use a new model that is not in the registry yet.

* `task_type` (`str`, optional): The task type for embedding, used by some embedding models to optimize the embedding for specific use cases.

:::note Supported APIs for Text Embedding

Not all LLM APIs support text embedding. See the [LLM API Types table](/docs/ai/llm#llm-api-types).

:::

Input data:

* `text` (*Str*): The text to embed.

Return: *Vector[Float32, N]*, where *N* is the dimension of the embedding vector determined by the model.
## ColPali Functions

ColPali functions enable multimodal document retrieval using ColVision models. These functions support ALL models available in the [colpali-engine library](https://github.com/illuin-tech/colpali), including:

* **ColPali models** (colpali-*): PaliGemma-based, best for general document retrieval
* **ColQwen2 models** (colqwen-*): Qwen2-VL-based, excellent for multilingual text (29+ languages) and general vision
* **ColSmol models** (colsmol-*): Lightweight, good for resource-constrained environments
* Any future ColVision models supported by colpali-engine

These models use late interaction between image patch embeddings and text token embeddings for retrieval.

These functions require the `colpali-engine` library, which is an optional dependency:

```bash
pip install 'cocoindex[colpali]'
```

:::

### ColPaliEmbedImage

The spec takes the following fields:

* `model` (`str`): Any ColVision model name supported by colpali-engine (e.g., "vidore/colpali-v1.2", "vidore/colqwen2.5-v0.2", "vidore/colsmol-v1.0"). See the [complete list of supported models](https://github.com/illuin-tech/colpali#list-of-colvision-models).

Input data:

* `img_bytes` (*Bytes*): The image data in bytes format.

Return: *Vector[Vector[Float32, N]]*, where *N* is the hidden dimension determined by the model. This is a multi-vector format with a variable number of patches and a fixed hidden dimension.

### ColPaliEmbedQuery

This produces query embeddings compatible with ColVision image embeddings for late interaction.

The spec takes the following fields:

* `model` (`str`): Any ColVision model name supported by colpali-engine (e.g., "vidore/colpali-v1.2", "vidore/colqwen2.5-v0.2", "vidore/colsmol-v1.0"). See the [complete list of supported models](https://github.com/illuin-tech/colpali#list-of-colvision-models).

Input data:

* `query` (*Str*): The text query to embed.

Return: *Vector[Vector[Float32, N]]*, where *N* is the hidden dimension determined by the model. This is a multi-vector format with a variable number of tokens and a fixed hidden dimension.
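Late interaction scores a query against an image by matching each query-token vector to its most similar image-patch vector and summing the maxima (often called MaxSim). A dependency-free sketch over plain lists (illustrative only, not part of CocoIndex's API):

```python
def maxsim_score(query_emb: list[list[float]], image_emb: list[list[float]]) -> float:
    """query_emb: token vectors, shaped like ColPaliEmbedQuery's output;
    image_emb: patch vectors, shaped like ColPaliEmbedImage's output."""
    def dot(u: list[float], v: list[float]) -> float:
        return sum(a * b for a, b in zip(u, v))
    # For each query token, keep only its best-matching patch, then sum.
    return sum(max(dot(q, p) for p in image_emb) for q in query_emb)
```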

examples/code_embedding/main.py

Lines changed: 4 additions & 9 deletions
```diff
@@ -1,20 +1,13 @@
 from dotenv import load_dotenv
 from psycopg_pool import ConnectionPool
 from pgvector.psycopg import register_vector
-from typing import Any
 import functools
 import cocoindex
 import os
 from numpy.typing import NDArray
 import numpy as np


-@cocoindex.op.function()
-def extract_extension(filename: str) -> str:
-    """Extract the extension of a filename."""
-    return os.path.splitext(filename)[1]
-
-
 @cocoindex.transform_flow()
 def code_to_embedding(
     text: cocoindex.DataSlice[str],
@@ -53,10 +46,12 @@ def code_embedding_flow(
     code_embeddings = data_scope.add_collector()

     with data_scope["files"].row() as file:
-        file["extension"] = file["filename"].transform(extract_extension)
+        file["language"] = file["filename"].transform(
+            cocoindex.functions.DetectProgrammingLanguage()
+        )
         file["chunks"] = file["content"].transform(
             cocoindex.functions.SplitRecursively(),
-            language=file["extension"],
+            language=file["language"],
             chunk_size=1000,
             min_chunk_size=300,
             chunk_overlap=300,
```

python/cocoindex/functions/__init__.py

Lines changed: 5 additions & 10 deletions
```diff
@@ -5,13 +5,7 @@
 """

 # Import all engine builtin function specs
-from ._engine_builtin_specs import (
-    ParseJson,
-    SplitRecursively,
-    SplitBySeparators,
-    EmbedText,
-    ExtractByLlm,
-)
+from ._engine_builtin_specs import *

 # Import SentenceTransformer embedding functionality
 from .sbert import (
@@ -29,11 +23,12 @@

 __all__ = [
     # Engine builtin specs
-    "ParseJson",
-    "SplitRecursively",
-    "SplitBySeparators",
+    "DetectProgrammingLanguage",
     "EmbedText",
     "ExtractByLlm",
+    "ParseJson",
+    "SplitBySeparators",
+    "SplitRecursively",
     # SentenceTransformer
     "SentenceTransformerEmbed",
     "SentenceTransformerEmbedExecutor",
```

python/cocoindex/functions/_engine_builtin_specs.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -19,6 +19,10 @@ class CustomLanguageSpec:
     aliases: list[str] = dataclasses.field(default_factory=list)


+class DetectProgrammingLanguage(op.FunctionSpec):
+    """Detect the programming language of a file."""
+
+
 class SplitRecursively(op.FunctionSpec):
     """Split a document (in string) recursively."""
```