## ParseJson

Input data:

* `text` (*Str*): The source text to parse.
* `language` (*Optional[Str]*, default: `"json"`): The language of the source text. Only `json` is supported now.

Return: *Json*, the parsed JSON object.
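
For example, a parsing step inside a flow might look like this (a minimal sketch; `doc`, `raw_json`, and `parsed` are illustrative names, not part of the API):

```python
import cocoindex

# Inside a flow definition, where `doc` is a row with a Str field "raw_json":
doc["parsed"] = doc["raw_json"].transform(cocoindex.functions.ParseJson())
```
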
## DetectProgrammingLanguage

`DetectProgrammingLanguage` detects the programming language of a file based on its filename extension.

Input data:

* `filename` (*Str*): The filename (with extension) to detect the language for.

Return: *Str* or *Null*. Returns the programming language name if the file extension is recognized, or *Null* if the extension is not supported.

The returned string values match the language name listed in [`tree-sitter-language-pack`](https://github.com/Goldziher/tree-sitter-language-pack?tab=readme-ov-file#available-languages).
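
For example (a minimal sketch; `file`, `filename`, and `language` are illustrative field names):

```python
import cocoindex

# Inside a flow definition, where `file` is a row with a Str field "filename":
file["language"] = file["filename"].transform(
    cocoindex.functions.DetectProgrammingLanguage()
)
```
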
## SplitRecursively

`SplitRecursively` splits a document into chunks of a given size.
For example, for a Markdown file, it identifies boundaries in this order: level-…

The spec takes the following fields:

* `custom_languages` (`list[CustomLanguageSpec]`, optional): This allows you to customize the way to chunk specific languages using regular expressions (see the sketch after this list). Each `CustomLanguageSpec` is a dict with the following fields:
    * `language_name` (`str`): Name of the language.
    * `aliases` (`list[str]`, optional): A list of aliases for the language.
      It's an error if any language name or alias is duplicated.
    * `separators_regex` (`list[str]`): A list of regex patterns to split the text.
      Higher-level boundaries should come first, and lower-level ones should be listed later, e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
      See [regex syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
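
A minimal sketch of a custom language entry (the `rst` language and its separator patterns are hypothetical, chosen only for illustration):

```python
import cocoindex

# A hypothetical custom language, expressed as a dict with the fields above:
rst_like = {
    "language_name": "rst",
    "aliases": [".rst"],
    # Higher-level boundaries first: section underlines, paragraphs, sentences.
    "separators_regex": [r"\n={3,}\n", r"\n\n", r"\. "],
}

splitter = cocoindex.functions.SplitRecursively(custom_languages=[rst_like])
```
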
Input data:

* `text` (*Str*): The text to split.
* `chunk_size` (*Int64*): The maximum size of each chunk, in bytes.
* `min_chunk_size` (*Int64*, default: `chunk_size / 2`): The minimum size of each chunk, in bytes.

:::note

…

:::

* `chunk_overlap` (*Optional[Int64]*, default: *None*): The maximum overlap size between adjacent chunks, in bytes.
* `language` (*Str*, default: `""`): The language of the document.
  Can be a language name (e.g. `python`, `javascript`, `markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).

:::note

We use the `language` field to determine how to split the input text, following these rules:

* We match the input `language` field against the following registries, in the following order:
    * `custom_languages` in the spec, against the `language_name` or `aliases` field of each entry.
    * Builtin languages (see the [Supported Languages](#supported-languages) section below), against the language name, aliases, or file extensions of each entry.

All matches are case-insensitive.

* If no match is found, the input will be treated as plain text.

:::

Return: [*KTable*](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:

* `location` (*Range*): The location of the chunk.
* `text` (*Str*): The text of the chunk.
* `start` / `end` (*Struct*): Details about the start position (inclusive) and end position (exclusive) of the chunk. They have the following sub fields:
    * `offset` (*Int64*): The byte offset of the position.
    * `line` (*Int64*): The line number of the position. Starting from 1.
    * `column` (*Int64*): The column number of the position. Starting from 1.

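A typical chunking step in a flow might look like this (a minimal sketch; `doc`, `content`, and `chunks` are illustrative names, and the sizes are arbitrary):

```python
import cocoindex

# Inside a flow definition: split a markdown field into overlapping chunks.
doc["chunks"] = doc["content"].transform(
    cocoindex.functions.SplitRecursively(),
    language="markdown",
    chunk_size=2000,
    chunk_overlap=500,
)
```
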
### Supported Languages

Currently, `SplitRecursively` supports the following languages:

…

## SentenceTransformerEmbed

:::note

This function requires the 'sentence-transformers' library, which is an optional dependency:

```bash
pip install 'cocoindex[embeddings]'
```

:::

The spec takes the following fields:

* `model` (`str`): The name of the SentenceTransformer model to use.
* `args` (`dict[str, Any]`, optional): Additional arguments to pass to the SentenceTransformer constructor, e.g. `{"trust_remote_code": True}`.

Input data:

* `text` (*Str*): The text to embed.

Return: *Vector[Float32, N]*, where *N* is determined by the model.
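
For example (a minimal sketch; `chunk`, `text`, and `embedding` are illustrative names):

```python
import cocoindex

# Inside a flow definition: embed a text field with a SentenceTransformer model.
chunk["embedding"] = chunk["text"].transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="sentence-transformers/all-MiniLM-L6-v2",
    )
)
```
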
## ExtractByLlm

`ExtractByLlm` extracts structured information from a text using a specified LLM. The spec takes the following fields:

* `llm_spec` (`cocoindex.LlmSpec`): The specification of the LLM to use. See [LLM Spec](/docs/ai/llm#llm-spec) for more details.
* `output_type` (`type`): The type of the output, e.g. a dataclass type name. See [Data Types](/docs/core/data_types) for all supported data types. The LLM will output values that match the schema of the type.
* `instruction` (`str`, optional): Additional instruction for the LLM.

:::tip Clear type definitions

The definition of the `output_type` is fed into the LLM as guidance to generate the output.
To improve the quality of the extracted information, giving clear definitions for your dataclasses is especially important, e.g.

* Provide readable field names for your dataclasses.
* Provide reasonable docstrings for your dataclasses.
* For any optional fields, clearly annotate that they are optional, by `SomeType | None` or `typing.Optional[SomeType]`.

:::

Input data:

* `text` (*Str*): The text to extract information from.

Return: As specified by the `output_type` field in the spec. The extracted information from the input text.
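
Putting it together, an extraction step might look like this (a minimal sketch; `ModuleInfo`, the model name, and the field names are all illustrative):

```python
import dataclasses

import cocoindex

@dataclasses.dataclass
class ModuleInfo:
    """Information about a Python module."""

    name: str
    description: str

# Inside a flow definition: extract structured info from a text field.
doc["module_info"] = doc["content"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
        ),
        output_type=ModuleInfo,
        instruction="Extract the module name and a short description.",
    )
)
```
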
## EmbedText

The spec takes the following fields:

* `api_type` ([`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types)): The type of LLM API to use for embedding.
* `model` (`str`): The name of the embedding model to use.
* `address` (`str`, optional): The address of the LLM API. If not specified, uses the default address for the API type.
* `output_dimension` (`int`, optional): The expected dimension of the output embedding vector. If not specified, uses the default dimension of the model.

  For most API types, the function internally keeps a registry of the default output dimensions of known models.
  You need to explicitly specify the `output_dimension` if you want to use a new model that is not in the registry yet.

* `task_type` (`str`, optional): The task type for embedding, used by some embedding models to optimize the embedding for specific use cases.

:::note Supported APIs for Text Embedding

Not all LLM APIs support text embedding. See the [LLM API Types table](/docs/ai/llm#llm-api-types).

:::

Input data:

* `text` (*Str*): The text to embed.

Return: *Vector[Float32, N]*, where *N* is the dimension of the embedding vector determined by the model.
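
For example (a minimal sketch; the model name and field names are illustrative):

```python
import cocoindex

# Inside a flow definition: embed a text field via an LLM embedding API.
doc["embedding"] = doc["text"].transform(
    cocoindex.functions.EmbedText(
        api_type=cocoindex.LlmApiType.OPENAI,
        model="text-embedding-3-small",
    )
)
```
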
## ColPali Functions

ColPali functions enable multimodal document retrieval using ColVision models. These functions support ALL models available in the [colpali-engine library](https://github.com/illuin-tech/colpali), including:

* **ColPali models** (`colpali-*`): PaliGemma-based, best for general document retrieval
* **ColQwen2 models** (`colqwen-*`): Qwen2-VL-based, excellent for multilingual text (29+ languages) and general vision
* **ColSmol models** (`colsmol-*`): Lightweight, good for resource-constrained environments
* Any future ColVision models supported by colpali-engine

These models use late interaction between image patch embeddings and text token embeddings for retrieval.

These functions require the `colpali-engine` library, which is an optional dependency.

### ColPaliEmbedImage

The spec takes the following fields:

* `model` (`str`): Any ColVision model name supported by colpali-engine (e.g., "vidore/colpali-v1.2", "vidore/colqwen2.5-v0.2", "vidore/colsmol-v1.0"). See the [complete list of supported models](https://github.com/illuin-tech/colpali#list-of-colvision-models).

Input data:

* `img_bytes` (*Bytes*): The image data in bytes format.

Return: *Vector[Vector[Float32, N]]*, where *N* is the hidden dimension determined by the model. This returns a multi-vector format with variable patches and fixed hidden dimension.
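
For example (a minimal sketch; `img`, `content`, and `embedding` are illustrative names):

```python
import cocoindex

# Inside a flow definition: embed image bytes into patch-level multi-vectors.
img["embedding"] = img["content"].transform(
    cocoindex.functions.ColPaliEmbedImage(model="vidore/colpali-v1.2")
)
```
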

### ColPaliEmbedQuery

This produces query embeddings compatible with ColVision image embeddings for late interaction.

The spec takes the following fields:

* `model` (`str`): Any ColVision model name supported by colpali-engine (e.g., "vidore/colpali-v1.2", "vidore/colqwen2.5-v0.2", "vidore/colsmol-v1.0"). See the [complete list of supported models](https://github.com/illuin-tech/colpali#list-of-colvision-models).

Input data:

* `query` (*Str*): The text query to embed.

Return: *Vector[Vector[Float32, N]]*, where *N* is the hidden dimension determined by the model. This returns a multi-vector format with variable tokens and fixed hidden dimension.
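
At query time, the same spec can be used inside a transform flow (a minimal sketch; the multi-vector return annotation is an assumption, not confirmed by this document):

```python
import cocoindex

# A sketch of a query-time transform flow; the return type annotation for the
# multi-vector output is an assumption.
@cocoindex.transform_flow()
def query_to_colpali_embedding(
    text: cocoindex.DataSlice[str],
) -> cocoindex.DataSlice[list[list[float]]]:
    return text.transform(
        cocoindex.functions.ColPaliEmbedQuery(model="vidore/colpali-v1.2")
    )
```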