Commit 64c6af3

docs(llm): add docs for text embedding. (#650)
1 parent ba758a1 commit 64c6af3

File tree: 5 files changed (+145, -16 lines)

docs/docs/ai/llm.mdx

Lines changed: 113 additions & 13 deletions
@@ -6,21 +6,63 @@ description: LLMs integrated with CocoIndex for various built-in functions
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
 
-CocoIndex provides builtin functions (e.g. [`ExtractByLlm`](/docs/ops/functions#extractbyllm)) that process data using LLM.
-You usually need to provide a `LlmSpec`, to configure the LLM integration you want to use and LLM models, etc.
+CocoIndex provides builtin functions that integrate with various LLM APIs for different inference tasks:
+
+* [Text Generation](#text-generation): use an LLM to generate text.
+* [Text Embedding](#text-embedding): embed text into a vector space.
 
+## LLM API Types
 
-## LLM Spec
+We support integrating with LLMs through different types of APIs.
+Each LLM API type is identified by a `cocoindex.LlmApiType` enum value.
 
-The `cocoindex.LlmSpec` data class is used to configure the LLM integration you want to use and LLM models, etc.
+We support the following types of LLM APIs:
+
+| API Name | `LlmApiType` enum | Text Generation | Text Embedding |
+|----------|-------------------|-----------------|----------------|
+| [OpenAI](#openai) | `LlmApiType.OPENAI` | ✅ | ✅ |
+| [Ollama](#ollama) | `LlmApiType.OLLAMA` | ✅ | ❌ |
+| [Google Gemini](#google-gemini) | `LlmApiType.GEMINI` | ✅ | ✅ |
+| [Anthropic](#anthropic) | `LlmApiType.ANTHROPIC` | ✅ | ❌ |
+| [Voyage](#voyage) | `LlmApiType.VOYAGE` | ❌ | ✅ |
+| [LiteLLM](#litellm) | `LlmApiType.LITE_LLM` | ✅ | ❌ |
+| [OpenRouter](#openrouter) | `LlmApiType.OPEN_ROUTER` | ✅ | ❌ |
+
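To make the matrix above concrete, here is a small self-contained Python sketch of a capability lookup. The helper and its string keys are hypothetical (not part of the CocoIndex API); the support flags simply mirror the table.

```python
# Hypothetical capability lookup mirroring the docs table above.
# Not part of CocoIndex; purely illustrative.
SUPPORTED_TASKS = {
    "OPENAI": {"generation", "embedding"},
    "OLLAMA": {"generation"},
    "GEMINI": {"generation", "embedding"},
    "ANTHROPIC": {"generation"},
    "VOYAGE": {"embedding"},
    "LITE_LLM": {"generation"},
    "OPEN_ROUTER": {"generation"},
}

def supports(api_type: str, task: str) -> bool:
    """Return True if the given API type supports the given task."""
    return task in SUPPORTED_TASKS.get(api_type, set())
```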
+## LLM Tasks
+
+### Text Generation
+
+Text generation is used as a building block by CocoIndex functions that process data with LLM generation.
+
+We currently have one builtin function that uses LLM generation:
+
+* [`ExtractByLlm`](/docs/ops/functions#extractbyllm): extracts information from input text.
+
+#### LLM Spec
+
+When calling a CocoIndex function that uses LLM generation, you need to provide a `cocoindex.LlmSpec` dataclass to configure the LLM used by the function.
 It has the following fields:
 
-* `api_type`: The type of integrated LLM API to use, e.g. `cocoindex.LlmApiType.OPENAI` or `cocoindex.LlmApiType.OLLAMA`.
+* `api_type` (type: [`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types), required): The type of integrated LLM API to use, e.g. `cocoindex.LlmApiType.OPENAI` or `cocoindex.LlmApiType.OLLAMA`.
   See supported LLM APIs in the [LLM API integrations](#llm-api-integrations) section below.
-* `model`: The name of the LLM model to use.
-* `address` (optional): The address of the LLM API.
+* `model` (type: `str`, required): The name of the LLM model to use.
+* `address` (type: `str`, optional): The address of the LLM API.
 
 
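To make the field requirements concrete, here is a minimal stand-in dataclass. It is not the real `cocoindex.LlmSpec`; it only illustrates the shape described above, and the model name and address in the usage line are hypothetical examples.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for cocoindex.LlmSpec: field names and optionality
# mirror the docs above, but this is NOT the real class.
@dataclass
class LlmSpecSketch:
    api_type: str                  # required, e.g. "OPENAI" or "OLLAMA"
    model: str                     # required, model name string
    address: Optional[str] = None  # optional custom API address

# Hypothetical example: a locally running Ollama server.
spec = LlmSpecSketch(api_type="OLLAMA", model="llama3.2",
                     address="http://localhost:11434")
```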
+### Text Embedding
+
+Text embedding converts text into a vector space, usually for similarity matching.
+
+We provide a builtin function [`EmbedText`](/docs/ops/functions#embedtext) that embeds a given text into a vector space.
+The spec takes the following fields:
+
+* `api_type` (type: `cocoindex.LlmApiType`, required)
+* `model` (type: `str`, required)
+* `address` (type: `str`, optional)
+* `output_dimension` (type: `int`, optional)
+* `task_type` (type: `str`, optional)
+
+See the documentation for [`EmbedText`](/docs/ops/functions#embedtext) for more details about these fields.
+
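Conceptually, text embedding maps a string to a fixed-length float vector. The toy sketch below illustrates only that contract: `toy_embed_text` is hypothetical, uses a deterministic hash instead of a real model, and only mimics how an output dimension shapes the result.

```python
import hashlib
import math

def toy_embed_text(text: str, output_dimension: int = 4) -> list:
    """Toy embedding: derive a fixed-length unit vector from a hash.

    Real embedding models capture semantics; this sketch only shows the
    output contract (a float vector of a known, fixed dimension).
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [digest[i % len(digest)] / 255.0 for i in range(output_dimension)]
    # Normalize to unit length (guard against an all-zero vector).
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]
```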
 ## LLM API Integrations
 
 CocoIndex integrates with various LLM APIs for these functions.
@@ -30,7 +72,11 @@ CocoIndex integrates with various LLM APIs for these functions.
 To use the OpenAI LLM API, you need to set the environment variable `OPENAI_API_KEY`.
 You can generate the API key from [OpenAI Dashboard](https://platform.openai.com/api-keys).
 
-A spec for OpenAI looks like this:
+Currently we don't support a custom address for the OpenAI API.
+
+You can find the full list of models supported by OpenAI [here](https://platform.openai.com/docs/models).
+
+For text generation, a spec for OpenAI looks like this:
 
 <Tabs>
 <TabItem value="python" label="Python" default>
@@ -42,9 +88,20 @@ cocoindex.LlmSpec(
 )
 ```
 
-Currently we don't support custom address for OpenAI API.
+</TabItem>
+</Tabs>
 
-You can find the full list of models supported by OpenAI [here](https://platform.openai.com/docs/models).
+For text embedding, a spec for OpenAI looks like this:
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+cocoindex.functions.EmbedText(
+    api_type=cocoindex.LlmApiType.OPENAI,
+    model="text-embedding-3-small",
+)
+```
 
 </TabItem>
 </Tabs>
@@ -82,7 +139,9 @@ cocoindex.LlmSpec(
 To use the Gemini LLM API, you need to set the environment variable `GEMINI_API_KEY`.
 You can generate the API key from [Google AI Studio](https://aistudio.google.com/apikey).
 
-A spec for Gemini looks like this:
+You can find the full list of models supported by Gemini [here](https://ai.google.dev/gemini-api/docs/models).
+
+For text generation, a spec looks like this:
 
 <Tabs>
 <TabItem value="python" label="Python" default>
@@ -97,14 +156,32 @@ cocoindex.LlmSpec(
 </TabItem>
 </Tabs>
 
-You can find the full list of models supported by Gemini [here](https://ai.google.dev/gemini-api/docs/models).
+For text embedding, a spec looks like this:
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+cocoindex.functions.EmbedText(
+    api_type=cocoindex.LlmApiType.GEMINI,
+    model="text-embedding-004",
+    task_type="SEMANTIC_SIMILARITY",
+)
+```
+
+All supported embedding models can be found [here](https://ai.google.dev/gemini-api/docs/embeddings#embeddings-models).
+Gemini supports an optional task type, whose supported values can be found [here](https://ai.google.dev/gemini-api/docs/embeddings#supported-task-types).
+
+
+</TabItem>
+</Tabs>
 
 ### Anthropic
 
 To use the Anthropic LLM API, you need to set the environment variable `ANTHROPIC_API_KEY`.
 You can generate the API key from [Anthropic API](https://console.anthropic.com/settings/keys).
 
-A spec for Anthropic looks like this:
+A text generation spec for Anthropic looks like this:
 
 <Tabs>
 <TabItem value="python" label="Python" default>
@@ -121,6 +198,29 @@ cocoindex.LlmSpec(
 
 You can find the full list of models supported by Anthropic [here](https://docs.anthropic.com/en/docs/about-claude/models/all-models).
 
+### Voyage
+
+To use the Voyage LLM API, you need to set the environment variable `VOYAGE_API_KEY`.
+You can generate the API key from the [Voyage dashboard](https://dashboard.voyageai.com/organization/api-keys).
+
+A text embedding spec for Voyage looks like this:
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+cocoindex.functions.EmbedText(
+    api_type=cocoindex.LlmApiType.VOYAGE,
+    model="voyage-code-3",
+    task_type="document",
+)
+```
+
+</TabItem>
+</Tabs>
+
+The Voyage API supports `document` and `query` as task types (optional; known as `input_type` in the Voyage API, see the [Voyage API documentation](https://docs.voyageai.com/reference/embeddings-api) for details).
+
 ### LiteLLM
 
 To use the LiteLLM API, you need to set the environment variable `LITELLM_API_KEY`.

docs/docs/ops/functions.md

Lines changed: 29 additions & 0 deletions
@@ -105,3 +105,32 @@ Input data:
 * `text` (type: `str`, required): The text to extract information from.
 
 Return type: As specified by the `output_type` field in the spec. The extracted information from the input text.
+
+## EmbedText
+
+`EmbedText` embeds a text into a vector space using various LLM APIs that support text embedding.
+
+The spec takes the following fields:
+
+* `api_type` (type: [`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types), required): The type of LLM API to use for embedding.
+* `model` (type: `str`, required): The name of the embedding model to use.
+* `address` (type: `str`, optional): The address of the LLM API. If not specified, the default address for the API type is used.
+* `output_dimension` (type: `int`, optional): The expected dimension of the output embedding vector. If not specified, the default dimension of the model is used.
+
+  For most API types, the function internally keeps a registry of the default output dimensions of known models.
+  You need to explicitly specify `output_dimension` if you want to use a new model that is not in the registry yet.
+
+* `task_type` (type: `str`, optional): The task type for embedding, used by some embedding models to optimize the embedding for specific use cases.
+
+:::note Supported APIs for Text Embedding
+
+Not all LLM APIs support text embedding. See the [LLM API Types table](/docs/ai/llm#llm-api-types) for which APIs support text embedding functionality.
+
+:::
+
+Input data:
+
+* `text` (type: `str`, required): The text to embed.
+
+Return type: `vector[float32; N]`, where `N` is the dimension of the embedding vector, determined by the model.
+
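The `output_dimension` defaulting described above can be sketched as a simple registry lookup. This is a hypothetical illustration of the behavior, not CocoIndex internals; the registry contents below are examples, not an authoritative list.

```python
# Hypothetical registry of default output dimensions for known models.
DEFAULT_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-004": 768,
}

def resolve_output_dimension(model, output_dimension=None):
    """Mimic the documented behavior: prefer an explicit value, fall back
    to the registry of known models, and fail for unknown models."""
    if output_dimension is not None:
        return output_dimension
    if model in DEFAULT_DIMENSIONS:
        return DEFAULT_DIMENSIONS[model]
    raise ValueError(
        f"Unknown model {model!r}: please specify output_dimension explicitly"
    )
```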

docs/sidebars.ts

Lines changed: 1 addition & 1 deletion
@@ -61,4 +61,4 @@ const sidebars: SidebarsConfig = {
   ],
 };
 
-export default sidebars;
+export default sidebars;

examples/code_embedding/main.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ def code_to_embedding(
     # You can also switch to Voyage embedding model:
     # return text.transform(
     #     cocoindex.functions.EmbedText(
-    #         api_type=cocoindex.llm.LlmApiType.VOYAGE,
+    #         api_type=cocoindex.LlmApiType.VOYAGE,
     #         model="voyage-code-3",
     #     )
     # )

examples/text_embedding/main.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ def text_to_embedding(
     # You can also switch to remote embedding model:
     # return text.transform(
     #     cocoindex.functions.EmbedText(
-    #         api_type=cocoindex.llm.LlmApiType.OPENAI,
+    #         api_type=cocoindex.LlmApiType.OPENAI,
     #         model="text-embedding-3-small",
     #     )
     # )
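Embedding vectors produced as in the examples above (type `vector[float32; N]`) are typically compared by cosine similarity. A minimal pure-Python sketch, independent of CocoIndex:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)
```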
