Commit 64c6af3

docs(llm): add docs for text embedding. (#650)
1 parent ba758a1 commit 64c6af3

File tree: 5 files changed (+145, -16 lines)

docs/docs/ai/llm.mdx

Lines changed: 113 additions & 13 deletions
@@ -6,21 +6,63 @@ description: LLMs integrated with CocoIndex for various built-in functions
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
 
-CocoIndex provides builtin functions (e.g. [`ExtractByLlm`](/docs/ops/functions#extractbyllm)) that process data using LLM.
-You usually need to provide a `LlmSpec`, to configure the LLM integration you want to use and LLM models, etc.
+CocoIndex provides builtin functions that integrate with various LLM APIs for different inference tasks:
+
+* [Text Generation](#text-generation): use an LLM to generate text.
+* [Text Embedding](#text-embedding): embed text into a vector space.
 
+## LLM API Types
 
-## LLM Spec
+We support integrating with LLMs through different types of APIs.
+Each LLM API type is identified by a `cocoindex.LlmApiType` enum value.
 
-The `cocoindex.LlmSpec` data class is used to configure the LLM integration you want to use and LLM models, etc.
+We support the following types of LLM APIs:
+
+| API Name | `LlmApiType` enum | Text Generation | Text Embedding |
+|----------|-------------------|-----------------|----------------|
+| [OpenAI](#openai) | `LlmApiType.OPENAI` | ✅ | ✅ |
+| [Ollama](#ollama) | `LlmApiType.OLLAMA` | ✅ | ❌ |
+| [Google Gemini](#google-gemini) | `LlmApiType.GEMINI` | ✅ | ✅ |
+| [Anthropic](#anthropic) | `LlmApiType.ANTHROPIC` | ✅ | ❌ |
+| [Voyage](#voyage) | `LlmApiType.VOYAGE` | ❌ | ✅ |
+| [LiteLLM](#litellm) | `LlmApiType.LITE_LLM` | ✅ | ❌ |
+| [OpenRouter](#openrouter) | `LlmApiType.OPEN_ROUTER` | ✅ | ❌ |
+
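To make the matrix above concrete, here is a small self-contained Python sketch of a capability lookup. The helper and its string keys are hypothetical (not part of the CocoIndex API); the support flags simply mirror the table.

```python
# Hypothetical capability lookup mirroring the docs table above.
# Not part of CocoIndex; purely illustrative.
SUPPORTED_TASKS = {
    "OPENAI": {"generation", "embedding"},
    "OLLAMA": {"generation"},
    "GEMINI": {"generation", "embedding"},
    "ANTHROPIC": {"generation"},
    "VOYAGE": {"embedding"},
    "LITE_LLM": {"generation"},
    "OPEN_ROUTER": {"generation"},
}

def supports(api_type: str, task: str) -> bool:
    """Return True if the given API type supports the given task."""
    return task in SUPPORTED_TASKS.get(api_type, set())
```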
+## LLM Tasks
+
+### Text Generation
+
+Text generation is used as a building block by CocoIndex functions that process data with LLM generation.
+
+We currently have one builtin function that uses LLM generation:
+
+* [`ExtractByLlm`](/docs/ops/functions#extractbyllm): extracts information from input text.
+
+#### LLM Spec
+
+When calling a CocoIndex function that uses LLM generation, you need to provide a `cocoindex.LlmSpec` dataclass to configure the LLM used by the function.
 It has the following fields:
 
-* `api_type`: The type of integrated LLM API to use, e.g. `cocoindex.LlmApiType.OPENAI` or `cocoindex.LlmApiType.OLLAMA`.
+* `api_type` (type: [`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types), required): The type of integrated LLM API to use, e.g. `cocoindex.LlmApiType.OPENAI` or `cocoindex.LlmApiType.OLLAMA`.
   See supported LLM APIs in the [LLM API integrations](#llm-api-integrations) section below.
-* `model`: The name of the LLM model to use.
-* `address` (optional): The address of the LLM API.
+* `model` (type: `str`, required): The name of the LLM model to use.
+* `address` (type: `str`, optional): The address of the LLM API.
 
 
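To make the field requirements concrete, here is a minimal stand-in dataclass. It is not the real `cocoindex.LlmSpec`; it only illustrates the shape described above, and the model name and address in the usage line are hypothetical examples.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for cocoindex.LlmSpec: field names and optionality
# mirror the docs above, but this is NOT the real class.
@dataclass
class LlmSpecSketch:
    api_type: str                  # required, e.g. "OPENAI" or "OLLAMA"
    model: str                     # required, model name string
    address: Optional[str] = None  # optional custom API address

# Hypothetical example: a locally running Ollama server.
spec = LlmSpecSketch(api_type="OLLAMA", model="llama3.2",
                     address="http://localhost:11434")
```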
+### Text Embedding
+
+Text embedding converts text into a vector space, usually for similarity matching.
+
+We provide a builtin function [`EmbedText`](/docs/ops/functions#embedtext) that embeds a given text into a vector space.
+The spec takes the following fields:
+
+* `api_type` (type: `cocoindex.LlmApiType`, required)
+* `model` (type: `str`, required)
+* `address` (type: `str`, optional)
+* `output_dimension` (type: `int`, optional)
+* `task_type` (type: `str`, optional)
+
+See the documentation for [`EmbedText`](/docs/ops/functions#embedtext) for more details about these fields.
+
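Conceptually, text embedding maps a string to a fixed-length float vector. The toy sketch below illustrates only that contract: `toy_embed_text` is hypothetical, uses a deterministic hash instead of a real model, and only mimics how an output dimension shapes the result.

```python
import hashlib
import math

def toy_embed_text(text: str, output_dimension: int = 4) -> list:
    """Toy embedding: derive a fixed-length unit vector from a hash.

    Real embedding models capture semantics; this sketch only shows the
    output contract (a float vector of a known, fixed dimension).
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [digest[i % len(digest)] / 255.0 for i in range(output_dimension)]
    # Normalize to unit length (guard against an all-zero vector).
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]
```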
 ## LLM API Integrations
 
 CocoIndex integrates with various LLM APIs for these functions.
@@ -30,7 +72,11 @@ CocoIndex integrates with various LLM APIs for these functions.
 To use the OpenAI LLM API, you need to set the environment variable `OPENAI_API_KEY`.
 You can generate the API key from [OpenAI Dashboard](https://platform.openai.com/api-keys).
 
-A spec for OpenAI looks like this:
+Currently we don't support a custom address for the OpenAI API.
+
+You can find the full list of models supported by OpenAI [here](https://platform.openai.com/docs/models).
+
+For text generation, a spec for OpenAI looks like this:
 
 <Tabs>
 <TabItem value="python" label="Python" default>
@@ -42,9 +88,20 @@ cocoindex.LlmSpec(
 )
 ```
 
-Currently we don't support custom address for OpenAI API.
+</TabItem>
+</Tabs>
 
-You can find the full list of models supported by OpenAI [here](https://platform.openai.com/docs/models).
+For text embedding, a spec for OpenAI looks like this:
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+cocoindex.functions.EmbedText(
+    api_type=cocoindex.LlmApiType.OPENAI,
+    model="text-embedding-3-small",
+)
+```
 
 </TabItem>
 </Tabs>
@@ -82,7 +139,9 @@ cocoindex.LlmSpec(
 To use the Gemini LLM API, you need to set the environment variable `GEMINI_API_KEY`.
 You can generate the API key from [Google AI Studio](https://aistudio.google.com/apikey).
 
-A spec for Gemini looks like this:
+You can find the full list of models supported by Gemini [here](https://ai.google.dev/gemini-api/docs/models).
+
+For text generation, a spec looks like this:
 
 <Tabs>
 <TabItem value="python" label="Python" default>
@@ -97,14 +156,32 @@ cocoindex.LlmSpec(
 </TabItem>
 </Tabs>
 
-You can find the full list of models supported by Gemini [here](https://ai.google.dev/gemini-api/docs/models).
+For text embedding, a spec looks like this:
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+cocoindex.functions.EmbedText(
+    api_type=cocoindex.LlmApiType.GEMINI,
+    model="text-embedding-004",
+    task_type="SEMANTIC_SIMILARITY",
+)
+```
+
+All supported embedding models can be found [here](https://ai.google.dev/gemini-api/docs/embeddings#embeddings-models).
+Gemini supports an optional task type, whose supported values can be found [here](https://ai.google.dev/gemini-api/docs/embeddings#supported-task-types).
+
+
+</TabItem>
+</Tabs>
 
 ### Anthropic
 
 To use the Anthropic LLM API, you need to set the environment variable `ANTHROPIC_API_KEY`.
 You can generate the API key from [Anthropic API](https://console.anthropic.com/settings/keys).
 
-A spec for Anthropic looks like this:
+A text generation spec for Anthropic looks like this:
 
 <Tabs>
 <TabItem value="python" label="Python" default>
@@ -121,6 +198,29 @@ cocoindex.LlmSpec(
 
 You can find the full list of models supported by Anthropic [here](https://docs.anthropic.com/en/docs/about-claude/models/all-models).
 
+### Voyage
+
+To use the Voyage LLM API, you need to set the environment variable `VOYAGE_API_KEY`.
+You can generate the API key from the [Voyage dashboard](https://dashboard.voyageai.com/organization/api-keys).
+
+A text embedding spec for Voyage looks like this:
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+cocoindex.functions.EmbedText(
+    api_type=cocoindex.LlmApiType.VOYAGE,
+    model="voyage-code-3",
+    task_type="document",
+)
+```
+
+</TabItem>
+</Tabs>
+
+The Voyage API supports `document` and `query` as task types (optional; known as `input_type` in the Voyage API, see the [Voyage API documentation](https://docs.voyageai.com/reference/embeddings-api) for details).
+
 ### LiteLLM
 
 To use the LiteLLM API, you need to set the environment variable `LITELLM_API_KEY`.

docs/docs/ops/functions.md

Lines changed: 29 additions & 0 deletions
@@ -105,3 +105,32 @@ Input data:
 * `text` (type: `str`, required): The text to extract information from.
 
 Return type: As specified by the `output_type` field in the spec. The extracted information from the input text.
+
+## EmbedText
+
+`EmbedText` embeds a text into a vector space using various LLM APIs that support text embedding.
+
+The spec takes the following fields:
+
+* `api_type` (type: [`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types), required): The type of LLM API to use for embedding.
+* `model` (type: `str`, required): The name of the embedding model to use.
+* `address` (type: `str`, optional): The address of the LLM API. If not specified, the default address for the API type is used.
+* `output_dimension` (type: `int`, optional): The expected dimension of the output embedding vector. If not specified, the default dimension of the model is used.
+
+  For most API types, the function internally keeps a registry of the default output dimensions of known models.
+  You need to explicitly specify `output_dimension` if you want to use a new model that is not in the registry yet.
+
+* `task_type` (type: `str`, optional): The task type for embedding, used by some embedding models to optimize the embedding for specific use cases.
+
+:::note Supported APIs for Text Embedding
+
+Not all LLM APIs support text embedding. See the [LLM API Types table](/docs/ai/llm#llm-api-types) for which APIs support text embedding functionality.
+
+:::
+
+Input data:
+
+* `text` (type: `str`, required): The text to embed.
+
+Return type: `vector[float32; N]`, where `N` is the dimension of the embedding vector, determined by the model.
+
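The `output_dimension` defaulting described above can be sketched as a simple registry lookup. This is a hypothetical illustration of the behavior, not CocoIndex internals; the registry contents below are examples, not an authoritative list.

```python
# Hypothetical registry of default output dimensions for known models.
DEFAULT_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-004": 768,
}

def resolve_output_dimension(model, output_dimension=None):
    """Mimic the documented behavior: prefer an explicit value, fall back
    to the registry of known models, and fail for unknown models."""
    if output_dimension is not None:
        return output_dimension
    if model in DEFAULT_DIMENSIONS:
        return DEFAULT_DIMENSIONS[model]
    raise ValueError(
        f"Unknown model {model!r}: please specify output_dimension explicitly"
    )
```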

docs/sidebars.ts

Lines changed: 1 addition & 1 deletion
@@ -61,4 +61,4 @@ const sidebars: SidebarsConfig = {
   ],
 };
 
-export default sidebars;
+export default sidebars;

examples/code_embedding/main.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ def code_to_embedding(
     # You can also switch to Voyage embedding model:
     # return text.transform(
     #     cocoindex.functions.EmbedText(
-    #         api_type=cocoindex.llm.LlmApiType.VOYAGE,
+    #         api_type=cocoindex.LlmApiType.VOYAGE,
     #         model="voyage-code-3",
     #     )
     # )

examples/text_embedding/main.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ def text_to_embedding(
     # You can also switch to remote embedding model:
     # return text.transform(
     #     cocoindex.functions.EmbedText(
-    #         api_type=cocoindex.llm.LlmApiType.OPENAI,
+    #         api_type=cocoindex.LlmApiType.OPENAI,
     #         model="text-embedding-3-small",
     #     )
     # )
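Embedding vectors produced as in the examples above (type `vector[float32; N]`) are typically compared by cosine similarity. A minimal pure-Python sketch, independent of CocoIndex:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)
```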
