Commit e0a2088

docs: vector index 101 example (#925)
1 parent 837710a commit e0a2088

5 files changed

+44
-23
lines changed

docs/docs/examples/examples/simple_vector_index.md

Lines changed: 44 additions & 23 deletions
@@ -10,31 +10,31 @@ sidebar_custom_props:
1010
tags: [vector-index]
1111
---
1212

13-
import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton';
13+
import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/components/GitHubButton';
1414

1515
<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding"/>
1616

1717

1818
## Overview
19-
In this blog, we will build index with text embeddings and query it with natural language.
19+
In this tutorial, we will build an index with text embeddings and query it with natural language.
2020
We try to keep it minimalistic and focus on the gist of the indexing flow.
2121

2222

23-
## Prerequisites
24-
25-
- [Install Postgres](https://cocoindex.io/docs/getting_started/installation).
26-
CocoIndex uses Postgres to keep track of data lineage for incremental processing.
27-
28-
## Define Indexing Flow
23+
## Flow Overview
24+
![Flow](/img/examples/simple_vector_index/flow.png)
2925

30-
### Flow Design
31-
The flow diagram illustrates how we'll process our codebase:
3226
1. Read text files from the local filesystem
3327
2. Chunk each document
3428
3. For each chunk, embed it with a text embedding model
3529
4. Store the embeddings in a vector database for retrieval
3630

37-
### 1. Ingest the files
31+
## Prerequisites
32+
33+
- [Install Postgres](https://cocoindex.io/docs/getting_started/installation).
34+
CocoIndex uses Postgres to keep track of data lineage for incremental processing.
35+
36+
37+
## Add Source
3838

3939
```python
4040
@cocoindex.flow_def(name="TextEmbedding")
@@ -48,12 +48,13 @@ def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
4848
doc_embeddings = data_scope.add_collector()
4949
```
5050

51-
`flow_builder.add_source` will create a table with sub fields (`filename`, `content`), we can refer to the [documentation](https://cocoindex.io/docs/ops/sources) for more details.
51+
`flow_builder.add_source` creates a table with sub-fields (`filename`, `content`).
52+
<DocumentationButton url="https://cocoindex.io/docs/ops/sources" text="Source" />
5253

5354

54-
### 2. Process each file and collect the embeddings
55+
## Process each file and collect the embeddings
5556

56-
#### 2.1 Chunk the file
57+
### Chunk the file
5758

5859
```python
5960
with data_scope["documents"].row() as doc:
@@ -62,11 +63,13 @@ with data_scope["documents"].row() as doc:
6263
language="markdown", chunk_size=2000, chunk_overlap=500)
6364
```
6465

66+
![Chunking](/img/examples/simple_vector_index/chunk.png)
6567

68+
<DocumentationButton url="https://cocoindex.io/docs/ops/functions#splitrecursively" text="SplitRecursively" />
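
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of overlapping chunking in plain Python. It is not the `SplitRecursively` implementation: the helper name `chunk_text` is hypothetical, the sketch assumes sizes are counted in characters, and it cuts at fixed offsets rather than looking for natural boundaries in the text.

```python
def chunk_text(text: str, chunk_size: int = 2000, chunk_overlap: int = 500) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters.

    Hypothetical helper for illustration only, not part of CocoIndex.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    # Small sizes so the overlap is easy to see.
    print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With the values used in the flow (2000 and 500), each window in this sketch would start 1500 characters after the previous one, so text near a chunk boundary appears in two chunks.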
6669

67-
#### 2.2 Embed each chunk
70+
### Embed each chunk
6871

69-
```
72+
```python
7073
@cocoindex.transform_flow()
7174
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
7275
"""
@@ -77,14 +80,17 @@ def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[lis
7780
cocoindex.functions.SentenceTransformerEmbed(
7881
model="sentence-transformers/all-MiniLM-L6-v2"))
7982
```
83+
![Embedding](/img/examples/simple_vector_index/embed.png)
8084

8185
This code defines a transformation function that converts text into vector embeddings using the SentenceTransformer model.
8286
`@cocoindex.transform_flow()` is needed to share the transformation across indexing and query.
83-
This decorator marks this as a reusable transformation flow that can be called on specific input data from user code using `eval()`, as shown in the search function below.
87+
This decorator marks this as a reusable transformation flow that can be called on specific input data from user code using `eval()`, as shown in the search function below.
8488

85-
The function uses CocoIndex's built-in `SentenceTransformerEmbed` function to convert the input text into 384-dimensional embeddings
8689
The `MiniLM-L6-v2` model is a good balance of speed and quality for text embeddings, though you can swap in other SentenceTransformer models as needed.
8790

91+
<DocumentationButton url="https://cocoindex.io/docs/ops/functions#sentencetransformerembed" text="SentenceTransformerEmbed" margin="0 0 16px 0" />
92+
93+
Plug in the `text_to_embedding` function and collect the embeddings.
8894

8995
```python
9096
with doc["chunks"].row() as chunk:
@@ -93,8 +99,7 @@ with doc["chunks"].row() as chunk:
9399
text=chunk["text"], embedding=chunk["embedding"])
94100
```
95101

96-
97-
#### 2.3 Export the embeddings
102+
## Export the embeddings
98103

99104
Export the embeddings to a table in Postgres.
100105

@@ -108,10 +113,14 @@ doc_embeddings.export(
108113
field_name="embedding",
109114
metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
110115
```
116+
CocoIndex supports other vector databases as well; switching targets is a one-line change.
117+
<DocumentationButton url="https://cocoindex.io/docs/ops/targets" text="Targets" />
118+
119+
## Query the index
111120

112-
### 3. Query the index
121+
CocoIndex doesn't provide an additional query interface at the moment. We can write SQL or rely on the query engine of the target storage, if any.
113122

114-
CocoIndex doesn't provide additional query interface. We can write SQL or rely on the query engine by the target storage, if any.
123+
<DocumentationButton url="https://cocoindex.io/docs/ops/targets#postgres" text="Postgres" margin="0 0 16px 0" />
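
As a rough mental model of what a cosine-similarity query computes, the sketch below ranks stored rows against a query vector by brute force in plain Python. It is only an illustration under stated assumptions: the names `cosine_similarity` and `top_k` are hypothetical, the rows are toy two-dimensional vectors, and a real vector index in the target store would not scan every row like this.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          rows: list[tuple[str, list[float]]],
          k: int = 5) -> list[tuple[float, str]]:
    """Return the k (score, text) pairs with the highest cosine similarity to query_vec."""
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy data standing in for (text, embedding) rows in the exported table.
    rows = [("x", [1.0, 0.0]), ("y", [0.0, 1.0]), ("xy", [1.0, 1.0])]
    for score, text in top_k([1.0, 0.0], rows, k=2):
        print(f"{score:.3f}  {text}")
```

In the actual example, the query vector would come from the same `text_to_embedding` transformation used at indexing time, which is why that transformation is shared between indexing and querying.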
115124

116125
```python
117126
def search(pool: ConnectionPool, query: str, top_k: int = 5):
@@ -167,4 +176,16 @@ if __name__ == "__main__":
167176
- Start the interactive query in terminal.
168177
```sh
169178
python main.py
170-
```
179+
```
180+
181+
182+
## CocoInsight
183+
184+
You can walk through the project step by step in [CocoInsight](https://www.youtube.com/watch?v=MMrpUfUcZPk) to see exactly how each field is constructed and what happens behind the scenes.
185+
186+
187+
```sh
188+
cocoindex server -ci main.py
189+
```
190+
191+
Then open `https://cocoindex.io/cocoinsight`. It connects to your local CocoIndex server, with zero pipeline data retention.
Binary image files changed (previews not shown): 282 KB, -79.6 KB, 118 KB, 46.7 KB.
