Commit 86fc14c

docs: examples/text_embedding_qdrant
Signed-off-by: Anush008 <[email protected]>
1 parent 67e3608 commit 86fc14c

9 files changed: +539 -76 lines changed
docs/docs/ops/storages.md

Lines changed: 18 additions & 5 deletions
@@ -27,8 +27,21 @@ The spec takes the following fields:
 
 * `api_key` (type: `str`, optional). API key to authenticate requests with.
 
-The field name for the vector embeddings must match the [vector name](https://qdrant.tech/documentation/concepts/vectors/#named-vectors) used when the collection was created.
-
-If no primary key is set during export, a random UUID is used as the Qdrant point ID.
-
-You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding).
+Before exporting, you must create a collection with a [vector name](https://qdrant.tech/documentation/concepts/vectors/#named-vectors) that matches the vector field name in CocoIndex, and set `setup_by_user=True` during export.
+
+Example:
+
+```python
+doc_embeddings.export(
+    "doc_embeddings",
+    cocoindex.storages.Qdrant(
+        collection_name="cocoindex",
+        grpc_url="http://xyz-example.cloud-region.cloud-provider.cloud.qdrant.io:6334/",
+        api_key="<your-api-key-here>",
+    ),
+    primary_key_fields=["id_field"],
+    setup_by_user=True,
+)
+```
+
+You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_qdrant).
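Reviewer note: the new doc text requires the collection to exist before export, with a vector name equal to the CocoIndex field name. As a minimal sketch of that step, the snippet below builds the same REST request as the `curl` command in the Qdrant example README of this commit, using only the Python standard library. The helper name `build_create_collection_request` is hypothetical, and the vector name `text_embedding`, size `384`, and port `6333` are taken from that example rather than required values.

```python
import json
import urllib.request


def build_create_collection_request(
    base_url: str,
    collection_name: str,
    vector_name: str,
    size: int,
    distance: str = "Cosine",
) -> urllib.request.Request:
    """Build (but do not send) a PUT request creating a collection with one named vector.

    Hypothetical helper for illustration; it mirrors the curl call in the example README.
    """
    # The key under "vectors" is the vector name; it must equal the field name
    # collected in the flow (e.g. `text_embedding=chunk["embedding"]`).
    body = {"vectors": {vector_name: {"size": size, "distance": distance}}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/collections/{collection_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_collection_request(
    "http://localhost:6333", "cocoindex", "text_embedding", size=384
)
# To actually create the collection against a running Qdrant instance:
# urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the payload easy to inspect, which helps when checking that the vector name matches the exported field.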

examples/text_embedding/README.md

Lines changed: 6 additions & 34 deletions
@@ -1,34 +1,7 @@
-## Description
+Simple example for cocoindex: build embedding index based on local files.
 
-Example to build a vector index in Qdrant based on local files.
-
-## Pre-requisites
-
-- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
-
-- Run Qdrant.
-
-```bash
-docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant
-```
-
-- [Create a collection](https://qdrant.tech/documentation/concepts/vectors/#named-vectors) to export the embeddings to.
-
-```bash
-curl -X PUT \
-  'http://localhost:6333/collections/cocoindex' \
-  --header 'Content-Type: application/json' \
-  --data-raw '{
-    "vectors": {
-      "text_embedding": {
-        "size": 384,
-        "distance": "Cosine"
-      }
-    }
-  }'
-```
-
-You can view the collections and data with the Qdrant dashboard at <http://localhost:6333/dashboard>.
+## Prerequisite
+[Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
 
 ## Run
 
@@ -56,14 +29,13 @@ Run:
 python main.py
 ```
 
-## CocoInsight
-
+## CocoInsight
 CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9).
 
 Run CocoInsight to understand your RAG data pipeline:
 
-```bash
+```
 python main.py cocoindex server -c https://cocoindex.io
 ```
 
-Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
+Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).

examples/text_embedding/main.py

Lines changed: 13 additions & 36 deletions
@@ -2,79 +2,57 @@
 
 import cocoindex
 
-
 def text_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
     """
     Embed the text using a SentenceTransformer model.
     This is a shared logic between indexing and querying, so extract it as a function.
     """
     return text.transform(
         cocoindex.functions.SentenceTransformerEmbed(
-            model="sentence-transformers/all-MiniLM-L6-v2"
-        )
-    )
-
+            model="sentence-transformers/all-MiniLM-L6-v2"))
 
 @cocoindex.flow_def(name="TextEmbedding")
-def text_embedding_flow(
-    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
-):
+def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
     """
     Define an example flow that embeds text into a vector database.
     """
     data_scope["documents"] = flow_builder.add_source(
-        cocoindex.sources.LocalFile(path="markdown_files")
-    )
+        cocoindex.sources.LocalFile(path="markdown_files"))
 
     doc_embeddings = data_scope.add_collector()
 
     with data_scope["documents"].row() as doc:
         doc["chunks"] = doc["content"].transform(
             cocoindex.functions.SplitRecursively(),
-            language="markdown",
-            chunk_size=2000,
-            chunk_overlap=500,
-        )
+            language="markdown", chunk_size=2000, chunk_overlap=500)
 
         with doc["chunks"].row() as chunk:
             chunk["embedding"] = text_to_embedding(chunk["text"])
-            doc_embeddings.collect(
-                id=cocoindex.GeneratedField.UUID,
-                filename=doc["filename"],
-                location=chunk["location"],
-                text=chunk["text"],
-                # 'text_embedding' is the name of the vector we've created the Qdrant collection with.
-                text_embedding=chunk["embedding"],
-            )
+            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
+                                   text=chunk["text"], embedding=chunk["embedding"])
 
     doc_embeddings.export(
         "doc_embeddings",
-        cocoindex.storages.Qdrant(
-            collection_name="cocoindex", grpc_url="http://localhost:6334/"
-        ),
-        primary_key_fields=["id"],
-        setup_by_user=True,
-    )
-
+        cocoindex.storages.Postgres(),
+        primary_key_fields=["filename", "location"],
+        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
 
 query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
     name="SemanticsSearch",
     flow=text_embedding_flow,
     target_name="doc_embeddings",
     query_transform_flow=text_to_embedding,
-    default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
-)
-
+    default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
 
 @cocoindex.main_fn()
 def _run():
     # Run queries in a loop to demonstrate the query capabilities.
     while True:
         try:
             query = input("Enter search query (or Enter to quit): ")
-            if query == "":
+            if query == '':
                 break
-            results, _ = query_handler.search(query, 10, "text_embedding")
+            results, _ = query_handler.search(query, 10)
             print("\nSearch results:")
             for result in results:
                 print(f"[{result.score:.3f}] {result.data['filename']}")
@@ -84,7 +62,6 @@ def _run():
         except KeyboardInterrupt:
             break
 
-
 if __name__ == "__main__":
     load_dotenv(override=True)
-    _run()
+    _run()

examples/text_embedding/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -3,4 +3,4 @@ name = "text-embedding"
 version = "0.1.0"
 description = "Simple example for cocoindex: build embedding index based on local text files."
 requires-python = ">=3.10"
-dependencies = ["cocoindex>=0.1.19", "python-dotenv>=1.0.1"]
+dependencies = ["cocoindex>=0.1.19", "python-dotenv>=1.0.1"]
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+# Postgres database address for cocoindex
+COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+## Description
+
+Example to build a vector index in Qdrant based on local files.
+
+## Pre-requisites
+
+- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
+
+- Run Qdrant.
+
+```bash
+docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant
+```
+
+- [Create a collection](https://qdrant.tech/documentation/concepts/vectors/#named-vectors) to export the embeddings to.
+
+```bash
+curl -X PUT \
+  'http://localhost:6333/collections/cocoindex' \
+  --header 'Content-Type: application/json' \
+  --data-raw '{
+    "vectors": {
+      "text_embedding": {
+        "size": 384,
+        "distance": "Cosine"
+      }
+    }
+  }'
+```
+
+You can view the collections and data with the Qdrant dashboard at <http://localhost:6333/dashboard>.
+
+## Run
+
+Install dependencies:
+
+```bash
+pip install -e .
+```
+
+Setup:
+
+```bash
+python main.py cocoindex setup
+```
+
+Update index:
+
+```bash
+python main.py cocoindex update
+```
+
+Run:
+
+```bash
+python main.py
+```
+
+## CocoInsight
+
+CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9).
+
+Run CocoInsight to understand your RAG data pipeline:
+
+```bash
+python main.py cocoindex server -c https://cocoindex.io
+```
+
+Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+from dotenv import load_dotenv
+
+import cocoindex
+
+
+def text_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
+    """
+    Embed the text using a SentenceTransformer model.
+    This is a shared logic between indexing and querying, so extract it as a function.
+    """
+    return text.transform(
+        cocoindex.functions.SentenceTransformerEmbed(
+            model="sentence-transformers/all-MiniLM-L6-v2"
+        )
+    )
+
+
+@cocoindex.flow_def(name="TextEmbedding")
+def text_embedding_flow(
+    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
+):
+    """
+    Define an example flow that embeds text into a vector database.
+    """
+    data_scope["documents"] = flow_builder.add_source(
+        cocoindex.sources.LocalFile(path="markdown_files")
+    )
+
+    doc_embeddings = data_scope.add_collector()
+
+    with data_scope["documents"].row() as doc:
+        doc["chunks"] = doc["content"].transform(
+            cocoindex.functions.SplitRecursively(),
+            language="markdown",
+            chunk_size=2000,
+            chunk_overlap=500,
+        )
+
+        with doc["chunks"].row() as chunk:
+            chunk["embedding"] = text_to_embedding(chunk["text"])
+            doc_embeddings.collect(
+                id=cocoindex.GeneratedField.UUID,
+                filename=doc["filename"],
+                location=chunk["location"],
+                text=chunk["text"],
+                # 'text_embedding' is the name of the vector we've created the Qdrant collection with.
+                text_embedding=chunk["embedding"],
+            )
+
+    doc_embeddings.export(
+        "doc_embeddings",
+        cocoindex.storages.Qdrant(
+            collection_name="cocoindex", grpc_url="http://localhost:6334/"
+        ),
+        primary_key_fields=["id"],
+        setup_by_user=True,
+    )
+
+
+query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
+    name="SemanticsSearch",
+    flow=text_embedding_flow,
+    target_name="doc_embeddings",
+    query_transform_flow=text_to_embedding,
+    default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
+)
+
+
+@cocoindex.main_fn()
+def _run():
+    # Run queries in a loop to demonstrate the query capabilities.
+    while True:
+        try:
+            query = input("Enter search query (or Enter to quit): ")
+            if query == "":
+                break
+            results, _ = query_handler.search(query, 10, "text_embedding")
+            print("\nSearch results:")
+            for result in results:
+                print(f"[{result.score:.3f}] {result.data['filename']}")
+                print(f" {result.data['text']}")
+                print("---")
+                print()
+        except KeyboardInterrupt:
+            break
+
+
+if __name__ == "__main__":
+    load_dotenv(override=True)
+    _run()
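Reviewer note: both example flows rank results with `COSINE_SIMILARITY`, and the Qdrant collection is created with `"distance": "Cosine"`. For readers unfamiliar with the metric, here is a minimal pure-Python sketch of what such a score measures. It is an illustration only, not how CocoIndex, Postgres, or Qdrant compute it, and the names `cosine_similarity` and `rank_by_similarity` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    top_k: int,
) -> List[Tuple[float, str]]:
    """Score every candidate against the query and return the top_k, best first."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in candidates.items()]
    return sorted(scored, reverse=True)[:top_k]


# Toy 2-dimensional vectors; the embeddings in the example have 384 dimensions.
toy_results = rank_by_similarity(
    [1.0, 0.0], {"a.md": [1.0, 0.0], "b.md": [0.0, 1.0]}, top_k=1
)
```

A real vector store avoids this linear scan by using an index, but the resulting score has the same meaning: values near 1 indicate vectors pointing in nearly the same direction.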
