Commit 915b85d

update text embedding qdrant with its native query path (#521)
1 parent 8f0281d commit 915b85d

4 files changed: 96 additions and 69 deletions


examples/text_embedding/README.md

Lines changed: 7 additions & 7 deletions
@@ -4,20 +4,20 @@
 
 In this example, we will build index flow from text embedding from local markdown files, and query the index.
 
-We appreicate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
+We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
 
-## Steps:
+## Steps
 🌱 A detailed step by step tutorial can be found here: [Get Started Documentation](https://cocoindex.io/docs/getting_started/quickstart)
 
-### Indexing Flow:
+### Indexing Flow
 <img width="461" alt="Screenshot 2025-05-19 at 5 48 28 PM" src="https://github.com/user-attachments/assets/b6825302-a0c7-4b86-9a2d-52da8286b4bd" />
 
-1. We will ingest from a list of local files.
-2. For each file, perform chunking (Recursive Split) and then embeddings.
+1. We will ingest a list of local files.
+2. For each file, perform chunking (recursively split) and then embedding.
 3. We will save the embeddings and the metadata in Postgres with PGVector.
 
-### Query:
-We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow.
+### Query
+We will match against user-provided text by a SQL query, and reuse the embedding operation in the indexing flow.
 
 
 ## Prerequisite
Lines changed: 60 additions & 42 deletions
@@ -1,69 +1,87 @@
-## Description
+# Build text embedding and semantic search 🔍 with Qdrant
+
+[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
+
+CocoIndex supports Qdrant natively - [documentation](https://cocoindex.io/docs/ops/storages#qdrant). In this example, we will build index flow from text embedding from local markdown files, and query the index. We will use **Qdrant** as the vector database.
+
+We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
+
+<img width="860" alt="CocoIndex supports Qdrant" src="https://github.com/user-attachments/assets/a9deecfa-dd94-4b97-a1b1-90488d8178df" />
+
+## Steps
+### Indexing Flow
+<img width="480" alt="Index flow for text embedding" src="https://github.com/user-attachments/assets/44d47b5e-b49b-4f05-9a00-dcb8027602a1" />
+
+1. We will ingest a list of local files.
+2. For each file, perform chunking (recursively split) and then embedding.
+3. We will save the embeddings and the metadata in Postgres with PGVector.
+
+### Query
+We use Qdrant client to query the index, and reuse the embedding operation in the indexing flow.
 
-Example to build a vector index in Qdrant based on local files.
 
 ## Pre-requisites
 
-- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
+- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. Although the target store is Qdrant, CocoIndex uses Postgress to track the data lineage for incremental processing.
 
 - Run Qdrant.
 
-```bash
-docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant
-```
+```bash
+docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant
+```
 
 - [Create a collection](https://qdrant.tech/documentation/concepts/vectors/#named-vectors) to export the embeddings to.
 
-```bash
-curl -X PUT \
-'http://localhost:6333/collections/cocoindex' \
---header 'Content-Type: application/json' \
---data-raw '{
-"vectors": {
-"text_embedding": {
-"size": 384,
-"distance": "Cosine"
-}
-}
-}'
-```
-
-You can view the collections and data with the Qdrant dashboard at <http://localhost:6333/dashboard>.
+```bash
+curl -X PUT \
+'http://localhost:6333/collections/cocoindex' \
+--header 'Content-Type: application/json' \
+--data-raw '{
+"vectors": {
+"text_embedding": {
+"size": 384,
+"distance": "Cosine"
+}
+}
+}'
+```
+
+You can view the collections and data with the Qdrant dashboard at <http://localhost:6333/dashboard>.
 
 ## Run
 
-Install dependencies:
+- Install dependencies:
 
-```bash
-pip install -e .
-```
+```bash
+pip install -e .
+```
 
-Setup:
+- Setup:
 
-```bash
-python main.py cocoindex setup
-```
+```bash
+python main.py cocoindex setup
+```
 
-Update index:
+- Update index:
 
-```bash
-python main.py cocoindex update
-```
+```bash
+python main.py cocoindex update
+```
 
-Run:
+- Run:
 
-```bash
-python main.py
-```
+```bash
+python main.py
+```
 
 ## CocoInsight
-
-CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9).
-
-Run CocoInsight to understand your RAG data pipeline:
+I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
+It just connects to your local CocoIndex server, with Zero pipeline data retention. Run following command to start CocoInsight:
 
 ```bash
 python main.py cocoindex server -ci
 ```
 
-Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
+Open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
+
+
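
The README's "Create a collection" step sends a JSON body that declares one named vector, `text_embedding`, with size 384 and cosine distance. The size matches the output dimension of `all-MiniLM-L6-v2`. The following is a minimal sketch of building that same body in Python with the standard library only. The helper name `named_vector_config` is hypothetical and not part of this commit, and the sketch does not contact a Qdrant server.

```python
import json


def named_vector_config(vector_name: str, size: int, distance: str = "Cosine") -> dict:
    """Build a request body declaring a single named vector.

    Hypothetical helper for illustration; it mirrors the JSON passed to
    curl in the README and performs no network calls.
    """
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return {"vectors": {vector_name: {"size": size, "distance": distance}}}


# Same values as the curl example in the README.
body = named_vector_config("text_embedding", 384)
print(json.dumps(body, indent=2))
```

The vector name chosen here has to match the name used at query time, which is why `main.py` passes `"text_embedding"` alongside the query embedding.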

examples/text_embedding_qdrant/main.py

Lines changed: 28 additions & 19 deletions
@@ -1,21 +1,26 @@
 from dotenv import load_dotenv
+from qdrant_client import QdrantClient
+from qdrant_client.http.models import Filter, FieldCondition, MatchValue
 
 import cocoindex
 
+# Define Qdrant connection constants
+QDRANT_GRPC_URL = "http://localhost:6334"
+QDRANT_COLLECTION = "cocoindex"
 
-def text_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
+
+@cocoindex.transform_flow()
+def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
     """
     Embed the text using a SentenceTransformer model.
     This is a shared logic between indexing and querying, so extract it as a function.
     """
     return text.transform(
         cocoindex.functions.SentenceTransformerEmbed(
-            model="sentence-transformers/all-MiniLM-L6-v2"
-        )
-    )
+            model="sentence-transformers/all-MiniLM-L6-v2"))
 
 
-@cocoindex.flow_def(name="TextEmbedding")
+@cocoindex.flow_def(name="TextEmbeddingWithQdrant")
 def text_embedding_flow(
     flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
 ):
@@ -50,35 +55,39 @@ def text_embedding_flow(
     doc_embeddings.export(
         "doc_embeddings",
         cocoindex.storages.Qdrant(
-            collection_name="cocoindex", grpc_url="http://localhost:6334/"
+            collection_name=QDRANT_COLLECTION, grpc_url=QDRANT_GRPC_URL
         ),
         primary_key_fields=["id"],
         setup_by_user=True,
     )
 
 
-query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
-    name="SemanticsSearch",
-    flow=text_embedding_flow,
-    target_name="doc_embeddings",
-    query_transform_flow=text_to_embedding,
-    default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
-)
-
-
 @cocoindex.main_fn()
 def _run():
+    # Initialize Qdrant client
+    client = QdrantClient(url=QDRANT_GRPC_URL, prefer_grpc=True)
+
     # Run queries in a loop to demonstrate the query capabilities.
     while True:
         try:
             query = input("Enter search query (or Enter to quit): ")
             if query == "":
                 break
-            results, _ = query_handler.search(query, 10, "text_embedding")
+
+            # Get the embedding for the query
+            query_embedding = text_to_embedding.eval(query)
+
+            search_results = client.search(
+                collection_name=QDRANT_COLLECTION,
+                query_vector=("text_embedding", query_embedding),
+                limit=10
+            )
             print("\nSearch results:")
-            for result in results:
-                print(f"[{result.score:.3f}] {result.data['filename']}")
-                print(f" {result.data['text']}")
+            for result in search_results:
+                score = result.score
+                payload = result.payload
+                print(f"[{score:.3f}] {payload['filename']}")
+                print(f" {payload['text']}")
                 print("---")
             print()
         except KeyboardInterrupt:
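
The new query loop reads only two attributes from each search hit, `score` and `payload`, and prints the `filename` and `text` payload fields. The following is a minimal sketch of that formatting step pulled into a function that runs without a Qdrant server. `Hit` and `format_hits` are hypothetical names introduced for illustration. `Hit` is a stand-in for whatever object the client returns, and the `.get` fallbacks for missing payload keys are an assumption of this sketch, not behavior of the committed code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Hit:
    """Hypothetical stand-in for a search hit exposing `score` and `payload`."""
    score: float
    payload: dict[str, Any] = field(default_factory=dict)


def format_hits(hits: list[Hit]) -> list[str]:
    """Render hits in the same shape the query loop prints them."""
    lines: list[str] = []
    for hit in hits:
        # Fall back to placeholders instead of raising KeyError (assumption).
        filename = hit.payload.get("filename", "<unknown>")
        text = hit.payload.get("text", "")
        lines.append(f"[{hit.score:.3f}] {filename}")
        lines.append(f"    {text}")
        lines.append("---")
    return lines


hits = [Hit(0.8123, {"filename": "a.md", "text": "hello"})]
print("\n".join(format_hits(hits)))
```

Keeping the formatting separate from the client call makes the output shape testable on its own.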

examples/text_embedding_qdrant/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ name = "text-embedding-qdrant"
 version = "0.1.0"
 description = "Simple example for cocoindex: build embedding index based on local text files."
 requires-python = ">=3.10"
-dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1"]
+dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1", "qdrant-client>=1.6.0"]
 
 [tool.setuptools]
 packages = []
