Commit 89be497

example(lancedb): add lancedb target example (#1042)
* example(lancedb): add lancedb target example
* chore: update `/README.md`
1 parent e08ee93 commit 89be497

7 files changed, +550 -0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -181,6 +181,7 @@ It defines an index flow like this:
 | [Google Drive Text Embedding](examples/gdrive_text_embedding) | Index text documents from Google Drive |
 | [Docs to Knowledge Graph](examples/docs_to_knowledge_graph) | Extract relationships from Markdown documents and build a knowledge graph |
 | [Embeddings to Qdrant](examples/text_embedding_qdrant) | Index documents in a Qdrant collection for semantic search |
+| [Embeddings to LanceDB](examples/text_embedding_lancedb) | Index documents in a LanceDB table for semantic search |
 | [FastAPI Server with Docker](examples/fastapi_server_docker) | Run the semantic search server in a Dockerized FastAPI setup |
 | [Product Recommendation](examples/product_recommendation) | Build real-time product recommendations with LLM and graph database|
 | [Image Search with Vision API](examples/image_search) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend|
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+# Postgres database address for cocoindex
+COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
+
+# Fall back to CPU for operations not supported by MPS on Mac.
+# It's a no-op on other platforms.
+PYTORCH_ENABLE_MPS_FALLBACK=1
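
Note on `COCOINDEX_DATABASE_URL`: the value above packs the user, password, host, and database name into one URL. The sketch below only illustrates how those parts break down, using Python's standard `urllib.parse`. The helper name `describe_database_url` is hypothetical and is not part of CocoIndex; how CocoIndex itself parses the variable is not shown in this commit.

```python
from urllib.parse import urlparse


def describe_database_url(url: str) -> dict:
    """Split a Postgres-style URL into its components (illustrative helper)."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        # No port is given in the example URL, so this is None here.
        "port": parts.port,
        # The path is "/cocoindex"; the database name is the part after "/".
        "database": parts.path.lstrip("/"),
    }


info = describe_database_url("postgres://cocoindex:cocoindex@localhost/cocoindex")
print(info)
```

Because the URL carries no explicit port, a client would presumably fall back to its default (5432 by Postgres convention).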
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+/lancedb_data
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+# Build text embedding and semantic search 🔍 with LanceDB
+
+[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
+
+CocoIndex supports LanceDB natively. In this example, we build an index flow that embeds text from local Markdown files, then query the index. We use **LanceDB** as the vector database.
+
+We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful.
+
+
+## Steps
+### Indexing Flow
+
+1. We ingest a list of local files.
+2. For each file, we split the content into chunks (recursive splitting) and then embed each chunk.
+3. We save the embeddings and the metadata in LanceDB.
+
+### Query
+
+1. `search()` is a [query handler](https://cocoindex.io/docs/query#query-handler) that queries the LanceDB table using the LanceDB client.
+2. We share the embedding operation `text_to_embedding()` between indexing and querying,
+by wrapping it as a [transform flow](https://cocoindex.io/docs/query#transform-flow).
+
+## Prerequisites
+
+1. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. Although the target store is LanceDB, CocoIndex uses Postgres to track the data lineage for incremental processing.
+
+2. Install dependencies:
+
+```sh
+pip install -e .
+```
+
+LanceDB will automatically create a local database directory when you run the example (no additional setup required).
+
+## Run
+
+Update the index. The first run also sets up the LanceDB tables:
+
+```bash
+cocoindex update --setup main
+```
+
+You can also run the command with `-L`, which will watch for file changes and update the index automatically.
+
+```bash
+cocoindex update --setup -L main
+```
+
+## CocoInsight
+I used CocoInsight (currently in free beta) to troubleshoot the index generation and understand the data lineage of the pipeline.
+It connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:
+
+```bash
+cocoindex server -ci -L main
+```
+
+Open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
+You can run queries in the CocoInsight UI.
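
Note on the chunking step of the indexing flow: the example passes `chunk_size=500` and `chunk_overlap=100` to `SplitRecursively`. The sketch below is a deliberately simplified, fixed-width sliding window that only shows what the two parameters mean. It is an assumption-laden illustration, not CocoIndex's recursive, syntax-aware splitter, and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(
    text: str, chunk_size: int = 500, chunk_overlap: int = 100
) -> list[str]:
    """Cut text into windows of chunk_size characters, where each window
    starts chunk_overlap characters before the previous one ends."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(chr(65 + i % 26) for i in range(1200))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

With these settings, consecutive chunks share 100 characters, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is no longer than the overlap.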
Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
+from dotenv import load_dotenv
+import datetime
+import cocoindex
+import math
+import cocoindex.targets.lancedb as coco_lancedb
+
+# Define LanceDB connection constants
+LANCEDB_URI = "./lancedb_data"  # Local directory for LanceDB
+LANCEDB_TABLE = "TextEmbedding"
+
+
+@cocoindex.transform_flow()
+def text_to_embedding(
+    text: cocoindex.DataSlice[str],
+) -> cocoindex.DataSlice[list[float]]:
+    """
+    Embed the text using a SentenceTransformer model.
+    This logic is shared between indexing and querying, so it is extracted as a function.
+    """
+    return text.transform(
+        cocoindex.functions.SentenceTransformerEmbed(
+            model="sentence-transformers/all-MiniLM-L6-v2"
+        )
+    )
+
+
+@cocoindex.flow_def(name="TextEmbeddingWithLanceDB")
+def text_embedding_flow(
+    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
+) -> None:
+    """
+    Define an example flow that embeds text into a vector database.
+    """
+    data_scope["documents"] = flow_builder.add_source(
+        cocoindex.sources.LocalFile(path="markdown_files"),
+        refresh_interval=datetime.timedelta(seconds=5),
+    )
+
+    doc_embeddings = data_scope.add_collector()
+
+    with data_scope["documents"].row() as doc:
+        doc["chunks"] = doc["content"].transform(
+            cocoindex.functions.SplitRecursively(),
+            language="markdown",
+            chunk_size=500,
+            chunk_overlap=100,
+        )
+
+        with doc["chunks"].row() as chunk:
+            chunk["embedding"] = text_to_embedding(chunk["text"])
+            doc_embeddings.collect(
+                id=cocoindex.GeneratedField.UUID,
+                filename=doc["filename"],
+                location=chunk["location"],
+                text=chunk["text"],
+                # 'text_embedding' is the name of the vector column in the LanceDB table.
+                text_embedding=chunk["embedding"],
+            )
+
+    doc_embeddings.export(
+        "doc_embeddings",
+        coco_lancedb.LanceDB(db_uri=LANCEDB_URI, table_name=LANCEDB_TABLE),
+        primary_key_fields=["id"],
+        # The vector index cannot be enabled while the table has no data yet, as LanceDB requires data to train the index.
+        # See: https://github.com/lancedb/lance/issues/4034
+        #
+        # vector_indexes=[
+        #     cocoindex.VectorIndexDef(
+        #         "text_embedding", cocoindex.VectorSimilarityMetric.L2_DISTANCE
+        #     ),
+        # ],
+    )
+
+
+@text_embedding_flow.query_handler(
+    result_fields=cocoindex.QueryHandlerResultFields(
+        embedding=["embedding"],
+        score="score",
+    ),
+)
+async def search(query: str) -> cocoindex.QueryOutput:
+    print("Searching...", query)
+    db = await coco_lancedb.connect_async(LANCEDB_URI)
+    table = await db.open_table(LANCEDB_TABLE)
+
+    # Get the embedding for the query
+    query_embedding = await text_to_embedding.eval_async(query)
+
+    search = await table.search(query_embedding, vector_column_name="text_embedding")
+    search_results = await search.limit(5).to_list()
+
+    print(search_results)
+
+    return cocoindex.QueryOutput(
+        results=[
+            {
+                "filename": result["filename"],
+                "text": result["text"],
+                "embedding": result["text_embedding"],
+                # LanceDB's L2 "distance" is squared, so we take the square root to align with normal L2 distance
+                "score": math.sqrt(result["_distance"]),
+            }
+            for result in search_results
+        ],
+        query_info=cocoindex.QueryInfo(
+            embedding=query_embedding,
+            similarity_metric=cocoindex.VectorSimilarityMetric.L2_DISTANCE,
+        ),
+    )
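
Note on the `score` field: the `search()` handler takes `math.sqrt(result["_distance"])` before returning a score. The sketch below illustrates that conversion in plain Python, on the assumption (implied by the handler) that the `_distance` value returned for the L2 metric is the squared Euclidean distance. The helper names `squared_l2` and `l2_score` are hypothetical and exist only for this illustration.

```python
import math


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance, the form `_distance` is assumed to take."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_score(squared_distance: float) -> float:
    """Convert a squared L2 distance into a plain L2 distance."""
    return math.sqrt(squared_distance)


query = [0.0, 0.0]
stored = [3.0, 4.0]

raw = squared_l2(query, stored)  # 3^2 + 4^2 = 25.0
print(raw, l2_score(raw))        # prints: 25.0 5.0
```

The square root does not change the ranking of results, since it is monotonic. It only puts the reported score on the same scale as the `L2_DISTANCE` metric declared in `query_info`.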
