Commit a6b8c95

example: update docs/examples for quickstart to direct query
1 parent 8d53195 commit a6b8c95

3 files changed

+157
-45
lines changed

docs/docs/getting_started/quickstart.md

Lines changed: 125 additions & 38 deletions
@@ -121,46 +121,14 @@ Notes:
121121

122122
6. In CocoIndex, a *collector* collects multiple entries of data together. In this example, the `doc_embeddings` collector collects data from all `chunk`s across all `doc`s, and uses the collected data to build a vector index `"doc_embeddings"` in `Postgres`.
123123

124-
### Step 2.2: Define the query handler
124+
### Step 2.2: Define the main function
125125

126-
Starting from the query handler:
127-
128-
```python title="quickstart.py"
129-
query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
130-
name="SemanticsSearch",
131-
flow=text_embedding_flow,
132-
target_name="doc_embeddings",
133-
query_transform_flow=lambda text: text.transform(
134-
cocoindex.functions.SentenceTransformerEmbed(
135-
model="sentence-transformers/all-MiniLM-L6-v2")),
136-
default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
137-
```
138-
139-
This handler queries the vector index `"doc_embeddings"`, and uses the same embedding model `"sentence-transformers/all-MiniLM-L6-v2"` to transform query text into vectors for similarity matching.
140-
141-
142-
### Step 2.3: Define the main function
143-
144-
The main function is used to interact with users and run queries using the query handler above.
126+
We can provide an empty main function for now, with a `@cocoindex.main_fn()` decorator:
145127
146128
```python title="quickstart.py"
147129
@cocoindex.main_fn()
148130
def _main():
149-
# Run queries to demonstrate the query capabilities.
150-
while True:
151-
try:
152-
query = input("Enter search query (or Enter to quit): ")
153-
if query == '':
154-
break
155-
results, _ = query_handler.search(query, 10)
156-
print("\nSearch results:")
157-
for result in results:
158-
print(f"[{result.score:.3f}] {result.data['filename']}")
159-
print(f" {result.data['text']}")
160-
print("---")
161-
print()
162-
except KeyboardInterrupt:
163-
break
131+
pass
164132

165133
if __name__ == "__main__":
166134
_main()
@@ -171,7 +139,6 @@ The `@cocoindex.main_fn` declares a function as the main function for an indexin
171139
* Initialize the CocoIndex library states. Settings (e.g. database URL) are loaded from environment variables by default.
172140
* When the CLI is invoked with the `cocoindex` subcommand, the `cocoindex` CLI takes over control and provides convenient ways to manage the index. See the next step for more details.
173141
174-
175142
## Step 3: Run the indexing pipeline and queries
176143
177144
Specify the database URL by environment variable:
@@ -206,9 +173,129 @@ It will run for a few seconds and output the following statistics:
206173
documents: 3 added, 0 removed, 0 updated
207174
```
208175
209-
### Step 3.3: Run queries against the index
176+
## Step 4 (optional): Run queries against the index
177+
178+
CocoIndex excels at transforming your data and storing it (a.k.a. indexing).
179+
The goal of transforming your data is usually to query against it.
180+
Once your index is built, you can directly access the transformed data in the target database.
181+
CocoIndex also provides utilities for you to do this more seamlessly.
182+
183+
In this example, we'll use the [`psycopg` library](https://www.psycopg.org/) to connect to the database and run queries.
184+
Please make sure it's installed:
185+
186+
```bash
187+
pip install "psycopg[binary,pool]"
188+
```
189+
190+
### Step 4.1: Extract common transformations
191+
192+
Your indexing flow and the query logic share one piece of transformation: computing the embedding of a text.
193+
That is, both should use exactly the same embedding model and parameters.
194+
195+
Let's extract that into a function:
196+
197+
```python title="quickstart.py"
198+
@cocoindex.transform_flow()
199+
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
200+
return text.transform(
201+
cocoindex.functions.SentenceTransformerEmbed(
202+
model="sentence-transformers/all-MiniLM-L6-v2"))
203+
```
204+
205+
`cocoindex.DataSlice[str]` represents certain data in the flow (e.g. a field in a data scope), with type `str` at runtime.
206+
Like `text_embedding_flow()` above, `text_to_embedding()` constructs the flow rather than performing the computation directly,
207+
so the type it takes is `cocoindex.DataSlice[str]` instead of `str`.
208+
See [Data Slice](../core/flow_def#data-slice) for more details.
209+
210+
211+
Then the corresponding code in the indexing flow can be simplified by calling this function:
212+
213+
```python title="quickstart.py"
214+
...
215+
# Transform data of each chunk
216+
with doc["chunks"].row() as chunk:
217+
# Embed the chunk, put into `embedding` field
218+
chunk["embedding"] = text_to_embedding(chunk["text"])
219+
220+
# Collect the chunk into the collector.
221+
doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
222+
text=chunk["text"], embedding=chunk["embedding"])
223+
...
224+
```
225+
226+
The function decorator `@cocoindex.transform_flow()` is used to declare a function as a CocoIndex transform flow,
227+
i.e., a sub flow only performing transformations, without importing data from sources or exporting data to targets.
228+
The decorator is needed for evaluating the flow with specific input data in Step 4.2 below.
229+
230+
### Step 4.2: Provide the query logic
231+
232+
Now we can create a function to query the index for a given input query:
233+
234+
```python title="quickstart.py"
235+
from psycopg_pool import ConnectionPool
236+
237+
def search(pool: ConnectionPool, query: str, top_k: int = 5):
238+
# Get the table name, for the export target in the text_embedding_flow above.
239+
table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings")
240+
# Evaluate the transform flow defined above with the input query, to get the embedding.
241+
query_vector = text_to_embedding.eval(query)
242+
# Run the query and get the results.
243+
with pool.connection() as conn:
244+
with conn.cursor() as cur:
245+
cur.execute(f"""
246+
SELECT filename, text, embedding <=> %s::vector AS distance
247+
FROM {table_name} ORDER BY distance LIMIT %s
248+
""", (query_vector, top_k))
249+
return [
250+
{"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
251+
for row in cur.fetchall()
252+
]
253+
```
254+
255+
In the function above, most parts are standard query logic - you can use any libraries you like.
256+
Two pieces of logic are CocoIndex-specific:
257+
258+
1. Get the table name from the export target in the `text_embedding_flow` above.
259+
Since the table name for the `Postgres` target is not explicitly specified in the `export()` call,
260+
CocoIndex uses a default name.
261+
`cocoindex.utils.get_target_storage_default_name()` is a utility function to get the default table name for this case.
262+
263+
2. Evaluate the transform flow defined above with the input query, to get the embedding.
264+
It's done by the `eval()` method of the transform flow `text_to_embedding`.
265+
The return type of this method is `list[float]` as declared in the `text_to_embedding()` function (`cocoindex.DataSlice[list[float]]`).
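
The `score` returned by `search()` is `1.0 - distance`, where `<=>` is the cosine distance operator on `vector` values, so the score is the cosine similarity between the query embedding and the stored embedding. The following is a minimal pure-Python sketch of that arithmetic only; `cosine_distance` and `score_from_distance` are hypothetical helper names used for illustration and are not part of CocoIndex or the database.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cos(a, b), the quantity the SQL `<=>` expression yields."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def score_from_distance(distance: float) -> float:
    """Convert a cosine distance into the score used in the search results."""
    return 1.0 - distance


# Toy 2-dimensional vectors standing in for real embeddings.
query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]  # same direction, different length
orthogonal = [0.0, 3.0]      # unrelated direction

print(score_from_distance(cosine_distance(query_vec, same_direction)))
print(score_from_distance(cosine_distance(query_vec, orthogonal)))
```

Because the query sorts by ascending distance, rows with the highest score come first, and vector length does not affect the ranking.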
266+
267+
### Step 4.3: Update the main function
268+
269+
Now we can update the main function to use the query function we just defined:
270+
271+
```python title="quickstart.py"
272+
@cocoindex.main_fn()
273+
def _main():
274+
# Initialize the database connection pool.
275+
pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
276+
# Run queries in a loop to demonstrate the query capabilities.
277+
while True:
278+
try:
279+
query = input("Enter search query (or Enter to quit): ")
280+
if query == '':
281+
break
282+
# Run the query function with the database connection pool and the query.
283+
results = search(pool, query)
284+
print("\nSearch results:")
285+
for result in results:
286+
print(f"[{result['score']:.3f}] {result['filename']}")
287+
print(f" {result['text']}")
288+
print("---")
289+
print()
290+
except KeyboardInterrupt:
291+
break
292+
```
293+
294+
It interacts with users and searches the database by calling the `search()` function created in Step 4.2.
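
The main function in Step 4.3 reads the connection string with `os.getenv("COCOINDEX_DATABASE_URL")`. This requires `import os` at the top of `quickstart.py`, and `os.getenv()` returns `None` when the variable is not set. One possible way to fail early with a clear message is sketched below; `require_database_url` is a hypothetical helper for illustration, not a CocoIndex API.

```python
import os
from typing import Mapping, Optional


def require_database_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return COCOINDEX_DATABASE_URL, or raise if it is missing or empty.

    `env` defaults to the process environment; passing a mapping
    explicitly makes the check easy to exercise without touching it.
    """
    env = os.environ if env is None else env
    url = env.get("COCOINDEX_DATABASE_URL")
    if not url:
        raise RuntimeError(
            "COCOINDEX_DATABASE_URL is not set; export it before running quickstart.py"
        )
    return url


# Example with an explicit mapping (the URL is a placeholder, not a real server).
print(require_database_url({"COCOINDEX_DATABASE_URL": "postgres://localhost/cocoindex"}))
```

With such a helper, the pool could be created as `ConnectionPool(require_database_url())`, so a missing setting surfaces before the first query is run.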
295+
296+
### Step 4.4: Run queries against the index
210297
211-
Now we have the index built. We can run the same Python file without additional arguments, which will run the main function defined in Step 2.3:
298+
Now we can run the same Python file, which will run the new main function:
212299
213300
```bash
214301
python quickstart.py

examples/text_embedding/main.py

Lines changed: 27 additions & 6 deletions
@@ -1,9 +1,11 @@
1+
import os
12
from dotenv import load_dotenv
3+
from psycopg_pool import ConnectionPool
24

35
import cocoindex
46

57
@cocoindex.transform_flow()
6-
def text_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
8+
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
79
"""
810
Embed the text using a SentenceTransformer model.
911
This is a shared logic between indexing and querying, so extract it as a function.
@@ -18,7 +20,7 @@ def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
1820
Define an example flow that embeds text into a vector database.
1921
"""
2022
data_scope["documents"] = flow_builder.add_source(
21-
cocoindex.sources.LocalFile(path="markdown_files"))
23+
cocoindex.sources.LocalFile(path="markdown_files", included_patterns=["*.md"]))
2224

2325
doc_embeddings = data_scope.add_collector()
2426

@@ -41,26 +43,45 @@ def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
4143
field_name="embedding",
4244
metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
4345

44-
query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
46+
# Keep for now to allow CocoInsight to query.
47+
# Will be removed later after we expose `search()` below as a query function (https://github.com/cocoindex-io/cocoindex/issues/502).
48+
cocoindex.query.SimpleSemanticsQueryHandler(
4549
name="SemanticsSearch",
4650
flow=text_embedding_flow,
4751
target_name="doc_embeddings",
4852
query_transform_flow=text_to_embedding,
4953
default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
5054

55+
def search(pool: ConnectionPool, query: str, top_k: int = 5):
56+
table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings")
57+
query_vector = text_to_embedding.eval(query)
58+
with pool.connection() as conn:
59+
with conn.cursor() as cur:
60+
cur.execute(f"""
61+
SELECT filename, location, text, embedding <=> %s::vector AS distance
62+
FROM {table_name}
63+
ORDER BY distance
64+
LIMIT %s
65+
""", (query_vector, top_k))
66+
return [
67+
{"filename": row[0], "location": row[1], "text": row[2], "score": 1.0 - row[3]}
68+
for row in cur.fetchall()
69+
]
70+
5171
@cocoindex.main_fn()
5272
def _run():
73+
pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
5374
# Run queries in a loop to demonstrate the query capabilities.
5475
while True:
5576
try:
5677
query = input("Enter search query (or Enter to quit): ")
5778
if query == '':
5879
break
59-
results, _ = query_handler.search(query, 10)
80+
results = search(pool, query)
6081
print("\nSearch results:")
6182
for result in results:
62-
print(f"[{result.score:.3f}] {result.data['filename']}")
63-
print(f" {result.data['text']}")
83+
print(f"[{result['score']:.3f}] {result['filename']} location:{result['location']}")
84+
print(f" {result['text']}")
6485
print("---")
6586
print()
6687
except KeyboardInterrupt:

examples/text_embedding/pyproject.toml

Lines changed: 5 additions & 1 deletion
@@ -3,7 +3,11 @@ name = "text-embedding"
33
version = "0.1.0"
44
description = "Simple example for cocoindex: build embedding index based on local text files."
55
requires-python = ">=3.10"
6-
dependencies = ["cocoindex>=0.1.39", "python-dotenv>=1.0.1"]
6+
dependencies = [
7+
"cocoindex>=0.1.39",
8+
"python-dotenv>=1.0.1",
9+
"psycopg[binary,pool]",
10+
]
711

812
[tool.setuptools]
913
packages = []
