Commit 8f0281d

update code base embedding with query handling (#522)
1 parent a37eddc commit 8f0281d

2 files changed: +72 -38 lines changed

examples/code_embedding/README.md

Lines changed: 45 additions & 26 deletions
@@ -1,52 +1,71 @@
-# Build embedding index for codebase
+# Build real-time index for codebase
+[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
 
-![Build embedding index for codebase](https://cocoindex.io/blogs/assets/images/cover-9bf0a7cff69b66a40918ab2fc1cea0c7.png)
+CocoIndex provides built-in support for code base chunking, using Tree-sitter to keep syntax boundary. In this example, we will build real-time index for codebase using CocoIndex.
 
-In this example, we will build an embedding index for a codebase using CocoIndex. CocoIndex provides built-in support for code base chunking, with native Tree-sitter support. [Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages.
+We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
 
+![Build embedding index for codebase](https://github.com/user-attachments/assets/6dc5ce89-c949-41d4-852f-ad95af163dbd)
+
+[Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library. It is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages. Check out the list of supported languages [here](https://cocoindex.io/docs/ops/functions#splitrecursively) - in the `language` section.
 
-Please give [Cocoindex on Github](https://github.com/cocoindex-io/cocoindex) a star to support us if you like our work. Thank you so much with a warm coconut hug 🥥🤗. [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
 
 ## Tutorials
-- Blog with step by step tutorial [here](https://cocoindex.io/blogs/index-code-base-for-rag).
-- Video walkthrough [here](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2)
+- Step by step tutorial - Check out the [blog](https://cocoindex.io/blogs/index-code-base-for-rag).
+- Video tutorial - [Youtube](https://youtu.be/G3WstvhHO24?si=Bnxu67Ax5Lv8b-J2).
+
+## Steps
+
+### Indexing Flow
+<p align='center'>
+<img width="434" alt="Screenshot 2025-05-19 at 10 14 36 PM" src="https://github.com/user-attachments/assets/3a506034-698f-480a-b653-22184dae4e14" />
+</p>
+
+1. We will ingest CocoIndex codebase.
+2. For each file, perform chunking (Tree-sitter) and then embedding.
+3. We will save the embeddings and the metadata in Postgres with PGVector.
+
+### Query:
+We will match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow.
 
 
 ## Prerequisite
 [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
 
 ## Run
 
-Install dependencies:
-```bash
-pip install -e .
-```
-
-Setup:
+- Install dependencies:
+```bash
+pip install -e .
+```
 
-```bash
-python main.py cocoindex setup
-```
+- Setup:
 
-Update index:
+```bash
+python main.py cocoindex setup
+```
 
-```bash
-python main.py cocoindex update
-```
+- Update index:
+
+```bash
+python main.py cocoindex update
+```
 
-Run:
+- Run:
 
-```bash
-python main.py
-```
+```bash
+python main.py
+```
 
 ## CocoInsight
-CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9).
-
-Run CocoInsight to understand your RAG data pipeline:
+I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
+It just connects to your local CocoIndex server, with Zero pipeline data retention. Run the following command to start CocoInsight:
 
 ```
 python main.py cocoindex server -ci
 ```
 
 Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
+
+<img width="1305" alt="Chunking Visualization" src="https://github.com/user-attachments/assets/8e83b9a4-2bed-456b-83e5-b5381b28b84a" />
+

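The indexing flow that the README change describes (ingest files, chunk each file, embed each chunk, store the embedding with its metadata) can be sketched in plain Python. This is a toy illustration under stated assumptions, not CocoIndex code: `chunk_text`, `toy_embed`, and `build_index` are hypothetical names, blank-line splitting stands in for Tree-sitter chunking, a character histogram stands in for the SentenceTransformer model, the `location` field is an assumed piece of metadata, and a list of dicts stands in for the Postgres table.

```python
import math


def chunk_text(text: str) -> list[tuple[int, str]]:
    """Hypothetical stand-in for Tree-sitter chunking: split on blank lines.

    Returns (location, chunk) pairs, where location is the chunk's ordinal.
    """
    parts = [p.strip() for p in text.split("\n\n")]
    return list(enumerate(p for p in parts if p))


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for an embedding model.

    Builds a unit-length histogram of character codes bucketed into `dim` bins.
    """
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(files: dict[str, str]) -> list[dict]:
    """Chunk every file, embed every chunk, and collect one row per chunk."""
    rows = []
    for filename, content in files.items():
        for location, code in chunk_text(content):
            rows.append({
                "filename": filename,
                "location": location,
                "code": code,
                "embedding": toy_embed(code),
            })
    return rows


files = {
    "a.py": "def f():\n    return 1\n\ndef g():\n    return 2\n",
    "b.py": "x = 1\n",
}
rows = build_index(files)
for row in rows:
    print(row["filename"], row["location"], repr(row["code"]))
```

Each row carries the chunk text next to its embedding, which is what lets a later query return both the filename and the matching code.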
examples/code_embedding/main.py

Lines changed: 27 additions & 12 deletions
@@ -1,5 +1,5 @@
 from dotenv import load_dotenv
-
+from psycopg_pool import ConnectionPool
 import cocoindex
 import os
 
@@ -8,7 +8,8 @@ def extract_extension(filename: str) -> str:
     """Extract the extension of a filename."""
     return os.path.splitext(filename)[1]
 
-def code_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
+@cocoindex.transform_flow()
+def code_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
     """
     Embed the text using a SentenceTransformer model.
     """
@@ -24,7 +25,7 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
     data_scope["files"] = flow_builder.add_source(
         cocoindex.sources.LocalFile(path="../..",
                                     included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
-                                    excluded_patterns=[".*", "target", "**/node_modules"]))
+                                    excluded_patterns=["**/.*", "target", "**/node_modules"]))
     code_embeddings = data_scope.add_collector()
 
     with data_scope["files"].row() as file:
@@ -47,26 +48,40 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
             metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
 
 
-query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
-    name="SemanticsSearch",
-    flow=code_embedding_flow,
-    target_name="code_embeddings",
-    query_transform_flow=code_to_embedding,
-    default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
+
+def search(pool: ConnectionPool, query: str, top_k: int = 5):
+    # Get the table name, for the export target in the code_embedding_flow above.
+    table_name = cocoindex.utils.get_target_storage_default_name(code_embedding_flow, "code_embeddings")
+    # Evaluate the transform flow defined above with the input query, to get the embedding.
+    query_vector = code_to_embedding.eval(query)
+    # Run the query and get the results.
+    with pool.connection() as conn:
+        with conn.cursor() as cur:
+            cur.execute(f"""
+                SELECT filename, code, embedding <=> %s::vector AS distance
+                FROM {table_name} ORDER BY distance LIMIT %s
+            """, (query_vector, top_k))
+            return [
+                {"filename": row[0], "code": row[1], "score": 1.0 - row[2]}
+                for row in cur.fetchall()
+            ]
 
 @cocoindex.main_fn()
 def _run():
+    # Initialize the database connection pool.
+    pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
     # Run queries in a loop to demonstrate the query capabilities.
     while True:
         try:
             query = input("Enter search query (or Enter to quit): ")
             if query == '':
                 break
-            results, _ = query_handler.search(query, 10)
+            # Run the query function with the database connection pool and the query.
+            results = search(pool, query)
             print("\nSearch results:")
             for result in results:
-                print(f"[{result.score:.3f}] {result.data['filename']}")
-                print(f" {result.data['code']}")
+                print(f"[{result['score']:.3f}] {result['filename']}")
+                print(f" {result['code']}")
                 print("---")
             print()
         except KeyboardInterrupt:

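The new `search` function orders rows by `embedding <=> %s::vector` and reports `1.0 - distance` as the score. pgvector's `<=>` operator is cosine distance, so the score is the cosine similarity. A minimal in-memory sketch of that ranking follows, with no database involved: `cosine_distance`, `rank`, and the sample rows are hypothetical names and data made up for illustration, and zero-length vectors are not handled.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, 1 - cos(a, b), the quantity pgvector's `<=>` returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank(rows: list[dict], query_vector: list[float], top_k: int = 5) -> list[dict]:
    """In-memory analogue of `ORDER BY distance LIMIT top_k`."""
    scored = [
        {
            "filename": r["filename"],
            # Same conversion as the diff: score = 1.0 - distance.
            "score": 1.0 - cosine_distance(r["embedding"], query_vector),
        }
        for r in rows
    ]
    # Smallest distance first is the same as largest score first.
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]


rows = [
    {"filename": "orthogonal.py", "embedding": [0.0, 3.0]},
    {"filename": "same_direction.py", "embedding": [2.0, 0.0]},
    {"filename": "opposite.py", "embedding": [-1.0, 0.0]},
]
results = rank(rows, [1.0, 0.0], top_k=2)
for result in results:
    print(f"[{result['score']:.3f}] {result['filename']}")
```

Because cosine similarity ignores vector length, `same_direction.py` scores highest even though its embedding is twice as long as the query vector. Scores fall between -1 and 1.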