Commit 85b373e

checkpoint - new example docs (#891)
* upgrade docusaurus version * initial checkin
1 parent 102fde5 commit 85b373e

File tree

11 files changed

+1873
-1450
lines changed

Lines changed: 199 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,199 @@
1+
---
2+
title: Codebase Indexing
3+
description: Build a real-time codebase index for retrieval-augmented generation (RAG) using CocoIndex and Tree-sitter. Chunk, embed, and search code with semantic understanding.
4+
sidebar_class_name: hidden
5+
slug: /examples/code_index
6+
canonicalUrl: '/examples/code_index'
7+
---
8+
9+
import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton';
10+
11+
<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/code_embedding"/>
12+
<YouTubeButton url="https://youtu.be/G3WstvhHO24?si=ndYfM0XRs03_hVPR" />
13+
14+
## Setup
15+
16+
If you don't have Postgres installed, follow the [installation guide](https://cocoindex.io/docs/getting_started/installation).
17+
18+
## Add the codebase as a source
19+
20+
Ingest files from the CocoIndex codebase root directory.
21+
22+
```python
@cocoindex.flow_def(name="CodeEmbedding")
def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    """
    Define an example flow that embeds files into a vector database.
    """
    data_scope["files"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="../..",
                                    included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
                                    excluded_patterns=[".*", "target", "**/node_modules"]))
    code_embeddings = data_scope.add_collector()
```
34+
35+
- Include files with the extensions of `.py`, `.rs`, `.toml`, `.md`, `.mdx`
36+
- Exclude files and directories starting with `.`, `target` in the root, and `node_modules` under any directory.
37+
38+
`flow_builder.add_source` creates a table with the sub-fields `filename` and `content`.
39+
See [documentation](https://cocoindex.io/docs/ops/sources) for more details.
40+
41+
42+
## Process each file and collect the information
43+
44+
### Extract the extension of a filename
45+
46+
We need to pass the language (or extension) to Tree-sitter to parse the code.
47+
Let's define a function to extract the extension of a filename while processing each file.
48+
You can find the documentation for custom functions [here](https://cocoindex.io/docs/core/custom_function).
49+
50+
```python
import os

@cocoindex.op.function()
def extract_extension(filename: str) -> str:
    """Extract the extension of a filename."""
    return os.path.splitext(filename)[1]
```
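To see what this function returns for different kinds of filenames, here is a minimal sketch that calls the same `os.path.splitext` logic without the CocoIndex decorator. The sample filenames are made up for illustration.

```python
import os


def extract_extension(filename: str) -> str:
    """Plain-Python copy of the function above, without the CocoIndex decorator."""
    return os.path.splitext(filename)[1]


# Hypothetical filenames, chosen to show the edge cases.
examples = ["src/lib.rs", "docs/index.mdx", "Makefile", ".gitignore", "archive.tar.gz"]
extensions = {name: extract_extension(name) for name in examples}

for name, ext in extensions.items():
    print(f"{name!r} -> {ext!r}")
```

Note that the extension keeps its leading dot (`.rs`, `.mdx`), that files without an extension and dotfiles such as `.gitignore` yield an empty string, and that only the last suffix of `archive.tar.gz` is returned.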
56+
57+
Then we are going to process each file and collect the information.
58+
59+
```python
with data_scope["files"].row() as file:
    file["extension"] = file["filename"].transform(extract_extension)
```
63+
64+
Here we extract the extension of the filename and store it in the `extension` field.
65+
66+
67+
### Split the file into chunks
68+
69+
We will chunk the code with Tree-sitter.
70+
We use the `SplitRecursively` function to split the file into chunks.
71+
It is integrated with Tree-sitter, so you can pass in the language to the `language` parameter.
72+
To see all supported language names and extensions, see the documentation [here](https://cocoindex.io/docs/ops/functions#splitrecursively). All the major languages are supported, e.g., Python, Rust, JavaScript, TypeScript, Java, C++, etc. If it's unspecified or the specified language is not supported, it will be treated as plain text.
73+
74+
```python
with data_scope["files"].row() as file:
    file["chunks"] = file["content"].transform(
        cocoindex.functions.SplitRecursively(),
        language=file["extension"], chunk_size=1000, chunk_overlap=300)
```
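To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sliding-window sketch. It is not how `SplitRecursively` works: that function splits along syntax boundaries found by Tree-sitter, while this hypothetical helper (`split_with_overlap`) cuts at fixed character offsets, and the choice of characters as the unit is an assumption of the sketch.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# With chunk_size=4 and chunk_overlap=2, each chunk repeats the last 2 characters of the previous one.
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means that text near a chunk boundary appears in two chunks, so a match is less likely to be lost because it straddles a cut.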
80+
81+
82+
### Embed the chunks
83+
84+
We use `SentenceTransformerEmbed` to embed the chunks.
85+
You can refer to the documentation [here](https://cocoindex.io/docs/ops/functions#sentencetransformerembed).
86+
87+
```python
@cocoindex.transform_flow()
def code_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
    """
    Embed the text using a SentenceTransformer model.
    """
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"))
```
97+
98+
Then, for each chunk, we embed it using the `code_to_embedding` function and collect the embeddings into the `code_embeddings` collector.
99+
100+
`@cocoindex.transform_flow()` is needed to share the transformation between indexing and querying. Because we build a vector index and query against it,
101+
the embedding computation must be consistent between the two. See the [documentation](https://cocoindex.io/docs/query#transform-flow) for more details.
102+
103+
104+
```python
with data_scope["files"].row() as file:
    with file["chunks"].row() as chunk:
        chunk["embedding"] = chunk["text"].call(code_to_embedding)
        code_embeddings.collect(filename=file["filename"], location=chunk["location"],
                                code=chunk["text"], embedding=chunk["embedding"])
```
111+
112+
113+
### Export the embeddings
114+
115+
Export the embeddings to a table.
116+
117+
```python
code_embeddings.export(
    "code_embeddings",
    cocoindex.storages.Postgres(),
    primary_key_fields=["filename", "location"],
    vector_indexes=[cocoindex.VectorIndex("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```
124+
125+
We use cosine similarity to measure the similarity between the query and the indexed data.
126+
To learn more about cosine similarity, see [Wikipedia](https://en.wikipedia.org/wiki/Cosine_similarity).
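As a reminder of what the metric computes, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a plain-Python illustration with a hypothetical helper name (`cosine_similarity`); it is not part of the CocoIndex API, and in the actual flow the comparison is done by the database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The value depends only on the angle between the vectors, not on their magnitudes, which is why scaling an embedding does not change its similarity to a query.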
127+
128+
## Query the index
129+
We match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow.
130+
131+
```python
from psycopg_pool import ConnectionPool  # connection pool used by search() and main()

def search(pool: ConnectionPool, query: str, top_k: int = 5):
    # Get the table name, for the export target in the code_embedding_flow above.
    table_name = cocoindex.utils.get_target_storage_default_name(code_embedding_flow, "code_embeddings")
    # Evaluate the transform flow defined above with the input query, to get the embedding.
    query_vector = code_to_embedding.eval(query)
    # Run the query and get the results.
    with pool.connection() as conn:
        with conn.cursor() as cur:
            cur.execute(f"""
                SELECT filename, code, embedding <=> %s::vector AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
            """, (query_vector, top_k))
            return [
                {"filename": row[0], "code": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]
```
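The `<=>` operator in the SQL above is pgvector's cosine distance, so the query orders by ascending distance and the code converts each distance to a similarity score with `1.0 - distance`. Here is a minimal sketch of that conversion on made-up rows; the helper name `rows_to_results` and the sample data are hypothetical and only illustrate the row shape `(filename, code, distance)`.

```python
def rows_to_results(rows: list[tuple[str, str, float]]) -> list[dict]:
    """Convert (filename, code, cosine distance) rows into result dicts with a similarity score."""
    return [
        {"filename": filename, "code": code, "score": 1.0 - distance}
        for filename, code, distance in rows
    ]


# Made-up rows, already sorted by ascending distance as the SQL query would return them.
sample_rows = [
    ("a.py", "def f(): ...", 0.25),
    ("b.rs", "fn g() {}", 0.5),
]
results = rows_to_results(sample_rows)
for result in results:
    print(f"[{result['score']:.3f}] {result['filename']}")
```

Because the rows arrive sorted by ascending distance, the resulting scores are in descending order, with the best match first.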
149+
150+
Define a main function to run the query in the terminal.
151+
152+
```python
def main():
    # Initialize the database connection pool.
    pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
    # Run queries in a loop to demonstrate the query capabilities.
    while True:
        try:
            query = input("Enter search query (or Enter to quit): ")
            if query == '':
                break
            # Run the query function with the database connection pool and the query.
            results = search(pool, query)
            print("\nSearch results:")
            for result in results:
                print(f"[{result['score']:.3f}] {result['filename']}")
                print(f"    {result['code']}")
                print("---")
            print()
        except KeyboardInterrupt:
            break

if __name__ == "__main__":
    main()
```
176+
177+
## Run the index setup & update
178+
179+
🎉 Now you are all set!
180+
181+
Run the following command to set up and update the index.
182+
```sh
183+
cocoindex update --setup main.py
184+
```
185+
You'll see the index update status in the terminal.
186+
187+
188+
## Test the query
189+
At this point, you can start the CocoIndex server and develop your RAG runtime against the data. To test your index, run:
190+
191+
``` bash
192+
python main.py
193+
```
194+
195+
When you see the prompt, enter your search query, for example `spec`.
196+
197+
The search results are printed in the terminal.
198+
199+
Each returned entry contains the score (cosine similarity), the filename, and the code snippet that matched.

docs/docs/examples/index.md

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
---
2+
description: Learn to implement real-world solutions with CocoIndex through practical
3+
examples
4+
title: Featured
5+
canonicalUrl: '/examples'
6+
slug: '/examples'
7+
---
8+
9+
import DocCardList from '@theme/DocCardList';
10+
11+
<DocCardList />

docs/docusaurus.config.ts

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -112,7 +112,18 @@ const config: Config = {
112112
target: '_self' // This makes the logo click follow the link in the same window
113113
},
114114
items: [
115-
{ to: '/docs/', label: 'Documentation', position: 'left', target: '_self' },
115+
{
116+
label: 'User guide',
117+
type: 'doc',
118+
docId: 'getting_started/overview',
119+
position: 'left',
120+
},
121+
{
122+
label: 'Examples',
123+
type: 'docSidebar',
124+
sidebarId: 'examples',
125+
position: 'left',
126+
},
116127
{ to: 'https://cocoindex.io/blogs/', label: 'Blog', position: 'left', target: '_self' },
117128
{
118129
type: 'html',

docs/package.json

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -15,23 +15,24 @@
1515
"typecheck": "tsc"
1616
},
1717
"dependencies": {
18-
"@docusaurus/core": "^3.8.0",
19-
"@docusaurus/plugin-client-redirects": "^3.8.0",
20-
"@docusaurus/preset-classic": "^3.8.0",
21-
"@docusaurus/theme-mermaid": "^3.8.0",
18+
"@docusaurus/core": "^3.8.1",
19+
"@docusaurus/plugin-client-redirects": "^3.8.1",
20+
"@docusaurus/preset-classic": "^3.8.1",
21+
"@docusaurus/theme-mermaid": "^3.8.1",
2222
"@mdx-js/react": "^3.0.0",
2323
"clsx": "^2.0.0",
2424
"mixpanel-browser": "^2.59.0",
2525
"posthog-docusaurus": "^2.0.2",
2626
"prism-react-renderer": "^2.4.0",
2727
"react": "^19.0.0",
2828
"react-dom": "^19.0.0",
29+
"react-icons": "^5.5.0",
2930
"react-player": "^2.16.0"
3031
},
3132
"devDependencies": {
32-
"@docusaurus/module-type-aliases": "3.7.0",
33-
"@docusaurus/tsconfig": "3.7.0",
34-
"@docusaurus/types": "3.7.0",
33+
"@docusaurus/module-type-aliases": "3.8.1",
34+
"@docusaurus/tsconfig": "3.8.1",
35+
"@docusaurus/types": "3.8.1",
3536
"typescript": "~5.8.2"
3637
},
3738
"browserslist": {

docs/sidebars.ts

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
import type { SidebarsConfig } from '@docusaurus/plugin-content-docs';
22

33
const sidebars: SidebarsConfig = {
4-
tutorialSidebar: [
4+
docs: [
55
{
66
type: 'category',
77
label: 'Getting Started',
@@ -85,6 +85,20 @@ const sidebars: SidebarsConfig = {
8585
],
8686
},
8787
],
88+
examples: [
89+
{
90+
type: 'category',
91+
label: 'Examples',
92+
collapsed: false,
93+
link: {type: 'doc', id: 'examples/index'},
94+
items: [
95+
{
96+
type: 'autogenerated',
97+
dirName: 'examples/examples',
98+
},
99+
],
100+
},
101+
],
88102
};
89103

90104
export default sidebars;
Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
import type { ReactNode } from 'react';
2+
import { FaGithub, FaYoutube } from 'react-icons/fa';
3+
4+
type ButtonProps = {
5+
href: string;
6+
children: ReactNode;
7+
};
8+
9+
function Button({ href, children }: ButtonProps): ReactNode {
10+
return (
11+
<a
12+
href={href}
13+
target="_blank"
14+
rel="noopener noreferrer"
15+
style={{
16+
display: 'inline-block',
17+
padding: '8px 12px',
18+
borderRadius: '4px',
19+
textDecoration: 'none',
20+
border: '1px solid #ccc',
21+
color: 'var(--ifm-color-default)',
22+
fontSize: '0.85rem',
23+
}}
24+
>
25+
{children}
26+
</a>
27+
);
28+
}
29+
30+
type GitHubButtonProps = {
31+
url: string;
32+
};
33+
34+
function GitHubButton({ url }: GitHubButtonProps): ReactNode {
35+
return (
36+
<Button href={url}>
37+
<FaGithub style={{ marginRight: '8px', verticalAlign: 'middle', fontSize: '1rem' }} />
38+
View on GitHub
39+
</Button>
40+
);
41+
}
42+
43+
type YouTubeButtonProps = {
44+
url: string;
45+
};
46+
47+
function YouTubeButton({ url }: YouTubeButtonProps): ReactNode {
48+
return (
49+
<Button href={url}>
50+
<FaYoutube style={{ marginRight: '8px', verticalAlign: 'middle', fontSize: '1rem' }} />
51+
Watch on YouTube
52+
</Button>
53+
);
54+
}
55+
56+
export { GitHubButton, YouTubeButton };

docs/src/css/custom.css

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -96,6 +96,9 @@
9696
--ifm-h5-font-size: 0.8rem;
9797
}
9898
}
99+
.markdown h1:first-child {
100+
margin-bottom: 20px;
101+
}
99102

100103
.navbar__logo{
101104
height: 24px;
@@ -124,6 +127,8 @@
124127
font-family: 'Questrial', sans-serif;
125128
}
126129

130+
131+
127132
.footer {
128133
padding: 4rem 2rem;
129134
border-top: 1px solid var(--ifm-color-emphasis-100);
@@ -176,6 +181,7 @@
176181
flex-wrap: wrap;
177182
align-items: center;
178183
color: var(--theme-color-text-light);
184+
display: none;
179185

180186
.breadcrumbs__item:first-child {
181187
display: none;
