
Commit 55e14ee

docs: codebase indexing example & documentation visual (#912)
1 parent e248f6b commit 55e14ee

5 files changed

+96
-54
lines changed

docs/docs/examples/examples/codebase_index.md

Lines changed: 70 additions & 48 deletions
Original file line number | Diff line number | Diff line change
@@ -10,25 +10,52 @@ sidebar_custom_props:
1010
tags: [vector-index, codebase]
1111
---
1212

13-
import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton';
13+
import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/components/GitHubButton';
1414

1515
<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/code_embedding"/>
1616
<YouTubeButton url="https://youtu.be/G3WstvhHO24?si=ndYfM0XRs03_hVPR" />
1717

18-
## Setup
18+
## Overview
19+
In this tutorial, we will build a codebase index. [CocoIndex](https://github.com/cocoindex-io/cocoindex) provides built-in support for codebase chunking, with native Tree-sitter support. It works with large codebases and can be updated in near real time with incremental processing: only what has changed is reprocessed.
1920

20-
If you don't have Postgres installed, please follow [installation guide](https://cocoindex.io/docs/getting_started/installation).
21+
## Use Cases
22+
A wide range of applications can be built with an effective codebase index that is always up-to-date. Some examples include:
2123

22-
## Add the codebase as a source.
24+
![Use case illustration](/img/examples/codebase_index/usecase.png)
25+
26+
- Semantic code context for AI coding agents like Claude, Codex, Gemini CLI.
27+
- MCP for code editors such as Cursor, Windsurf, and VSCode.
28+
- Context-aware code search applications—semantic code search, natural language code retrieval.
29+
- Context for code review agents—AI code review, automated code analysis, code quality checks, pull request summarization.
30+
- Automated code refactoring, large-scale code migration.
31+
- Enhance SRE workflows: enable rapid root cause analysis, incident response, and change impact assessment by indexing infrastructure-as-code, deployment scripts, and config files for semantic search and lineage tracking.
32+
- Automatically generate design documentation from code—keep design docs up-to-date.
33+
34+
## Flow Overview
35+
36+
![Flow Overview](/img/examples/codebase_index/flow.png)
37+
38+
The flow is composed of the following steps:
39+
40+
- Read code files from the local filesystem
41+
- Extract file extensions, to get the language of the code for Tree-sitter to parse
42+
- Split code into semantic chunks using Tree-sitter
43+
- Generate embeddings for each chunk
44+
- Store in a vector database for retrieval
2345

24-
Ingest files from the CocoIndex codebase root directory.
46+
## Setup
47+
- Install Postgres by following the [installation guide](https://cocoindex.io/docs/getting_started/installation#-install-postgres).
48+
- Install CocoIndex
49+
```bash
50+
pip install -U cocoindex
51+
```
52+
53+
## Add the codebase as a source.
54+
We will index the CocoIndex codebase. Here we use the `LocalFile` source to ingest files from the CocoIndex codebase root directory.
2555

2656
```python
2757
@cocoindex.flow_def(name="CodeEmbedding")
2858
def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
29-
"""
30-
Define an example flow that embeds files into a vector database.
31-
"""
3259
data_scope["files"] = flow_builder.add_source(
3360
cocoindex.sources.LocalFile(path="../..",
3461
included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
@@ -40,16 +67,15 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
4067
- Exclude files and directories starting with `.`, `target` in the root, and `node_modules` under any directory.
4168

4269
`flow_builder.add_source` will create a table with sub fields (`filename`, `content`).
43-
See [documentation](https://cocoindex.io/docs/ops/sources) for more details.
70+
<DocumentationButton href="https://cocoindex.io/docs/ops/sources" text="Sources" />
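To make the effect of `included_patterns` concrete, here is a minimal sketch using Python's standard `fnmatch` module. It only approximates glob matching on the file's base name; the matcher that `LocalFile` actually uses is not shown in this diff, and `is_included` is a hypothetical helper, not part of CocoIndex.

```python
import fnmatch
import os

# Patterns copied from the `included_patterns` argument in the flow definition.
INCLUDED_PATTERNS = ["*.py", "*.rs", "*.toml", "*.md", "*.mdx"]


def is_included(path: str, patterns: list[str] = INCLUDED_PATTERNS) -> bool:
    """Hypothetical helper: test a file's base name against glob patterns.

    Assumption: patterns are matched against the base name only. The real
    source may apply different rules (for example, matching the full path).
    """
    name = os.path.basename(path)
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)


if __name__ == "__main__":
    for candidate in ["src/lib.rs", "docs/intro.mdx", "package.json"]:
        print(candidate, is_included(candidate))
```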
4471

4572

46-
## Process each file and collect the information.
73+
## Process each file and collect the information
4774

48-
### Extract the extension of a filename
75+
### Extract the extension of a filename
4976

5077
We need to pass the language (or extension) to Tree-sitter to parse the code.
5178
Let's define a function to extract the extension of a filename while processing each file.
52-
You can find the documentation for custom function [here](https://cocoindex.io/docs/core/custom_function).
5379

5480
```python
5581
@cocoindex.op.function()
@@ -58,52 +84,43 @@ def extract_extension(filename: str) -> str:
5884
return os.path.splitext(filename)[1]
5985
```
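As a quick check of what `os.path.splitext` returns, the sketch below runs the same expression on a few sample names, without the `@cocoindex.op.function()` decorator. The file names are made up for illustration. Note that the returned extension keeps its leading dot, and that names without an extension (or dotfiles) yield an empty string.

```python
import os


def extension_of(filename: str) -> str:
    """Same expression as extract_extension, without the CocoIndex decorator."""
    return os.path.splitext(filename)[1]


if __name__ == "__main__":
    # Illustrative file names only; they are not taken from the repository.
    for name in ["src/lib.rs", "flow.py", "Makefile", ".gitignore", "archive.tar.gz"]:
        print(f"{name!r} -> {extension_of(name)!r}")
```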
6086

61-
Then we are going to process each file and collect the information.
62-
63-
```python
64-
with data_scope["files"].row() as file:
65-
file["extension"] = file["filename"].transform(extract_extension)
66-
```
67-
68-
Here we extract the extension of the filename and store it in the `extension` field.
69-
87+
<DocumentationButton href="https://cocoindex.io/docs/custom_ops/custom_functions" text="Custom Function" margin="0 0 16px 0" />
7088

7189
### Split the file into chunks
72-
73-
We will chunk the code with Tree-sitter.
74-
We use the `SplitRecursively` function to split the file into chunks.
75-
It is integrated with Tree-sitter, so you can pass in the language to the `language` parameter.
76-
To see all supported language names and extensions, see the documentation [here](https://cocoindex.io/docs/ops/functions#splitrecursively). All the major languages are supported, e.g., Python, Rust, JavaScript, TypeScript, Java, C++, etc. If it's unspecified or the specified language is not supported, it will be treated as plain text.
90+
We use the `SplitRecursively` function to split the file into chunks. `SplitRecursively` is a CocoIndex building block with native Tree-sitter integration. When processing code, pass the language in the `language` parameter.
7791

7892
```python
7993
with data_scope["files"].row() as file:
94+
# Extract the extension of the filename.
95+
file["extension"] = file["filename"].transform(extract_extension)
8096
file["chunks"] = file["content"].transform(
8197
cocoindex.functions.SplitRecursively(),
8298
language=file["extension"], chunk_size=1000, chunk_overlap=300)
8399
```
100+
<DocumentationButton href="https://cocoindex.io/docs/ops/functions#splitrecursively" text="SplitRecursively" margin="0 0 16px 0" />
84101

102+
![SplitRecursively](/img/examples/codebase_index/chunk.png)
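To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a deliberately naive sliding-window splitter. It is not how `SplitRecursively` works: the real function splits along syntax boundaries found by Tree-sitter, whereas this sketch cuts at fixed offsets. It also assumes sizes are counted in characters, which this page does not state; `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 300) -> list[str]:
    """Naive fixed-window chunking; each chunk repeats the tail of the previous one.

    Assumption: sizes are measured in characters. This ignores syntax entirely.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(2000))
    pieces = split_with_overlap(sample)
    print([len(piece) for piece in pieces])
```

With the values used in the flow, consecutive windows start 700 characters apart, so neighbouring chunks share 300 characters of context.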
85103

86104
### Embed the chunks
87-
88105
We use `SentenceTransformerEmbed` to embed the chunks.
89-
You can refer to the documentation [here](https://cocoindex.io/docs/ops/functions#sentencetransformerembed).
90106

91107
```python
92108
@cocoindex.transform_flow()
93109
def code_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
94-
"""
95-
Embed the text using a SentenceTransformer model.
96-
"""
97110
return text.transform(
98111
cocoindex.functions.SentenceTransformerEmbed(
99112
model="sentence-transformers/all-MiniLM-L6-v2"))
100113
```
101114

102-
Then for each chunk, we will embed it using the `code_to_embedding` function. and collect the embeddings to the `code_embeddings` collector.
115+
<DocumentationButton href="https://cocoindex.io/docs/ops/functions#sentencetransformerembed" text="SentenceTransformerEmbed" margin="0 0 16px 0" />
116+
117+
:::tip
118+
`@cocoindex.transform_flow()` is needed to share the transformation across indexing and query. When building a vector index and querying against it, the embedding computation must remain consistent between indexing and querying.
119+
:::
103120

104-
`@cocoindex.transform_flow()` is needed to share the transformation across indexing and query. We build a vector index and query against it,
105-
the embedding computation needs to be consistent between indexing and querying. See [documentation](https://cocoindex.io/docs/query#transform-flow) for more details.
121+
<DocumentationButton href="https://cocoindex.io/docs/query#transform-flow" text="Transform Flow" margin="0 0 16px 0" />
106122

123+
Then for each chunk, we will embed it using the `code_to_embedding` function, and collect the embeddings to the `code_embeddings` collector.
107124

108125
```python
109126
with data_scope["files"].row() as file:
@@ -113,10 +130,7 @@ with data_scope["files"].row() as file:
113130
code=chunk["text"], embedding=chunk["embedding"])
114131
```
115132

116-
117-
### 2.4 Collect the embeddings
118-
119-
Export the embeddings to a table.
133+
### Export the embeddings
120134

121135
```python
122136
code_embeddings.export(
@@ -126,8 +140,7 @@ code_embeddings.export(
126140
vector_indexes=[cocoindex.VectorIndex("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
127141
```
128142

129-
We use Consine Similarity to measure the similarity between the query and the indexed data.
130-
To learn more about Consine Similarity, see [Wiki](https://en.wikipedia.org/wiki/Cosine_similarity).
143+
We use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to measure the similarity between the query and the indexed data.
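For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following plain-Python sketch shows the formula; it is only an illustration of the metric and does not reflect how the vector index computes it internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

The score ranges from -1 (opposite direction) to 1 (same direction), and it depends only on direction, not on vector length.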
131144

132145
## Query the index
133146
We match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow.
@@ -180,13 +193,16 @@ if __name__ == "__main__":
180193

181194
## Run the index setup & update
182195

183-
🎉 Now you are all set!
196+
- Install dependencies
197+
```bash
198+
pip install -e .
199+
```
184200

185-
Run following command to setup and update the index.
186-
```sh
187-
cocoindex update --setup main.py
188-
```
189-
You'll see the index updates state in the terminal
201+
- Setup and update the index
202+
```sh
203+
cocoindex update --setup main.py
204+
```
205+
You'll see the index updates state in the terminal
190206
191207
192208
## Test the query
@@ -197,7 +213,13 @@ python main.py
197213
```
198214
199215
When you see the prompt, you can enter your search query, for example: `spec`.
216+
Each returned entry contains a score (cosine similarity), the filename, and the matched code snippet.
200217
201-
You can find the search results in the terminal
218+
## CocoInsight
219+
To get a better understanding of the indexing flow, you can use CocoInsight to help the development step by step.
220+
It is easy to spin up:
202221
203-
The returned results - each entry contains score (Cosine Similarity), filename, and the code snippet that get matched.
222+
```
223+
cocoindex server main.py -ci
224+
```
225+
Follow the URL printed in the terminal ("Open CocoInsight at: ...") to open CocoInsight.

docs/src/components/GitHubButton/index.tsx

Lines changed: 26 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -1,12 +1,14 @@
11
import type { ReactNode } from 'react';
22
import { FaGithub, FaYoutube } from 'react-icons/fa';
3+
import { MdMenuBook } from 'react-icons/md';
34

45
type ButtonProps = {
56
href: string;
67
children: ReactNode;
8+
margin?: string;
79
};
810

9-
function Button({ href, children }: ButtonProps): ReactNode {
11+
function Button({ href, children, margin = '0' }: ButtonProps): ReactNode {
1012
return (
1113
<a
1214
href={href}
@@ -15,6 +17,7 @@ function Button({ href, children }: ButtonProps): ReactNode {
1517
style={{
1618
display: 'inline-block',
1719
padding: '8px 12px',
20+
margin: margin,
1821
borderRadius: '4px',
1922
textDecoration: 'none',
2023
border: '1px solid #ccc',
@@ -29,11 +32,12 @@ function Button({ href, children }: ButtonProps): ReactNode {
2932

3033
type GitHubButtonProps = {
3134
url: string;
35+
margin?: string;
3236
};
3337

34-
function GitHubButton({ url }: GitHubButtonProps): ReactNode {
38+
function GitHubButton({ url, margin }: GitHubButtonProps): ReactNode {
3539
return (
36-
<Button href={url}>
40+
<Button href={url} margin={margin}>
3741
<FaGithub style={{ marginRight: '8px', verticalAlign: 'middle', fontSize: '1rem' }} />
3842
View on GitHub
3943
</Button>
@@ -42,15 +46,31 @@ function GitHubButton({ url }: GitHubButtonProps): ReactNode {
4246

4347
type YouTubeButtonProps = {
4448
url: string;
49+
margin?: string;
4550
};
4651

47-
function YouTubeButton({ url }: YouTubeButtonProps): ReactNode {
52+
function YouTubeButton({ url, margin }: YouTubeButtonProps): ReactNode {
4853
return (
49-
<Button href={url}>
54+
<Button href={url} margin={margin}>
5055
<FaYoutube style={{ marginRight: '8px', verticalAlign: 'middle', fontSize: '1rem' }} />
5156
Watch on YouTube
5257
</Button>
5358
);
5459
}
5560

56-
export { GitHubButton, YouTubeButton };
61+
type DocumentationButtonProps = {
62+
href: string;
63+
text: string;
64+
margin?: string;
65+
};
66+
67+
function DocumentationButton({ href, text, margin }: DocumentationButtonProps): ReactNode {
68+
return (
69+
<Button href={href} margin={margin}>
70+
<MdMenuBook style={{ marginRight: '8px', verticalAlign: 'middle', fontSize: '1rem' }} />
71+
{text}
72+
</Button>
73+
);
74+
}
75+
76+
export { GitHubButton, YouTubeButton, DocumentationButton };
Three binary files changed (99.4 KB, 35.6 KB, 58.7 KB); contents not shown.
