| 1 | +--- |
| 2 | +title: "Semantic Code Search" |
| 3 | +sidebarTitle: "Semantic Code Search" |
| 4 | +icon: "magnifying-glass" |
| 5 | +iconType: "solid" |
| 6 | +--- |
| 7 | + |
| 8 | +Codegen's `VectorIndex` enables semantic code search using embeddings. You can search a codebase with natural-language queries and find semantically related code, even when the exact terms aren't present. |
| 9 | + |
| 10 | +<Warning>This feature is under active development. Interested in an application? [Reach out to the team!](/introduction/about.tsx)</Warning> |
| 11 | + |
| 12 | +## Basic Usage |
| 13 | + |
| 14 | +Create and save a vector index for your codebase: |
| 15 | + |
| 16 | +```python |
| 17 | +from codegen.extensions import VectorIndex |
| 18 | + |
| 19 | +# Initialize with your codebase (an existing `Codebase` instance) |
| 20 | +index = VectorIndex(codebase) |
| 21 | + |
| 22 | +# Create embeddings for all files |
| 23 | +index.create() |
| 24 | + |
| 25 | +# Save to disk (defaults to .codegen/vector_index.pkl) |
| 26 | +index.save() |
| 27 | +``` |
| 28 | + |
| 29 | +Later, load the index and perform semantic searches: |
| 30 | + |
| 31 | +```python |
|  | +from codegen import Codebase |
|  | +from codegen.extensions import VectorIndex |
|  | + |
| 32 | +# Create a codebase |
| 33 | +codebase = Codebase.from_repo('fastapi/fastapi') |
| 34 | + |
| 35 | +# Load a previously created index |
| 36 | +index = VectorIndex(codebase) |
| 37 | +index.load() |
| 38 | + |
| 39 | +# Search with natural language |
| 40 | +results = index.similarity_search( |
| 41 | + "How does FastAPI handle dependency injection?", |
| 42 | + k=5 # number of results |
| 43 | +) |
| 44 | + |
| 45 | +# Print results with previews |
| 46 | +for filepath, score in results: |
| 47 | + print(f"\nScore: {score:.3f} | File: {filepath}") |
| 48 | + file = codebase.get_file(filepath) |
| 49 | + print(f"Preview: {file.content[:200]}...") |
| 50 | +``` |
| 51 | + |
| 52 | +<Note> |
| 53 | +The search uses cosine similarity between embeddings to find the most semantically related files, regardless of exact keyword matches. |
| 54 | +</Note> |
| 55 | + |
| 56 | +## Getting Embeddings |
| 57 | + |
| 58 | +You can also get embeddings for arbitrary text using the same model: |
| 59 | + |
| 60 | +```python |
| 61 | +# Get embeddings for a list of texts |
| 62 | +texts = [ |
| 63 | + "Some code or text to embed", |
| 64 | + "Another piece of text" |
| 65 | +] |
| 66 | +embeddings = index.get_embeddings(texts) # shape: (n_texts, embedding_dim) |
| 67 | +``` |
| 68 | + |
| 69 | +## How It Works |
| 70 | + |
| 71 | +The `VectorIndex` class: |
| 72 | +1. Processes each file in your codebase |
| 73 | +2. Splits large files into chunks that fit within token limits |
| 74 | +3. Uses OpenAI's `text-embedding-3-small` model to create embeddings |
| 75 | +4. Stores embeddings in a numpy array for efficient similarity search |
| 76 | +5. Saves the index to disk for reuse |
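
The chunking step above can be illustrated with a minimal sketch. This is not `VectorIndex`'s actual implementation: the function name `chunk_text` and the `max_chars` parameter are hypothetical, and character count is used as a rough stand-in for a real token count.

```python
def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: characters approximate tokens here; a real
    implementation would count tokens with the model's tokenizer.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for line in text.splitlines(keepends=True):
        # Hard-split any single line that is longer than the limit.
        while len(line) > max_chars:
            if current:
                chunks.append("".join(current))
                current, current_len = [], 0
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the limit.
        if current and current_len + len(line) > max_chars:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Breaking on line boundaries keeps statements intact where possible, and joining the chunks reproduces the original text exactly.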
| 77 | + |
| 78 | +When searching: |
| 79 | +1. Your query is converted to an embedding using the same model |
| 80 | +2. Cosine similarity is computed between the query and all file embeddings |
| 81 | +3. The most similar files are returned, along with their similarity scores |
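
The search steps above amount to a cosine-similarity ranking. The sketch below uses plain Python lists so it runs without dependencies, whereas the index itself stores embeddings in a numpy array. The names `cosine_similarity` and `top_k` are illustrative and not part of the Codegen API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float],
    file_vecs: dict[str, list[float]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Rank files by similarity to the query and keep the k best (filepath, score) pairs."""
    scored = [(path, cosine_similarity(query_vec, vec)) for path, vec in file_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

The return shape mirrors the `(filepath, score)` pairs that `similarity_search` yields in the examples on this page.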
| 82 | + |
| 83 | +<Warning> |
| 84 | +Creating embeddings requires an OpenAI API key with access to the embeddings endpoint. |
| 85 | +</Warning> |
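
A small pre-flight check can fail fast before any embedding requests are made. This sketch assumes the key is supplied through the `OPENAI_API_KEY` environment variable, which the OpenAI Python client reads by default. Whether `VectorIndex` picks it up the same way is an assumption, and `require_openai_key` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def require_openai_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OpenAI API key from the environment, or raise if it is missing.

    Assumption: the key lives in OPENAI_API_KEY. Pass `env` to check a
    mapping other than os.environ (useful in tests).
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; embeddings cannot be created.")
    return key
```

Calling `require_openai_key()` before `index.create()` surfaces a missing key immediately rather than partway through indexing.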
| 86 | + |
| 87 | +## Example Searches |
| 88 | + |
| 89 | +The following queries illustrate the kinds of questions semantic search can answer: |
| 90 | + |
| 91 | +```python |
| 92 | +# Find authentication-related code |
| 93 | +results = index.similarity_search( |
| 94 | + "How is user authentication implemented?", |
| 95 | + k=3 |
| 96 | +) |
| 97 | + |
| 98 | +# Find error handling patterns |
| 99 | +results = index.similarity_search( |
| 100 | + "Show me examples of error handling and custom exceptions", |
| 101 | + k=3 |
| 102 | +) |
| 103 | + |
| 104 | +# Find configuration management |
| 105 | +results = index.similarity_search( |
| 106 | + "Where is the application configuration and settings handled?", |
| 107 | + k=3 |
| 108 | +) |
| 109 | +``` |
| 110 | + |
| 111 | +Because matching is based on embeddings rather than keywords, these queries can return relevant files even when the exact terms don't appear in the code. |
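
When running several such queries, it can be convenient to batch them. The helper below is a hypothetical sketch, not part of Codegen. It accepts any callable with the `search(query, k=...)` shape used by `similarity_search` in the examples above.

```python
from typing import Callable

# Any callable that takes a query string plus a `k` keyword argument
# and returns a list of (filepath, score) pairs.
SearchFn = Callable[..., list[tuple[str, float]]]


def run_queries(
    search: SearchFn,
    queries: list[str],
    k: int = 3,
) -> dict[str, list[tuple[str, float]]]:
    """Run each query through `search` and collect the results keyed by query text."""
    return {query: search(query, k=k) for query in queries}
```

For example, `run_queries(index.similarity_search, ["How is user authentication implemented?"], k=3)` would return a dictionary mapping that query to its top three `(filepath, score)` pairs.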