semantic-code-index is designed to provide instant, context-aware, and semantically rich code search for developers.
- Python 3.8+
- pip
- git
- Docker (optional, for running the service)
Clone this repository using:
git clone git@github.com:DineshKuppan/semantic-code-index.git
cd semantic-code-index
Create a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate
Install the required Python packages:
pip install -r requirements.txt
To use semantic-code-index, you can run it as a service, use the CLI, or import it directly in your Python scripts.
You can run the service using Docker or directly with Python.
docker build -t semantic-code-index .
docker run -p 8000:8000 semantic-code-index
To run it directly with Python instead:
python -m semantic_code_index
The project includes a refactored CodeBERT indexer that provides a command-line interface for indexing and searching codebases:
# Index a codebase
python cli.py --scan /path/to/codebase --index-dir ./code_index --stats
# Search for similar code
python cli.py --search "def calculate_distance(point1, point2):" --top-k 5
# Use Vespa.ai vector database for storing and retrieving embeddings
python cli.py --scan /path/to/codebase --vespa
# Search using Vespa.ai
python cli.py --search "def calculate_distance(point1, point2):" --vespa --top-k 5
# Connect to an existing Vespa endpoint
python cli.py --search "def process_data" --vespa --vespa-endpoint http://localhost:8080
You can import and use the semantic_code_index module in your Python scripts:
from semantic_code_index import SemanticCodeIndex
index = SemanticCodeIndex()
index.index_codebase('/path/to/your/codebase')
results = index.search('your search query')
print(results)
- Semantic Code Search: Search your codebase using natural language queries and retrieve relevant code snippets based on meaning and intent.
- Instant Indexing: Rapidly indexes your entire codebase, enabling immediate search functionality after setup.
- Context-Aware Results: Delivers search results that understand the structure, logic, and relationships within your code, providing contextually accurate matches.
- Vector Embedding: Generates vector representations of code snippets, enabling similarity-based search and retrieval.
- Multi-language Support: Works across multiple programming languages, supporting diverse codebases.
- Synonym and Intent Recognition: Captures synonyms and understands the intent behind queries, expanding the scope of searchable information.
- Related Code Discovery: Identifies and surfaces related code assets, helping you explore connected implementations and patterns.
- Privacy and Security: Indexing and search run locally by default, so your code stays on your machine unless you point the tool at a remote Vespa endpoint.
- Fine-Grained Access Control: Honors repository permissions and user roles, showing only code you are authorized to access.
- Seamless Integration: Easily integrates with your existing development workflow and tools.
- Fast and Accurate Retrieval: Provides near-instantaneous search results, even in large codebases, using efficient vector similarity search.
- Continuous Index Updates: Automatically updates the semantic index as your codebase evolves, ensuring results are always up to date.
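
The vector-embedding and similarity-search features listed above can be illustrated with a small, standalone sketch. It is not the project's implementation: the function names `cosine_similarity` and `top_k`, the snippet ids, and the three-dimensional toy vectors are all assumptions made for illustration, and cosine similarity is used here as one common choice of similarity measure.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query: List[float], index: Dict[str, List[float]], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k (snippet id, score) pairs closest to the query, best first."""
    scored = [(snippet_id, cosine_similarity(query, vec)) for snippet_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real CodeBERT embeddings have hundreds of dimensions.
toy_index = {
    "geometry.py::calculate_distance": [0.9, 0.1, 0.0],
    "io.py::read_file": [0.0, 0.2, 0.9],
    "geometry.py::midpoint": [0.8, 0.3, 0.1],
}
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
```

The brute-force scan is fine for a toy index. For large codebases, an approximate nearest-neighbor engine such as the Vespa integration described in this README is the more practical option.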
The codebase has been refactored into a modular structure:
- models/ - Data models
  - code_models.py - Contains the CodeFile and CodeEmbedding dataclasses
- parsers/ - Code parsing utilities
  - code_parser.py - Contains the CodeParser class for language detection and code analysis
- indexers/ - Embedding and indexing functionality
  - codebert_indexer.py - Contains the CodeBERTIndexer class for generating embeddings
  - vespa_embedding_store.py - Contains the VespaEmbeddingStore class for Vespa.ai integration
- cli.py - Command-line interface
- codebert_indexer.py - Main entry point
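
To give a feel for what the data models might hold, here is a minimal sketch of two dataclasses in the spirit of CodeFile and CodeEmbedding. Only the class names come from this README. Every field (`path`, `language`, `content`, `file_path`, `vector`) and the `dimension` property are assumptions, and the real definitions in models/code_models.py may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodeFile:
    # Field names are hypothetical; check models/code_models.py for the real ones.
    path: str
    language: str
    content: str


@dataclass
class CodeEmbedding:
    # Links an embedding vector back to the file it was computed from (assumed layout).
    file_path: str
    vector: List[float] = field(default_factory=list)

    @property
    def dimension(self) -> int:
        """Number of components in the embedding vector."""
        return len(self.vector)


source = CodeFile(path="geometry.py", language="python", content="def midpoint(a, b): ...")
embedding = CodeEmbedding(file_path=source.path, vector=[0.8, 0.3, 0.1])
print(source.language, embedding.dimension)
```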
The system now integrates with the Vespa.ai vector database for efficient storage and retrieval of code embeddings:
- Vespa.ai Integration: Store and search code embeddings using Vespa's vector similarity search capabilities
- Docker Deployment: Automatically deploy Vespa in a Docker container for development and testing
- Efficient Similarity Search: Leverage Vespa's fast nearest neighbor search algorithms for finding similar code snippets
- Backward Compatibility: Fall back to file-based storage when Vespa is not available
- Flexible Deployment: Connect to an existing Vespa endpoint or deploy a new instance
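
The fallback behavior described above could look roughly like the following sketch. It is not taken from the project. The helper names `vespa_is_reachable` and `choose_backend` are hypothetical, and the sketch assumes the endpoint exposes Vespa's `/state/v1/health` path. The probe is injectable, so the selection logic can be exercised without a running Vespa instance.

```python
import http.client
import urllib.request
from typing import Callable


def vespa_is_reachable(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers its health check with HTTP 200.

    Assumption: the endpoint serves Vespa's /state/v1/health path.
    """
    url = endpoint.rstrip("/") + "/state/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Covers connection failures, timeouts, HTTP errors, and malformed URLs.
        return False


def choose_backend(endpoint: str, probe: Callable[[str], bool] = vespa_is_reachable) -> str:
    """Pick 'vespa' when the probe succeeds, otherwise fall back to 'file'."""
    return "vespa" if probe(endpoint) else "file"


# Simulate an unavailable Vespa instance with a stub probe.
print(choose_backend("http://localhost:8080", probe=lambda _endpoint: False))  # prints "file"
```

Passing the probe as a parameter keeps the network call out of the decision logic, which makes the fallback path easy to test.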