semantic-code-index

semantic-code-index provides instant, context-aware, semantically rich code search for developers.

Prerequisites

  • Python 3.8+
  • pip
  • git
  • Docker (optional, for running the service)

Cloning the Repository

Clone this repository using:

git clone git@github.com:DineshKuppan/semantic-code-index.git
cd semantic-code-index

Setting Up the Environment

Create a virtual environment (optional but recommended):

python -m venv venv
source venv/bin/activate

Install the required Python packages:

pip install -r requirements.txt

Usage

You can use semantic-code-index in three ways: run it as a service, use the CLI, or import it directly in your Python scripts.

Running as a Service

You can run the service using Docker or directly with Python.

Using Docker

docker build -t semantic-code-index .
docker run -p 8000:8000 semantic-code-index

Using Python

python -m semantic_code_index

Using the CLI (CodeBERT Indexer)

The project includes a refactored CodeBERT indexer that provides a command-line interface for indexing and searching codebases:

# Index a codebase
python cli.py --scan /path/to/codebase --index-dir ./code_index --stats

# Search for similar code
python cli.py --search "def calculate_distance(point1, point2):" --top-k 5

# Use Vespa.ai vector database for storing and retrieving embeddings
python cli.py --scan /path/to/codebase --vespa

# Search using Vespa.ai
python cli.py --search "def calculate_distance(point1, point2):" --vespa --top-k 5

# Connect to an existing Vespa endpoint
python cli.py --search "def process_data" --vespa --vespa-endpoint http://localhost:8080
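
For readers who want to see how the flags above fit together, here is a minimal argparse sketch. It is not the project's actual cli.py: the flag names come from the examples above, while the defaults, help strings, and the build_parser name are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering the flags used in the README examples."""
    parser = argparse.ArgumentParser(prog="cli.py")
    parser.add_argument("--scan", metavar="PATH",
                        help="directory of the codebase to index")
    parser.add_argument("--search", metavar="QUERY",
                        help="code snippet or text to search for")
    # Default values below are assumptions, not taken from the project.
    parser.add_argument("--index-dir", default="./code_index",
                        help="where the file-based index is stored")
    parser.add_argument("--top-k", type=int, default=5,
                        help="number of results to return")
    parser.add_argument("--stats", action="store_true",
                        help="print indexing statistics")
    parser.add_argument("--vespa", action="store_true",
                        help="store and query embeddings in Vespa")
    parser.add_argument("--vespa-endpoint", metavar="URL",
                        help="URL of an existing Vespa endpoint")
    return parser


# Parse the same arguments as the last example above.
args = build_parser().parse_args(
    ["--search", "def process_data", "--vespa",
     "--vespa-endpoint", "http://localhost:8080"]
)
print(args.search, args.vespa, args.vespa_endpoint, args.top_k)
```

In this sketch --stats and --vespa are boolean switches, while the remaining flags take a value.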

Using in Python Scripts

You can import and use the semantic_code_index module in your Python scripts:

from semantic_code_index import SemanticCodeIndex
index = SemanticCodeIndex()

index.index_codebase('/path/to/your/codebase')
results = index.search('your search query')
print(results)

Features

  • Semantic Code Search: Search your codebase using natural language queries and retrieve relevant code snippets based on meaning and intent.
  • Instant Indexing: Rapidly indexes your entire codebase, enabling immediate search functionality after setup.
  • Context-Aware Results: Delivers search results that understand the structure, logic, and relationships within your code, providing contextually accurate matches.
  • Vector Embedding: Generates vector representations of code snippets, enabling similarity-based search and retrieval.
  • Multi-language Support: Works across multiple programming languages, supporting diverse codebases.
  • Synonym and Intent Recognition: Captures synonyms and understands the intent behind queries, expanding the scope of searchable information.
  • Related Code Discovery: Identifies and surfaces related code assets, helping you explore connected implementations and patterns.
  • Privacy and Security: All indexing and search operations are performed locally, ensuring your code never leaves your machine.
  • Fine-Grained Access Control: Honors repository permissions and user roles, showing only code you are authorized to access.
  • Seamless Integration: Easily integrates with your existing development workflow and tools.
  • Fast and Accurate Retrieval: Provides near-instantaneous search results, even in large codebases, using efficient vector similarity search.
  • Continuous Index Updates: Automatically updates the semantic index as your codebase evolves, ensuring results are always up to date.
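
The vector embedding and similarity search features above can be illustrated with a small, dependency-free sketch. It ranks a toy index by cosine similarity. The three-dimensional vectors, the snippet labels, and the function names are made up for illustration. A real index would hold model-generated embeddings (for example from CodeBERT) and would probably use an optimized nearest-neighbor search rather than this linear scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          index: Dict[str, Sequence[float]],
          k: int = 5) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings: hypothetical values chosen only to make the ranking visible.
toy_index = {
    "geometry.py::distance": [0.9, 0.1, 0.0],
    "io.py::read_file": [0.0, 0.2, 0.9],
    "geometry.py::midpoint": [0.8, 0.3, 0.1],
}

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The two geometry snippets rank ahead of the I/O snippet because their vectors point in nearly the same direction as the query.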

Project Structure

The codebase has been refactored into a modular structure:

  • models/ - Data models
    • code_models.py - Contains the CodeFile and CodeEmbedding dataclasses
  • parsers/ - Code parsing utilities
    • code_parser.py - Contains the CodeParser class for language detection and code analysis
  • indexers/ - Embedding and indexing functionality
    • codebert_indexer.py - Contains the CodeBERTIndexer class for generating embeddings
    • vespa_embedding_store.py - Contains the VespaEmbeddingStore class for Vespa.ai integration
  • cli.py - Command-line interface
  • codebert_indexer.py - Main entry point
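
To make the models/ layer more concrete, here is a rough sketch of what the two dataclasses might look like. Only the class names CodeFile and CodeEmbedding come from the structure above. Every field name and type is an assumption, so check models/code_models.py for the real definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodeFile:
    """A source file picked up during a scan (all fields are hypothetical)."""
    path: str
    language: str
    content: str


@dataclass
class CodeEmbedding:
    """An embedding vector tied to the file it was computed from (hypothetical fields)."""
    file_path: str
    vector: List[float] = field(default_factory=list)


# Example values are placeholders, not real model output.
source = CodeFile(path="geometry.py", language="python",
                  content="def distance(a, b): ...")
embedding = CodeEmbedding(file_path=source.path, vector=[0.12, -0.53, 0.08])
print(embedding)
```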

Vector Database Integration

The system integrates with the Vespa.ai vector database for efficient storage and retrieval of code embeddings:

  • Vespa.ai Integration: Store and search code embeddings using Vespa's vector similarity search capabilities
  • Docker Deployment: Automatically deploy Vespa in a Docker container for development and testing
  • Efficient Similarity Search: Leverage Vespa's fast nearest neighbor search algorithms for finding similar code snippets
  • Backward Compatibility: Fall back to file-based storage when Vespa is not available
  • Flexible Deployment: Connect to an existing Vespa endpoint or deploy a new instance
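
The backward-compatibility behavior above can be sketched as a health probe followed by a backend choice. This is an assumed design, not the project's VespaEmbeddingStore: the function names are hypothetical, and the probe relies on Vespa's /state/v1/health endpoint being reachable at the given URL.

```python
import urllib.error
import urllib.request
from typing import Callable


def vespa_is_up(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers its health check with HTTP 200."""
    try:
        url = f"{endpoint}/state/v1/health"
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, bad URL, or an HTTP error status.
        return False


def select_backend(endpoint: str,
                   probe: Callable[[str], bool] = vespa_is_up) -> str:
    """Pick 'vespa' when the probe succeeds, otherwise fall back to 'file'."""
    return "vespa" if probe(endpoint) else "file"


# Use a stub probe so the example runs without a Vespa instance.
backend = select_backend("http://localhost:8080", probe=lambda _endpoint: False)
print(backend)
```

Passing the probe as a parameter keeps the fallback logic testable without a running Vespa container.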
