
PR Resolver

An intelligent, scalable PR analysis platform powered by RAG (Retrieval Augmented Generation).

PR Resolver uses semantic search + LLM reasoning to analyze pull requests by finding contextually similar code changes across your repository history. Deployed as containerized microservices on Kubernetes for enterprise-grade scalability.

🎯 Purpose

Webhook-driven analysis of PRs from GitHub, Bitbucket, Azure DevOps, and other version control systems. For each PR:

  1. Fetches diffs from the repository
  2. Chunks & indexes code changes with semantic embeddings
  3. Searches for similar historical changes
  4. Analyzes with Gemini LLM for intelligent insights and recommendations
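The four steps can be sketched end-to-end in Python. Every function below is a hypothetical stand-in for the real services, stubbed for illustration; this is not the project's actual API:

```python
# Illustrative sketch of the analysis pipeline; all names are
# hypothetical stand-ins, not the project's real API.

def fetch_diffs(pr_url: str) -> list[str]:
    # 1. Fetch diffs from the repository (stubbed here)
    return ["- old line\n+ new line"]

def chunk_and_index(diff: str, size: int = 1000) -> list[str]:
    # 2. Split the diff into index-ready chunks
    return [diff[i:i + size] for i in range(0, len(diff), size)]

def search_similar(chunks: list[str]) -> list[str]:
    # 3. Semantic search over historical changes (stubbed here)
    return []

def analyze(diff: str, similar: list[str]) -> str:
    # 4. LLM analysis of the diff in context (stubbed here)
    return f"{len(similar)} similar historical changes found"

def analyze_pr(pr_url: str) -> list[str]:
    # Glue the four steps together, one result per diff in the PR
    results = []
    for diff in fetch_diffs(pr_url):
        chunks = chunk_and_index(diff)
        results.append(analyze(diff, search_similar(chunks)))
    return results
```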

πŸ—οΈ Architecture

VCS Webhooks (GitHub, Bitbucket, Azure DevOps)
        ↓
    Webhook Service (multiple replicas)
        ↓
    Repo Ingestor (clones & diffs)
        ↓
    DiffChunker (index-ready format)
        ↓
    ChromaDB Vector Store (persistent)
        ↓
    RAG Retriever + Gemini LLM
        ↓
    PR Analysis & Insights

πŸ“¦ Components

Core Services

  • services/webhook/ - Receives events from version control platforms
  • services/repo_ingestor/ - Clones repositories and generates diffs
  • services/rag/ - Semantic search and LLM analysis

RAG Module (services/rag/)

  • initializer.py - Creates & caches expensive resources:

    • Ollama embeddings model
    • ChromaDB vector store
    • Google Gemini LLM
  • db_repo_ingestor.py - Fetches diffs and converts to index-ready format:

    • RepoIngestorClient - API client for repo ingestor service
    • DiffChunker - Splits diffs into overlapping chunks with metadata & IDs
  • retriever.py - Queries the vector store:

    • Semantic search for similar diffs
    • Automatic query embedding
    • Collection statistics
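The overlapping chunking DiffChunker performs can be sketched as follows; the sliding-window logic and the ID scheme in the comment are assumptions, not the real implementation:

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    # Consecutive chunks share chunk_overlap characters so that context
    # spanning a chunk boundary is not lost (a sketch of the idea, not
    # the actual DiffChunker code).
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Per-chunk IDs could then be derived, e.g. f"{commit_hash}:{file_path}:{n}"
# (hypothetical scheme).
```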

Common Models

  • common/models/filediff.py - Diff data structures
  • common/models/commit.py - Commit metadata
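A plausible shape for these models, inferred from the metadata keys used in the search example below; the field names are assumptions, not the actual definitions:

```python
from dataclasses import dataclass

@dataclass
class Commit:
    # Minimal commit metadata (hypothetical shape of common/models/commit.py)
    commit_hash: str
    message: str
    author: str

@dataclass
class FileDiff:
    # One file's change within a commit (hypothetical shape of
    # common/models/filediff.py)
    file_path: str
    diff_text: str
    commit: Commit
```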

πŸš€ Quick Start

Prerequisites

  • Python 3.9+
  • Docker & Docker Compose
  • Kubernetes cluster (for production)
  • Google API key (for Gemini LLM)

Local Setup

  1. Clone the repository

    git clone https://github.com/adsdemaybe/pr_resolver.git
    cd pr_resolver
  2. Create virtual environment

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
    pip install langchain-google-genai
  4. Set environment variables

    export GOOGLE_API_KEY="your-api-key"
    export OLLAMA_API_URL="http://localhost:11434"
  5. Start services (using Docker Compose)

    cd services/repo_ingestor
    docker-compose up -d
  6. Initialize RAG

    from services.rag.initializer import ChromaDBRAGStore
    from services.rag.retriever import ChromaDBRetriever
    
    # Create & cache expensive resources
    store = ChromaDBRAGStore()
    
    # Use the vector store for queries
    retriever = ChromaDBRetriever(vector_store=store.vector_store)

πŸ“š Usage Examples

Ingest Repository Diffs

from services.rag.db_repo_ingestor import RepoIngestorClient, DiffChunker

# Fetch diffs from repo_ingestor API
client = RepoIngestorClient("http://localhost:8000")
diffs = await client.preview_diffs(
    repo_url="https://github.com/myorg/myrepo.git",
    branch="main",
    max_commits=50
)

# Convert to index-ready format
chunker = DiffChunker(chunk_size=1000, chunk_overlap=100)
texts, metadatas, ids = chunker.diffs_to_index_format(diffs)

Add Diffs to Vector Store

from services.rag.initializer import ChromaDBRAGStore

store = ChromaDBRAGStore()
num_added = await store.add(diffs)
print(f"Added {num_added} diffs to ChromaDB")

Search for Similar Diffs

retriever = ChromaDBRetriever(vector_store=store.vector_store)

results = await retriever.search(
    query="fixed bug in authentication module",
    k=5,
    similarity_threshold=0.7
)

for doc, score in results:
    print(f"Score: {score}")
    print(f"File: {doc.metadata['file_path']}")
    print(f"Commit: {doc.metadata['commit_hash']}")

Get Collection Stats

stats = await retriever.get_stats()
print(f"Documents indexed: {stats['document_count']}")

🐳 Docker Deployment

Build Services

# Repo Ingestor
cd services/repo_ingestor
docker build -t pr-resolver/repo-ingestor:latest .

# Webhook Service (if available)
cd services/webhook
docker build -t pr-resolver/webhook:latest .

Run with Docker Compose

docker-compose -f docker-compose.yml up -d

☸️ Kubernetes Deployment

Deploy to Cluster

kubectl apply -f k8s/repo-ingestor-deployment.yaml
kubectl apply -f k8s/webhook-deployment.yaml
kubectl apply -f k8s/rag-service.yaml

Scale Services

# Scale repo ingestor to 3 replicas
kubectl scale deployment repo-ingestor --replicas=3

# Scale webhook listener to 5 replicas
kubectl scale deployment webhook --replicas=5

βš™οΈ Configuration

Environment Variables

Variable            Default                   Description
GOOGLE_API_KEY      (required)                Google Gemini API key
OLLAMA_API_URL      http://localhost:11434    Ollama embeddings service URL
CHROMA_DB_PATH      ./chroma_db               ChromaDB persistence directory
REPO_INGESTOR_URL   http://localhost:8000     Repo ingestor service URL
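These variables can be read in one place with the defaults above; the load_config helper is illustrative, not part of the codebase:

```python
import os

def load_config() -> dict:
    # GOOGLE_API_KEY has no default and raises KeyError if unset;
    # the rest fall back to the documented defaults.
    return {
        "google_api_key": os.environ["GOOGLE_API_KEY"],
        "ollama_api_url": os.environ.get("OLLAMA_API_URL", "http://localhost:11434"),
        "chroma_db_path": os.environ.get("CHROMA_DB_PATH", "./chroma_db"),
        "repo_ingestor_url": os.environ.get("REPO_INGESTOR_URL", "http://localhost:8000"),
    }
```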

RAG Configuration

store = ChromaDBRAGStore(
    collection_name="pr_resolver_diffs",
    persist_directory="./chroma_db",
    embedding_model="nomic-embed-text",
    embedding_api_url="http://localhost:11434",
    llm_model="gemini-pro",
    google_api_key="your-key"
)

πŸ“Š Performance Tuning

Chunking Strategy

# Smaller chunks = more precise search, higher latency
chunker = DiffChunker(chunk_size=500, chunk_overlap=50)

# Larger chunks = faster search, less precision
chunker = DiffChunker(chunk_size=2000, chunk_overlap=200)

Search Parameters

# Higher k = more results to analyze
results = await retriever.search(query, k=10)

# Higher threshold = stricter relevance filtering
results = await retriever.search(query, k=5, similarity_threshold=0.8)

πŸ”„ Webhook Integration

Configure webhooks in your VCS:

  • GitHub: Repository Settings β†’ Webhooks β†’ Add webhook

    • Payload URL: https://your-domain/webhooks/github
  • Bitbucket: Repository Settings β†’ Webhooks β†’ Create trigger

    • URL: https://your-domain/webhooks/bitbucket
  • Azure DevOps: Project Settings β†’ Service hooks β†’ Create subscription

    • URL: https://your-domain/webhooks/azure-devops
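Whichever platform you use, verify webhook authenticity before processing events. For GitHub, the X-Hub-Signature-256 header carries an HMAC-SHA256 of the request body keyed by your webhook secret; a minimal verification sketch (how the webhook service actually validates payloads is not shown in this repo):

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    # GitHub sends "sha256=<hex digest>" in the X-Hub-Signature-256 header,
    # where the digest is HMAC-SHA256(secret, raw request body).
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest performs a constant-time comparison to avoid
    # timing side channels.
    return hmac.compare_digest(expected, signature_header)
```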

πŸ§ͺ Testing

# Run tests
pytest tests/

# Run with coverage
pytest --cov=services tests/

πŸ“ License

MIT

🀝 Contributing

Contributions welcome! Please open an issue or submit a PR.

πŸ“§ Contact

For questions or support, reach out to the development team.
