🤖 Codebase-Companion

Chat with any public GitHub repository using Retrieval-Augmented Generation (RAG)

React · Node.js · Express · Hugging Face · Astra DB · Groq


📌 Overview

Codebase Companion is a full‑stack AI app that lets you chat with any public GitHub repository. It clones a repo, chunks & embeds the content, stores vectors in Astra DB, and answers questions using a RAG pipeline powered by Hugging Face embeddings and Groq (Llama 3 8B).


✨ Features

  • 📚 Multi‑Repository Support – Index multiple repos; switch chat sessions instantly.
  • 🧠 Intelligent Q&A – Ask about logic, structure, or purpose in natural language.
  • 🔁 Streaming Responses – Word‑by‑word streaming for a ChatGPT‑like feel.
  • 📎 Source Citing – Each answer lists the code files used as context.
  • 🧩 Modern RAG Pipeline – Accurate, grounded answers using retrieve‑then‑read.
  • 🔍 Code Location Search – Surface exact files/paths relevant to your query.
  • ⚡ Fast Inference – Groq Llama 3 8B for low‑latency responses.

🛠️ Tech Stack

  • Frontend: React, Vite, Tailwind CSS
  • Backend: Node.js, Express.js
  • AI & Data Processing:
    • Embedding Model: BAAI/bge-small-en-v1.5 (Hugging Face)
    • Vector Database: Astra DB (DataStax)
    • LLM: Groq – Llama 3 8B
  • Tools: simple-git, cors, dotenv, concurrently

⚙️ How It Works (RAG Pipeline)

Phase 1 — Index

  1. Input: User submits a public GitHub repo URL.
  2. Clone & Parse: Backend clones the repo and walks the file tree.
  3. Chunking: Code/docs are split into semantic chunks.
  4. Embedding: Chunks embedded via BAAI/bge-small-en-v1.5.
  5. Storage: Vectors + metadata saved to Astra DB (collections created dynamically).
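The chunking step can be illustrated with a minimal sketch. It uses fixed-size, overlapping line windows, which is a simpler stand-in for the semantic chunking described above and not necessarily what this project does. The function name `chunkByLines`, its parameters, and the default sizes are assumptions made for illustration.

```javascript
// Hypothetical helper: split a file's text into overlapping, line-based chunks.
// chunkSize and overlap are counted in lines; both defaults are illustrative.
function chunkByLines(text, filePath, chunkSize = 40, overlap = 10) {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const lines = text.split("\n");
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < lines.length; start += step) {
    const end = Math.min(start + chunkSize, lines.length);
    chunks.push({
      path: filePath,       // kept as metadata so answers can cite the file
      startLine: start + 1, // 1-based, inclusive
      endLine: end,         // 1-based, inclusive
      text: lines.slice(start, end).join("\n"),
    });
    if (end === lines.length) break; // this window reached the end of the file
  }
  return chunks;
}

// Usage: a 100-line file with the default window of 40 lines and overlap of 10.
const sample = Array.from({ length: 100 }, (_, i) => `line ${i + 1}`).join("\n");
const chunks = chunkByLines(sample, "src/app.js");
console.log(chunks.map((c) => `${c.startLine}-${c.endLine}`));
// → [ '1-40', '31-70', '61-100' ]
```

Each chunk carries its path and line range, so the metadata stored next to the vector is enough to cite the source file later.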

Phase 2 — Query

  1. Semantic Retrieval: The question is embedded with the same model used for indexing, and the top‑k most similar chunks are fetched from Astra DB.
  2. Context Assembly: Relevant snippets + paths composed.
  3. Answer Generation: Groq Llama 3 8B produces the final, cited answer.
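The retrieval and context-assembly steps above can be sketched in memory. In the real app the similarity search runs inside Astra DB; the `topK` function below is only a local stand-in that shows the idea. All function names, the record shape, and the prompt wording are assumptions, not the project's actual code.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// In-memory stand-in for the vector search that the database performs.
function topK(queryVector, records, k = 3) {
  return records
    .map((r) => ({ ...r, score: cosineSimilarity(queryVector, r.vector) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k);
}

// Compose the retrieved snippets and their paths into a prompt for the LLM.
function buildPrompt(question, hits) {
  const context = hits
    .map((h) => `File: ${h.path}\n${h.text}`)
    .join("\n\n---\n\n");
  return (
    "Answer using only the context below and cite the file paths you used.\n\n" +
    `${context}\n\nQuestion: ${question}`
  );
}

// Usage with toy 2-dimensional vectors (bge-small-en-v1.5 produces 384 dimensions).
const records = [
  { path: "server/rag/embed.js", text: "function embed() {}", vector: [1, 0] },
  { path: "client/src/App.jsx", text: "function App() {}", vector: [0, 1] },
  { path: "server/rag/retrieve.js", text: "function retrieve() {}", vector: [0.9, 0.1] },
];
const hits = topK([1, 0], records, 2);
console.log(hits.map((h) => h.path));
// → [ 'server/rag/embed.js', 'server/rag/retrieve.js' ]
console.log(buildPrompt("How are chunks embedded?", hits));
```

Keeping the file path next to each snippet in the prompt is what lets the model list the files it relied on.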

🧪 Local Development

Prerequisites

  • Node.js v18+
  • npm
  • Accounts/keys for Hugging Face, Groq, and Astra DB

Clone

git clone https://github.com/kartik0905/codebase-companion.git
cd codebase-companion

Install (Monorepo)

If using a single repo with shared root scripts:

npm install

Install (Split: client / server)

# Frontend
cd client && npm install
# Backend
cd ../server && npm install

Environment Variables

Create a .env in the backend root (server/.env if split; project root if monorepo) with:

# Hugging Face
HF_TOKEN="hf_..."  # used for BAAI/bge-small-en-v1.5

# Groq
GROQ_API_KEY="gsk_..."  # Llama 3 8B

# Astra DB (DataStax)
ASTRA_DB_APPLICATION_TOKEN="AstraCS:..."
ASTRA_DB_API_ENDPOINT="https://..."  # API endpoint for your database
ASTRA_DB_COLLECTION="codebase_chunks"  # app may create collections dynamically

Keep keys private. Do not commit .env.
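A startup check can catch a missing key before the first request fails. The sketch below uses the variable names from the template above; the helper `missingEnvVars` is hypothetical, and treating `ASTRA_DB_COLLECTION` as optional is an assumption based on the note that collections may be created dynamically.

```javascript
// Variable names are taken from the .env template above.
const REQUIRED_ENV_VARS = [
  "HF_TOKEN",
  "GROQ_API_KEY",
  "ASTRA_DB_APPLICATION_TOKEN",
  "ASTRA_DB_API_ENDPOINT",
];

// Hypothetical helper: return the names of required variables that are unset or blank.
function missingEnvVars(env = process.env) {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return typeof value !== "string" || value.trim() === "";
  });
}

// Usage with an explicit object; in the server you would call missingEnvVars()
// after dotenv has loaded the .env file and refuse to start if the list is not empty.
const missing = missingEnvVars({ HF_TOKEN: "hf_example", GROQ_API_KEY: "  " });
console.log(missing);
// → [ 'GROQ_API_KEY', 'ASTRA_DB_APPLICATION_TOKEN', 'ASTRA_DB_API_ENDPOINT' ]
```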


🚀 Run the App

All‑in‑one (concurrently)

npm run dev
# Backend: http://localhost:3001
# Frontend: http://localhost:5173

Split terminals

Terminal 1 — Backend

cd server
npm run dev
# http://localhost:3001

Terminal 2 — Frontend

cd client
npm run dev
# http://localhost:5173

📁 Folder Structure (example)

codebase-companion/
├─ client/
│  ├─ src/
│  └─ package.json
├─ server/
│  ├─ routes/
│  ├─ services/
│  ├─ rag/
│  │  ├─ chunking.js
│  │  ├─ embed.js
│  │  └─ retrieve.js
│  ├─ server.js
│  └─ package.json
├─ README.md
└─ ...

🔌 API (quick peek)

POST /api/index
Body: { repoUrl: string } → clones, chunks, embeds, and stores vectors.

POST /api/chat
Body: { repoId: string, question: string } → streams an answer + cites files.

Endpoint names are placeholders; adjust to match your actual routes.
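Since the index endpoint accepts a user-supplied URL, it is worth validating `repoUrl` before cloning. The sketch below is one possible validator, not the project's actual code: `parseRepoUrl`, the `owner_repo` collection naming scheme, and the 48-character cap are all assumptions, so check them against Astra DB's collection naming rules before relying on them.

```javascript
// Hypothetical validator for the `repoUrl` field of POST /api/index.
// Returns { owner, repo, collection }, or null when the input is not a GitHub repo URL.
function parseRepoUrl(repoUrl) {
  let url;
  try {
    url = new URL(repoUrl);
  } catch {
    return null; // not a parseable URL
  }
  if (url.protocol !== "https:" || url.hostname !== "github.com") return null;

  const parts = url.pathname.split("/").filter(Boolean);
  if (parts.length < 2) return null; // need both an owner and a repo name
  const owner = parts[0];
  const repo = parts[1].replace(/\.git$/, "");
  if (!repo) return null;

  // Assumed naming scheme: lowercase, other characters replaced by "_", capped at 48 characters.
  const collection = `${owner}_${repo}`
    .toLowerCase()
    .replace(/[^a-z0-9_]/g, "_")
    .slice(0, 48);

  return { owner, repo, collection };
}

// Usage
console.log(parseRepoUrl("https://github.com/kartik0905/codebase-companion.git"));
// → { owner: 'kartik0905', repo: 'codebase-companion', collection: 'kartik0905_codebase_companion' }
console.log(parseRepoUrl("https://example.com/foo/bar"));
// → null
```

Deriving the collection name from the owner and repo gives each indexed repository its own collection, which fits the multi-repository feature.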


🧭 Tips

  • Ignore large/binary folders (.git, node_modules, dist, images) during indexing.
  • Tune chunk size/overlap for your languages to maximize retrieval quality.
  • Persist per‑repo metadata so users can switch sessions quickly.
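The first tip can be sketched as a path filter applied while walking the cloned repo. The directory and extension lists below are illustrative assumptions, and `shouldIndex` is a hypothetical helper, not part of the project.

```javascript
// Illustrative ignore lists; extend them to match the repos you index.
const IGNORED_DIRS = new Set([".git", "node_modules", "dist"]);
const IGNORED_EXTENSIONS = new Set([".png", ".jpg", ".jpeg", ".gif", ".ico", ".webp"]);

// Hypothetical helper: decide whether a repo-relative file path should be indexed.
function shouldIndex(relativePath) {
  const segments = relativePath.split(/[\\/]/); // accept both / and \ separators
  const fileName = segments.pop();
  if (segments.some((dir) => IGNORED_DIRS.has(dir))) return false;

  // A leading dot (as in ".gitignore") is part of the name, not an extension.
  const dot = fileName.lastIndexOf(".");
  const ext = dot > 0 ? fileName.slice(dot).toLowerCase() : "";
  return !IGNORED_EXTENSIONS.has(ext);
}

// Usage: filter a list of paths collected while walking the file tree.
const files = [
  "server/rag/chunking.js",
  "node_modules/express/index.js",
  "client/public/logo.PNG",
  ".git/config",
  ".gitignore",
];
console.log(files.filter(shouldIndex));
// → [ 'server/rag/chunking.js', '.gitignore' ]
```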

🗺️ Roadmap / Future Improvements

  • 🔐 User Authentication to associate repos with users
  • 🔒 Private Repos via GitHub OAuth
  • ☁️ Cloud Deploy (Vercel + Render/Fly/Railway)
  • 📈 Analytics (query quality, hit‑rate, latency)
  • 🧪 Eval Suite for retrieval precision/recall

🙌 Acknowledgments


Built with ❤️ by Kartik Garg
