Codebase Companion is a full‑stack AI app that lets you chat with any public GitHub repository. It clones a repo, chunks & embeds the content, stores vectors in Astra DB, and answers questions using a RAG pipeline powered by Hugging Face embeddings and Groq (Llama 3 8B).
- 📚 Multi‑Repository Support – Index multiple repos; switch chat sessions instantly.
- 🧠 Intelligent Q&A – Ask about logic, structure, or purpose in natural language.
- 🔁 Streaming Responses – Word‑by‑word streaming for a ChatGPT‑like feel.
- 📎 Source Citing – Each answer lists the code files used as context.
- 🧩 Modern RAG Pipeline – Accurate, grounded answers using retrieve‑then‑read.
- 🔍 Code Location Search – Surface exact files/paths relevant to your query.
- ⚡ Fast Inference – Groq Llama 3 8B for low‑latency responses.
- Frontend: React, Vite, Tailwind CSS
- Backend: Node.js, Express.js
- AI & Data Processing:
  - Embedding Model: `BAAI/bge-small-en-v1.5` (Hugging Face)
  - Vector Database: Astra DB (DataStax)
  - LLM: Groq – Llama 3 8B
- Tools: `simple-git`, `cors`, `dotenv`, `concurrently`
- Input: User submits a public GitHub repo URL.
- Clone & Parse: Backend clones the repo and walks the file tree.
- Chunking: Code/docs are split into semantic chunks.
- Embedding: Chunks are embedded via `BAAI/bge-small-en-v1.5`.
- Storage: Vectors + metadata are saved to Astra DB (collections created dynamically).
- Semantic Retrieval: Top‑k chunks fetched from Astra DB.
- Context Assembly: Relevant snippets + paths composed.
- Answer Generation: Groq Llama 3 8B produces the final, cited answer.
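The semantic-retrieval step above can be sketched in plain JavaScript. In the real app Astra DB performs the vector search server‑side; this is only an illustration of top‑k cosine ranking, and the `path`/`vector` field names are assumptions, not the project's actual schema:

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored chunks against the query embedding and keep the top k.
function topK(queryVec, chunks, k = 5) {
  return chunks
    .map((c) => ({ ...c, score: cosine(queryVec, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The returned chunks (with their file paths) are what gets assembled into the prompt context and cited in the answer.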
- Node.js v18+
- npm
- Accounts/keys for Hugging Face, Groq, and Astra DB
git clone https://github.com/kartik0905/codebase-companion.git
cd codebase-companion
If using a single repo with shared root scripts:
npm install
# Frontend
cd client && npm install
# Backend
cd ../server && npm install
Create a `.env` in the backend root (`server/.env` if split; project root if monorepo) with:
# Hugging Face
HF_TOKEN="hf_..." # used for BAAI/bge-small-en-v1.5
# Groq
GROQ_API_KEY="gsk_..." # Llama 3 8B
# Astra DB (DataStax)
ASTRA_DB_APPLICATION_TOKEN="AstraCS:..."
ASTRA_DB_API_ENDPOINT="https://..." # REST endpoint for your DB keyspace
ASTRA_DB_COLLECTION="codebase_chunks" # app may create collections dynamically
Keep keys private. Do not commit `.env`.
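A quick fail-fast check at server startup can catch a missing key before the first request. This is a minimal sketch, assuming `dotenv` (already in the tools list) has populated `process.env`; the helper name is illustrative:

```javascript
// Assumes require("dotenv").config() has already run.
const REQUIRED_KEYS = [
  "HF_TOKEN",
  "GROQ_API_KEY",
  "ASTRA_DB_APPLICATION_TOKEN",
  "ASTRA_DB_API_ENDPOINT",
];

// Returns the names of required keys that are absent or blank in `env`.
function missingKeys(env, required = REQUIRED_KEYS) {
  return required.filter(
    (k) => typeof env[k] !== "string" || env[k].trim() === ""
  );
}

const missing = missingKeys(process.env);
if (missing.length > 0) {
  console.error(`Missing env vars: ${missing.join(", ")}. Check your .env file.`);
}
```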
npm run dev
# Backend: http://localhost:3001
# Frontend: http://localhost:5173
Terminal 1 — Backend
cd server
npm run dev
# http://localhost:3001
Terminal 2 — Frontend
cd client
npm run dev
# http://localhost:5173
codebase-companion/
├─ client/
│ ├─ src/
│ └─ package.json
├─ server/
│ ├─ routes/
│ ├─ services/
│ ├─ rag/
│ │ ├─ chunking.js
│ │ ├─ embed.js
│ │ └─ retrieve.js
│ ├─ server.js
│ └─ package.json
├─ README.md
└─ ...
POST /api/index
Body: { repoUrl: string }
→ clones, chunks, embeds, and stores vectors.
POST /api/chat
Body: { repoId: string, question: string }
→ streams an answer + cites files.
Endpoint names are placeholders; adjust to match your actual routes.
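A client consuming the streamed `/api/chat` response might look like the sketch below. It assumes the endpoint streams plain text word by word (as the features list describes); the endpoint path and body shape follow the placeholders above:

```javascript
// Read a streamed response body chunk by chunk, invoking onChunk as text arrives.
// Works in browsers and Node 18+ (both expose ReadableStream / TextDecoder).
async function readStream(stream, onChunk) {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let full = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    const text = decoder.decode(value, { stream: true });
    full += text;
    onChunk(text);
  }
  return full;
}

// Hypothetical usage against the placeholder endpoint:
// const res = await fetch("http://localhost:3001/api/chat", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify({ repoId: "my-repo", question: "What does server.js do?" }),
// });
// const answer = await readStream(res.body, (t) => process.stdout.write(t));
```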
- Ignore large/binary folders (`.git`, `node_modules`, `dist`, images) during indexing.
- Tune chunk size/overlap for your languages to maximize retrieval quality.
- Persist per‑repo metadata so users can switch sessions quickly.
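The first two tips can be sketched together. The ignore list and the chunk size/overlap defaults below are illustrative starting points, not the project's actual values:

```javascript
// Folders to skip entirely during indexing.
const IGNORED_DIRS = new Set([".git", "node_modules", "dist"]);

// True if no path segment is an ignored directory.
function shouldIndex(relPath) {
  return !relPath.split("/").some((part) => IGNORED_DIRS.has(part));
}

// Fixed-size sliding-window chunking with overlap, so context that spans
// a chunk boundary still appears intact in at least one chunk.
function chunkText(text, size = 800, overlap = 100) {
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break;
  }
  return chunks;
}
```

Semantic chunking (splitting on function or section boundaries) usually retrieves better than fixed windows, but a sliding window is a reasonable baseline to tune from.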
- 🔐 User Authentication to associate repos with users
- 🔒 Private Repos via GitHub OAuth
- ☁️ Cloud Deploy (Vercel + Render/Fly/Railway)
- 📈 Analytics (query quality, hit‑rate, latency)
- 🧪 Eval Suite for retrieval precision/recall