An AI-powered system that lets you talk to the history of any GitHub repository — not just its current code.
Legacy Code Archaeologist is a production-grade Retrieval-Augmented Generation (RAG) tool designed to analyze how and why a codebase evolved over time.
Instead of reading static snapshots, it mines actual Git diffs, allowing you to ask high-impact questions like:
- “Who introduced the timeout bug?”
- “Why was the authentication logic rewritten in 2021?”
- “When did this API contract change?”
Built for scalability, accuracy, and real-world engineering workflows.
- Parses real commit diffs (added/removed lines) — not just file snapshots.
- Understands why code changed, not just what changed.
- Generator-based architecture streams commits one at a time, keeping memory usage roughly constant (O(1)) regardless of history length.
- Handles massive repositories (Linux, React, Kubernetes) efficiently.
- Uses OpenAI GPT-3.5 Turbo for contextual reasoning.
- Powered by LangChain + ChromaDB for semantic search over commit history.
- Switch between:
  - Recent History (Fast Scan)
  - Deep Excavation (Full Repo Analysis)
- Automatically generates PDF audit reports of chat sessions.
- Ideal for compliance, audits, and engineering reviews.
- Shallow Git clones (--depth) for 99% faster fetches.
- Auto-cleanup of cloned repos and vector databases.
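As an illustration of the generator-based diff mining described above, here is a minimal sketch that uses only the standard library. It is not the project's actual `miner.py`: the names `DiffLine`, `iter_diff_lines`, and `batched` are hypothetical, and the input is assumed to be git-style unified diff text (with `diff --git` and `@@` markers).

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator


@dataclass
class DiffLine:
    kind: str  # "added" or "removed"
    text: str  # line content without the leading +/- marker


def iter_diff_lines(diff_lines: Iterable[str]) -> Iterator[DiffLine]:
    """Lazily yield added/removed lines from git-style unified diff text."""
    in_hunk = False
    for line in diff_lines:
        if line.startswith("diff --git"):
            in_hunk = False  # a new file section begins; headers follow
        elif line.startswith("@@"):
            in_hunk = True  # hunk header; content lines follow
        elif in_hunk and line.startswith("+"):
            yield DiffLine("added", line[1:])
        elif in_hunk and line.startswith("-"):
            yield DiffLine("removed", line[1:])
        # Context lines and the ---/+++ file headers are ignored.


def batched(items: Iterable[DiffLine], size: int) -> Iterator[list[DiffLine]]:
    """Group a stream into lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


sample = """diff --git a/client.py b/client.py
--- a/client.py
+++ b/client.py
@@ -1,3 +1,3 @@
 import requests
-TIMEOUT = 5
+TIMEOUT = 30
 session = requests.Session()
""".splitlines()

for batch in batched(iter_diff_lines(sample), size=2):
    print([(d.kind, d.text) for d in batch])
# prints [('removed', 'TIMEOUT = 5'), ('added', 'TIMEOUT = 30')]
```

Because each stage is a generator, only the current batch is held in memory, so peak usage depends on the batch size rather than on how many commits the repository has.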
Backend
- Python 3.10+
- GitPython (Custom Mining Engine)
- LangChain
- ChromaDB (Local Vector Store)
AI Engine
- OpenAI GPT-3.5 Turbo
Frontend
- Streamlit
DevOps / Tooling
- uv (Fast Python package manager)
- python-dotenv
┌─────────────────────┐
│ Git Repo            │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│ Custom Miner Engine │ ← Generator-based diff processing
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│ ChromaDB Vector     │ ← Semantic indexing
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│ OpenAI GPT-3.5 API  │ ← Reasoning & answers
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│ Streamlit UI        │ ← Chat + Time Controls
└─────────────────────┘
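The data flow in the diagram can be sketched in plain Python. This is an assumption-laden illustration, not the project's code: `CommitRecord`, `to_document`, and `build_prompt` are hypothetical names, and the vector store and LLM calls are left out. The `retrieved` argument stands in for whatever a semantic search over the commit index would return.

```python
from dataclasses import dataclass


@dataclass
class CommitRecord:
    sha: str
    author: str
    date: str  # ISO date string, e.g. "2021-03-04"
    message: str
    diff_text: str


def to_document(commit: CommitRecord) -> tuple[str, dict[str, str]]:
    """Return (text to embed, metadata to store alongside it)."""
    text = f"{commit.message}\n\n{commit.diff_text}"
    metadata = {"sha": commit.sha, "author": commit.author, "date": commit.date}
    return text, metadata


def build_prompt(question: str, retrieved: list[tuple[str, dict[str, str]]]) -> str:
    """Combine retrieved commit documents and the question into one prompt."""
    blocks = [
        f"[{meta['sha'][:7]}] {meta['author']} on {meta['date']}\n{text}"
        for text, meta in retrieved
    ]
    context = "\n\n---\n\n".join(blocks)
    return (
        "Answer using only the commits below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


commit = CommitRecord(
    sha="a1b2c3d4e5",
    author="dana",
    date="2021-03-04",
    message="Raise request timeout",
    diff_text="-TIMEOUT = 5\n+TIMEOUT = 30",
)
print(build_prompt("Who changed the timeout?", [to_document(commit)]))
```

Keeping the author, date, and SHA as metadata next to the embedded text is what would let answers to "who" and "when" questions cite a specific commit.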
- Paste any public GitHub repository URL.
- Select:
  - Fast Mode → Recent commits only
  - Deep Mode → Full historical analysis
- Ask natural language questions:
  - “Why was this function refactored?”
  - “Who changed the authentication logic?”
- Export a PDF audit report if needed.
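One way the two modes could map onto Git is sketched below: a shallow clone for Fast Mode and a full clone for Deep Mode, both into a scratch directory that is removed afterwards. The function names, the mode strings `"fast"` and `"deep"`, and the default depth of 50 are assumptions made for illustration; only `git clone --depth` itself is a real Git option.

```python
import tempfile
from contextlib import contextmanager
from typing import Iterator


def build_clone_command(repo_url: str, dest: str, mode: str, depth: int = 50) -> list[str]:
    """Build a `git clone` argument list for the chosen scan mode."""
    if mode not in ("fast", "deep"):
        raise ValueError(f"unknown mode: {mode!r}")
    command = ["git", "clone"]
    if mode == "fast":
        # --depth truncates history to the most recent `depth` commits.
        command += ["--depth", str(depth)]
    return command + [repo_url, dest]


@contextmanager
def temporary_workdir() -> Iterator[str]:
    """Yield a scratch directory that is deleted when the block exits."""
    with tempfile.TemporaryDirectory(prefix="archaeologist-") as path:
        yield path


with temporary_workdir() as workdir:
    command = build_clone_command(
        "https://github.com/kartik0905/git-archaeologist.git", workdir, "fast"
    )
    print(command)
```

The command list could then be passed to `subprocess.run(command, check=True)` inside the `with` block, so the clone is cleaned up even if later processing raises an error.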
git clone https://github.com/kartik0905/git-archaeologist.git
cd git-archaeologist
pip install -r requirements.txt
Create a .env file:
OPENAI_API_KEY=your_api_key_here
streamlit run app.py
.
├── app.py # Streamlit UI & user interaction
├── miner.py # Core mining engine (diff parsing, batching)
├── vector_store.py # ChromaDB integration
├── prompts/ # LLM prompt templates
├── utils/ # Helpers & cleanup logic
├── reports/ # Generated PDF audit reports
└── requirements.txt
- Designed like a real production system, not a demo.
- Handles large-scale repositories efficiently.
- Solves a real developer pain point — understanding legacy code.
- Built with extensibility in mind (CI analysis, PR reviews, blame tracking).
MIT License
Star the repository and feel free to contribute or fork it for your own tooling.
Built with engineering discipline, not just prompts.