Built to solve the challenge of navigating large codebases, this repo creates an AI agent that extracts knowledge directly from any given GitHub repo, making its documentation instantly searchable. The codebase downloads the repo, indexes its Markdown content with embeddings, and uses an LLM agent to answer your questions with file references.
- Downloads a GitHub repo ZIP (main branch) and extracts only `.md`/`.mdx` files (`ingest.py`).
- Embeds content with sentence-transformers (`multi-qa-distilbert-cos-v1`) and builds a `minsearch` vector index.
- Exposes a search tool to a PydanticAI `Agent` (using `gpt-4o-mini`) that cites GitHub file paths in responses (`search_agent.py`, `search_tools.py`).
- Offers both a CLI chat loop (`main.py`) and a Streamlit UI (`app.py`).
- Logs every interaction to JSON in `logs/` for review or evaluation (`logs.py`, `eval.py`).
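The download-and-filter step above can be sketched as follows. This is a minimal illustration, assuming the ZIP bytes have already been fetched; it does not claim to match the actual code in `ingest.py`:

```python
import io
import zipfile


def extract_markdown(zip_bytes: bytes) -> dict[str, str]:
    """Return {path: text} for every .md/.mdx entry in a repo ZIP archive."""
    docs = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            # keep only Markdown content, skipping code and other files
            if name.endswith((".md", ".mdx")):
                docs[name] = zf.read(name).decode("utf-8")
    return docs
```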
- Create and activate a Python 3.11+ virtual environment. Example with `venv`:

  ```shell
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Provide an API key for the chat model (OpenAI-compatible). For example:

  ```shell
  export OPENAI_API_KEY=your_key_here
  ```

Start an interactive session targeting a GitHub repo:

```shell
python main.py --repo_owner elastic --repo_name elasticsearch
```

Type questions; enter `stop` to exit. The script will download the repository, build the vector index, and answer using the search tool.

Launch the web app:

```shell
streamlit run app.py
```

In the sidebar, set the repo owner and name (e.g., `elastic` / `elasticsearch`), click Initialize / Rebuild Index, then ask questions in the chat box.
- Ingestion – Downloads the repo ZIP from GitHub and parses Markdown files with `frontmatter` into records.
- Indexing – Creates sentence-transformer embeddings and fits a `minsearch.VectorSearch` index (top‑5 results used by default).
- Agent – Built with PydanticAI's `Agent` class using OpenAI's `gpt-4o-mini`; it calls the search tool before answering and injects GitHub blob links for cited files.
- Logging – All conversations are written to timestamped JSON files in `logs/` for auditing or evaluation.
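At its core, the indexing step retrieves the top-k most similar embedding vectors by cosine similarity. A minimal NumPy illustration of that idea (not the `minsearch.VectorSearch` internals, whose API may differ):

```python
import numpy as np


def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k documents most similar to the query (cosine)."""
    # normalize the query and each document vector to unit length
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    # dot product of unit vectors == cosine similarity; sort descending
    scores = d @ q
    return np.argsort(-scores)[:k]
```

With cosine-normalized embeddings like `multi-qa-distilbert-cos-v1` produces, this dot-product ranking is exactly cosine ranking.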
- Chunking: `ingest.index_data(..., chunk=True, chunking_params={...})` will split documents with a sliding window before indexing.
- Synthetic QA & eval: `question_generation.py` can sample repo content to generate questions; `eval.py` scores logged responses against a checklist.
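The sliding-window chunking mentioned above can be sketched as follows. The `size` and `step` parameters are hypothetical stand-ins for whatever keys `chunking_params` actually accepts:

```python
def sliding_window_chunks(text: str, size: int = 2000, step: int = 1000) -> list[str]:
    """Split text into overlapping windows: each chunk starts `step` chars
    after the previous one and is up to `size` chars long."""
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Because `step < size`, consecutive chunks overlap, so a sentence cut by one window boundary is still intact in the neighboring chunk.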