Built to solve the challenge of navigating large codebases, this repo creates an AI agent that extracts knowledge directly from any given GitHub repo, making its documentation instantly searchable. The codebase downloads the repo, indexes its Markdown content with embeddings, and uses an LLM agent to answer your questions with file references.
- Downloads a GitHub repo ZIP (main branch) and extracts only `.md`/`.mdx` files (`ingest.py`).
- Embeds content with sentence-transformers (`multi-qa-distilbert-cos-v1`) and builds a `minsearch` vector index.
- Exposes a search tool to a PydanticAI `Agent` (using `gpt-4o-mini`) that cites GitHub file paths in responses (`search_agent.py`, `search_tools.py`).
- Offers both a CLI chat loop (`main.py`) and a Streamlit UI (`app.py`).
- Logs every interaction to JSON in `logs/` for review or evaluation (`logs.py`, `eval.py`).
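The download-and-filter step above can be sketched as follows. This is a minimal illustration, assuming the ZIP bytes have already been fetched; it does not claim to match the actual code in `ingest.py`:

```python
import io
import zipfile


def extract_markdown(zip_bytes: bytes) -> dict[str, str]:
    """Return {path: text} for every .md/.mdx entry in a repo ZIP archive."""
    docs = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            # keep only Markdown content, skipping code and other files
            if name.endswith((".md", ".mdx")):
                docs[name] = zf.read(name).decode("utf-8")
    return docs
```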
- Create and activate a Python 3.11+ virtual environment. Example with `venv`:

  ```shell
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Provide an API key for the chat model (OpenAI-compatible). For example:

  ```shell
  export OPENAI_API_KEY=your_key_here
  ```

Start an interactive session targeting a GitHub repo:

```shell
python main.py --repo_owner elastic --repo_name elasticsearch
```

Type questions; enter `stop` to exit. The script will download the repository, build the vector index, and answer using the search tool.

Launch the web app:

```shell
streamlit run app.py
```

In the sidebar, set the repo owner and name (e.g., `elastic` / `elasticsearch`), click Initialize / Rebuild Index, then ask questions in the chat box.
- Ingestion – Downloads the repo ZIP from GitHub and parses Markdown files with `frontmatter` into records.
- Indexing – Creates sentence-transformer embeddings and fits a `minsearch.VectorSearch` index (top‑5 results used by default).
- Agent – Built with PydanticAI's `Agent` class using OpenAI's `gpt-4o-mini`; it calls the search tool before answering and injects GitHub blob links for cited files.
- Logging – All conversations are written to timestamped JSON files in `logs/` for auditing or evaluation.
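At its core, the indexing step retrieves the top-k most similar embedding vectors by cosine similarity. A minimal NumPy illustration of that idea (not the `minsearch.VectorSearch` internals, whose API may differ):

```python
import numpy as np


def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k documents most similar to the query (cosine)."""
    # normalize the query and each document vector to unit length
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    # dot product of unit vectors == cosine similarity; sort descending
    scores = d @ q
    return np.argsort(-scores)[:k]
```

With cosine-normalized embeddings like `multi-qa-distilbert-cos-v1` produces, this dot-product ranking is exactly cosine ranking.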
- Chunking: `ingest.index_data(..., chunk=True, chunking_params={...})` will split documents with a sliding window before indexing.
- Synthetic QA & eval: `question_generation.py` can sample repo content to generate questions; `eval.py` scores logged responses against a checklist.
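The sliding-window chunking mentioned above can be sketched as follows. The `size` and `step` parameters are hypothetical stand-ins for whatever keys `chunking_params` actually accepts:

```python
def sliding_window_chunks(text: str, size: int = 2000, step: int = 1000) -> list[str]:
    """Split text into overlapping windows: each chunk starts `step` chars
    after the previous one and is up to `size` chars long."""
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Because `step < size`, consecutive chunks overlap, so a sentence cut by one window boundary is still intact in the neighboring chunk.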