GitHub Project Summarizer

The GitHub Project Summarizer is a FastAPI-based tool that takes any public (or private, with a token) GitHub repository URL and returns a concise summary of the project. It clones the repository into a temporary folder, analyzes its file structure, and uses a large language model (LLM) in an agentic loop to iteratively read and summarize the most relevant files. The final output is a JSON object containing the project’s overall summary, detected technologies, and repository structure.

How It Works

  • Clone Repository: The service clones the given GitHub repo into a temporary directory (default: ${TEMP_DIR:-/tmp/github_project_summarizer}). For private repositories a GitHub token can be supplied via the GITHUB_TOKEN environment variable or passed in the request payload under token. Edge cases (invalid URL, clone errors, submodules, or shallow clone needs) are caught and returned as clear error messages. The temp folder is removed on successful completion.

  • Scan File Tree: The cloned repo is scanned to build a directory tree and a file map (extensions, paths, sizes). Files filtered as "non-important" are excluded based on configurable rules: default exclusions include .git/, node_modules/, common binary/media extensions (.png,.jpg,.exe,.zip), and files larger than a default size threshold (MAX_FILE_SIZE_MB, default 5 MB). These filters are configurable in utils/filter_file_map.py.
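
    The filtering rules described above can be sketched roughly as follows (an illustrative sketch, not the actual code in utils/filter_file_map.py; the function and constant names are assumptions):

    ```python
    import os

    # Illustrative defaults; the real values live in utils/filter_file_map.py.
    EXCLUDED_DIRS = {".git", "node_modules"}
    EXCLUDED_EXTS = {".png", ".jpg", ".exe", ".zip"}
    MAX_FILE_SIZE_MB = 5

    def is_important(path: str, size_bytes: int) -> bool:
        """Return True if a file should be kept in the file map."""
        parts = path.replace("\\", "/").split("/")
        if any(p in EXCLUDED_DIRS for p in parts):
            return False
        _, ext = os.path.splitext(path)
        if ext.lower() in EXCLUDED_EXTS:
            return False
        if size_bytes > MAX_FILE_SIZE_MB * 1024 * 1024:
            return False
        return True
    ```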

  • Build Initial Context: To establish the initial context for the agentic workflow, the service scans all README-like files (README* in top-level and docs/ index files). If multiple README files exist, they are concatenated to form the initial summary that bootstraps the agentic loop.
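
    The bootstrap step can be imagined along these lines (a minimal sketch assuming pathlib-style scanning; the function name and exact glob patterns are illustrative):

    ```python
    from pathlib import Path

    def build_initial_context(repo_root: str) -> str:
        """Concatenate top-level README* files (and docs/ index files)
        to seed the agentic loop's initial summary."""
        root = Path(repo_root)
        candidates = sorted(root.glob("README*")) + sorted(root.glob("docs/index*"))
        texts = []
        for f in candidates:
            if f.is_file():
                texts.append(f.read_text(encoding="utf-8", errors="replace"))
        return "\n\n".join(texts)
    ```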

  • Core LLM / Agentic Components:

    • File Summarizer Workflow:
      This workflow processes any given file by first breaking it into manageable chunks to respect the model’s context window. Each chunk is summarized individually using the LLM, and these intermediate summaries are progressively combined to build a comprehensive understanding of the file. The result is a concise yet information-dense file-level summary that captures key logic, structure, and intent.

    • Repository Scan Agent:

      This agent incrementally improves its understanding of the repository by deciding which files to analyze next, using the File Summarizer Workflow as a tool. It selects up to N files per iteration (default N = 5) and runs up to max_iters iterations (default max_iters = 10), so it reads at most N * max_iters files (50 by default). These defaults are defined in utils/llm_setup.py and can be overridden by request parameters. At each iteration, the agent is given:

      • The current repository summary (built from all context up to the previous iteration).
      • The repository file tree.
      • Additional summarized file insights.

      Using this context, the LLM agent (via LangGraph) selects up to N new files that it determines will most improve the overall repository understanding. This dynamic, relevance-driven selection ensures the system prioritizes the most informative parts of the codebase first.

    • Parallel File Summarization:
      The selected N files are summarized concurrently using the File Summarizer sub-graph. Each file is chunked to respect context limitations and processed independently. Once all summaries are completed, they are returned to the Repository Scan Agent to update the global context.

    • Iterative Refinement:
      After incorporating the newly summarized files, the agent evaluates whether additional files should be analyzed. The loop continues until either:

      • The agent determines the summary is sufficiently complete, or
      • The maximum number of iterations (max_iters) is reached.

      The prompting strategy evolves across iterations:

      • First iteration: The agent is required to select files to ensure sufficient initial context.
      • Intermediate iterations: The agent may choose to continue scanning or stop early.
      • Final iteration: The agent is explicitly instructed to finalize the summary without selecting additional files.
    • Result Assembly:
      Once the loop concludes, the system compiles:

      • summary — A consolidated, high-level overview of the repository
      • technologies — Detected languages, frameworks, and tools
      • structure — The repository’s file tree representation
      • Additional metadata — Such as number of iterations executed and files processed

      The final output is returned as a structured JSON response.
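
    Put together, the loop above can be sketched as plain Python (a simplified, illustrative sketch; the real implementation is a LangGraph graph, and select_files/summarize_file stand in for the LLM agent and the file sub-graph):

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def scan_repository(tree, select_files, summarize_file, n=5, max_iters=10):
        """Iteratively grow a repo summary. `select_files(summary, tree, insights)`
        stands in for the LLM agent; `summarize_file(path)` for the file sub-graph."""
        summary, insights = "", {}
        for it in range(max_iters):
            must_pick = (it == 0)              # first iteration must select files
            last_chance = (it == max_iters - 1)  # final iteration must finalize
            picked = [] if last_chance else select_files(summary, tree, insights)[:n]
            if must_pick and not picked:
                raise ValueError("first iteration must select files")
            if not picked:
                break  # the agent (or the final iteration) decided the summary is complete
            # Parallel file summarization of the selected files.
            with ThreadPoolExecutor(max_workers=n) as pool:
                for path, text in zip(picked, pool.map(summarize_file, picked)):
                    insights[path] = text
            # Fold the new file summaries into the global context.
            summary = summary + "\n" + "\n".join(insights[p] for p in picked)
        return summary.strip(), insights
    ```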

  • Logging & Debugging: Throughout this process, detailed logs are kept. The code uses a clean, modular logging configuration so you can trace what’s happening at each step. If a LangSmith API key is configured, all LLM calls and agent decisions are automatically traced in LangSmith for full observability of the summarization process. This helps debug or audit exactly how the summary was generated.

  • LLM Setup: The tool supports multiple LLM providers. Configure provider selection and keys via environment variables:

    • LLM_PROVIDER — e.g. groq or nebius
    • GROQ_API_KEY — required if LLM_PROVIDER=groq
    • NEBIUS_API_KEY — required if LLM_PROVIDER=nebius

    Provider-specific keys are read from .env via python-dotenv. See utils/llm_setup.py for where to change provider defaults and parameters such as LLM_PARALLEL_CALLS.
  • Prompt Engineering: The LLM prompts are carefully designed (in the agents/file_summary_prompts.py and agents/global_summary_prompts.py modules) to guide the agent’s behavior. For instance, the first-time global prompt forces the agent to always pick files (so the summary starts from something), while later prompts allow it to say “no more files needed”. A final prompt (when reaching max iterations) explicitly tells the agent not to continue further. This ensures the agentic loop behaves as intended at each stage.

  • Context Window Management: To respect the model’s context window, any file’s content is split into chunks. The summarizer reads and summarizes each chunk, then combines them. In each iteration, the combined context (existing summary + tree + recent file summaries) is limited by chunk size and count, so the prompt to the LLM never exceeds what it can handle.
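
    The chunk-and-combine strategy can be sketched as follows (illustrative only; the chunk size and the summarize call are assumptions, with the real limits configured in utils/llm_setup.py):

    ```python
    def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
        """Split file content into chunks small enough for the model's context window."""
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]

    def summarize_file(text: str, llm_summarize, max_chars: int = 8000) -> str:
        """Summarize each chunk, folding partial summaries into one file-level summary.
        `llm_summarize(prompt)` stands in for the actual LLM call."""
        partial = ""
        for chunk in chunk_text(text, max_chars):
            partial = llm_summarize(f"Existing summary:\n{partial}\n\nNew content:\n{chunk}")
        return partial
    ```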

The system does not execute repository code. All analysis is static and read-only.

Execution Flow

  1. Clone repository
  2. Build filtered file tree
  3. Summarize README files
  4. Agent selects N files
  5. Parallel file summarization
  6. Iterate until stop → Return final JSON

Installation

  1. Clone this repository locally, then create and activate a virtual environment:
python -m venv .llm_chunker_env
source .llm_chunker_env/Scripts/activate   # Windows (Git Bash); on Linux/macOS use: source .llm_chunker_env/bin/activate
  2. Dependencies: Install the required Python packages:
pip install -r requirements.txt
  3. Environment: Create a .env file in the project root containing your API key(s). For example, if using Nebius:
NEBIUS_API_KEY=your_nebius_api_key_here
GITHUB_TOKEN=your_github_token_here   # optional, for private repos
LANGSMITH_API_KEY=your_langsmith_key_here  # optional
  4. Run the app: Start the FastAPI server:
uvicorn main:app --reload

This will serve the summarizer on http://127.0.0.1:8000

Usage

  • Endpoint: POST /summarize with JSON body containing github_url.
  • Example – Valid URL
curl -X POST http://127.0.0.1:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"github_url": "https://github.com/psf/requests"}'

OR

curl -X POST http://127.0.0.1:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"github_url": "https://github.com/pymc-devs/pymc"}'

This will return a JSON response containing:

  • The project summary
  • Detected technologies
  • Repository structure

  • Example – Invalid URL

If the URL is malformed or the repository does not exist, the service returns a 400/500 error response.

curl -X POST http://127.0.0.1:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"github_url": "https://github.com/psf/reqsdsuests"}'

Example error response:

{
  "status": "error",
  "message": "Repository not found"
}

Output

The response JSON includes:

  • summary: A high-level textual overview of the project’s purpose and functionality.

  • technologies: A list of detected languages and frameworks (e.g., ["Python", "Docker"]).

  • structure: A representation of the repository’s directory and file tree.

  • Additional metadata (number of files processed, iterations, etc.) may also be included for transparency.
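
A client might consume the response along these lines (a hedged sketch; only summary, technologies, and structure are documented fields, and the helper function shown is hypothetical):

```python
import json

def parse_summary_response(body: str) -> dict:
    """Validate the documented top-level fields of a /summarize response."""
    data = json.loads(body)
    for key in ("summary", "technologies", "structure"):
        if key not in data:
            raise KeyError(f"missing field: {key}")
    return data

# A hypothetical response in the documented shape.
example = '{"summary": "HTTP library for humans", "technologies": ["Python"], "structure": {"src": ["main.py"]}}'
print(parse_summary_response(example)["technologies"])
```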

Internal Architecture

  • Modular Design: Utility code is in utils/ (for cloning, scanning, loading files, configuring the LLM, etc.). Agent logic lives in agents/ with separate prompt definitions. This separation makes it easy to adjust LLM settings or prompts without touching the core flow.

  • LangGraph/Agents: The summarization flow is implemented using LangGraph (from LangChain) graphs. There is a file summary graph (summarizing one file) and a repository summary graph (orchestrating the agent loop). Each iteration of the repo graph involves a call to an LLM agent to pick files and a parallel launch of the file summary graph for those files.

  • Extensibility: You can tweak settings like max_iters, LLM_PARALLEL_CALLS, or token limits in utils/llm_setup.py. The prompts in agents/*.py can be customized to change how the agent prioritizes content.

This design makes it easy to:

  • Swap LLM providers
  • Modify prompt strategies
  • Adjust summarization behavior
  • Debug specific components independently

LangGraph / Agent-Based Workflow

The summarization flow is implemented using LangGraph (built on LangChain).

There are two primary graphs:

  1. File Summary Graph

    • Summarizes individual files.
    • Handles chunking to respect context window limits.
  2. Repository Summary Graph

    • Orchestrates the agent loop.
    • Uses an LLM agent to:
      • Select the next N files (N = 5 by default).
      • Trigger parallel summarization.
      • Decide whether to continue or stop.
    • Stops when:
      • max_iters is reached, or
      • The agent determines the summary is sufficient.

Each iteration:

  • Updates the global context.
  • Incorporates new file summaries.
  • Re-evaluates remaining relevant files.

Extensibility

You can adjust:

  • max_iters
  • LLM_PARALLEL_CALLS
  • Token and chunk limits
  • Prompt strategies in agents/*.py

Configuration values are primarily located in:

utils/llm_setup.py

This allows tuning for:

  • Cost optimization
  • Speed vs. depth trade-offs
  • Different model context window sizes

Tracing and Logs

All major steps emit debug/info logs (see logs/application.py). If you configure a LangSmith API key, each LLM and agent call is logged to LangSmith, providing an end-to-end trace of how the summary was built (useful for debugging or auditing).
