Releases: sandy-sp/gittxt

Release v1.7.7

14 Apr 16:59

🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries, reverse engineering & bundling

🚀 Gittxt: Get Text from Git — Optimized for AI

Gittxt is an open-source tool that transforms GitHub repositories into LLM-compatible datasets.

Built for developers, data scientists, and AI engineers, Gittxt extracts repository content and structures it as clean .txt, .json, and .md output for use in:

  • Prompt engineering
  • Fine-tuning & retrieval
  • Codebase summarization
  • Open-source LLM workflows

💡 Why Gittxt?

Large Language Models often expect input in very specific formats. Many tools (e.g., ChatGPT, Gemini, Ollama) struggle with arbitrary GitHub URLs, complex folders, or non-text assets.

Gittxt bridges this gap by:

  • Extracting all usable text from a repo
  • Organizing it for easy ingestion by LLMs
  • Offering structured .txt, .json, .md, .zip outputs
  • Giving you full control with filtering, formatting, and plugin support

✨ Features at a Glance

  • ✅ Text extractor for code, docs, config files
  • ✅ Output: .txt, .json, .md, .zip
  • ✅ CLI and plugin system (FastAPI, Streamlit)
  • ✅ AI-ready summaries (OpenAI / Ollama)
  • ✅ Reverse engineer .txt/.json reports back into repo structure
  • ✅ .gittxtignore support
  • ✅ Async scanning for large projects
  • ✅ Works offline and in constrained compute environments

📁 Output Types

outputs/
├── txt/         # Plain text report
├── json/        # Structured metadata
├── md/          # Markdown-formatted summary
└── zip/         # Bundled results + manifest
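
As a rough illustration of consuming this layout from a script, the sketch below lists whatever files sit in each format folder. The folder names come from the tree above; the helper name `collect_outputs` is hypothetical and not part of Gittxt.

```python
from pathlib import Path

# Sub-directory names taken from the output tree above.
FORMAT_DIRS = ("txt", "json", "md", "zip")


def collect_outputs(output_root):
    """Map each format folder to the sorted file names found inside it."""
    root = Path(output_root)
    found = {}
    for name in FORMAT_DIRS:
        folder = root / name
        if folder.is_dir():
            found[name] = sorted(p.name for p in folder.iterdir() if p.is_file())
        else:
            # A format that was not requested simply has no folder.
            found[name] = []
    return found
```

Calling `collect_outputs("outputs")` after a scan would return a dictionary keyed by format, which is a convenient starting point for feeding reports into a downstream pipeline.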

🚀 Quickstart

Install

pip install gittxt

Run your first scan

gittxt scan https://github.com/sandy-sp/gittxt --output-format txt,json --lite --zip

Reverse engineer a summary

gittxt re outputs/project.md -o ./restored

🌐 Explore the Visual Web App

Try the hosted version (no install required!)

👉 Launch Streamlit App


📈 Gittxt for AI Workflows

  • Use it to build structured input for LLMs
  • Ideal for prompt chaining, document agents, code summarization
  • Helps transform messy repos into single-file, AI-consumable reports

📖 Full Documentation

All CLI flags, plugins, formats, and filters are documented here:

📚 Explore Gittxt Docs


🔧 Plugin Support

Gittxt supports modular plugins:

  • gittxt-api: Run via FastAPI backend
  • gittxt-streamlit: Interactive dashboard

Install & run with:

gittxt plugin install gittxt-streamlit
gittxt plugin run gittxt-streamlit

🧠 Built for Developers & AI Engineers

Created by Sandeep Paidipati, Gittxt was born out of a need to:

  • Quickly preview and summarize GitHub repos with LLMs
  • Avoid manual copying, filtering, and converting files
  • Create AI-ready datasets for learning and experimentation

🙏 Support the Project

  • ⭐️ Star this repo if it helped you
  • 🧵 Share it with your dev/AI community
  • 🤝 Contact me for collaboration or sponsorship

🔒 License

MIT License © Sandeep Paidipati


Release v1.7.5

12 Apr 05:13

🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries, reverse engineering & bundling

📝 Gittxt: Get text from Git repositories in AI-ready formats


✨ What is Gittxt?

Gittxt is a powerful CLI and plugin framework that extracts structured text and metadata from Git repositories. It’s designed to help you build AI-ready datasets, analyze large codebases, and even reverse engineer report outputs.

Use it for:

  • 🔍 Curating datasets from code and documentation
  • 🗃️ Generating .txt, .json, .md, and .zip bundles
  • 📑 Extracting and classifying technical files by sub-type
  • 🧠 Analyzing size, token count, and file types
  • 🔄 Reconstructing full project trees from summary reports

🚀 Features

  • File-Type Detection (extension, MIME, content heuristic)
  • .gittxtignore Support (with --sync)
  • Subcategory Classification (docs, config, code, etc.)
  • Async File I/O for scalable performance
  • Lite Mode for minimal outputs (--lite)
  • Bundled ZIPs (--zip) with manifest, summary, README
  • Reverse Engineering from .txt, .md, .json reports
  • Plugin System: gittxt-api, gittxt-streamlit, etc.

🏗️ Installation

pip install gittxt

Or for development:

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
poetry run gittxt config install  # Optional installer

⚙️ Quickstart

gittxt scan https://github.com/sandy-sp/gittxt --output-format txt,json --zip --lite
gittxt re outputs/gittxt_summary.json

🖥️ CLI Commands

gittxt scan [OPTIONS] [REPOS]...
gittxt config [SUBCOMMANDS]
gittxt clean [--output-dir]
gittxt re REPORT_FILE [--output-dir]
gittxt plugin [list|install|run|uninstall]

🔌 Plugin System

gittxt plugin list
gittxt plugin install gittxt-api
gittxt plugin run gittxt-api

Plugins include:

  • 🧪 gittxt-api: FastAPI backend for scanning and summaries
  • 🖥️ gittxt-streamlit: Interactive visual dashboard

📦 Output Formats

<output_dir>/
├── txt/
├── json/
├── md/
└── zip/
    ├── summary.json
    ├── manifest.json
    ├── outputs/
    └── assets/

🔄 Reverse Engineer

gittxt re report.txt -o ./restored

This recreates the original file structure in a ZIP from Gittxt .txt, .md, or .json reports.
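
To make the idea concrete, here is a hedged sketch of rebuilding a ZIP from a JSON report. The report layout used here (`{"files": [{"path": ..., "content": ...}]}`) is an assumption for illustration only; the real Gittxt report schema may differ, and `rebuild_zip` is a hypothetical helper.

```python
import io
import json
import zipfile
from pathlib import PurePosixPath


def rebuild_zip(report_json):
    """Return ZIP bytes rebuilt from a report in the assumed schema."""
    report = json.loads(report_json)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for entry in report["files"]:
            path = PurePosixPath(entry["path"])
            # Refuse absolute paths and ".." segments so every entry
            # stays inside the archive root.
            if path.is_absolute() or ".." in path.parts:
                raise ValueError(f"unsafe path in report: {entry['path']}")
            archive.writestr(str(path), entry["content"])
    return buffer.getvalue()
```

The path check matters because a report is untrusted input: without it, a crafted entry could place files outside the extraction directory when the archive is unpacked.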


📚 Documentation

The documentation is now organized as a full docs site covering:

  • ✅ Getting Started
  • ✅ CLI Reference
  • ✅ API Endpoints
  • ✅ Reverse Engineering
  • ✅ Developer & Contributor Guide

🛣️ Roadmap

  • ✅ Plugin framework with API/Streamlit
  • ✅ Reverse from Gittxt reports
  • ⏳ AI-powered summaries
  • ⏳ Live web UI

🤝 Contributing

See Contributing Guide

make lint     # Code style
make test     # Run CLI + API tests

🛡️ License

MIT License © Sandeep Paidipati


Gittxt — Get text from Git repositories in AI-ready formats.

Release v1.7.3

11 Apr 07:05

🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling

📝 Gittxt: Get text from Git repositories in AI-ready formats


✨ What is Gittxt?

Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.

With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:

  • 🔍 Curating LLM training data from source code
  • 🗃️ Converting repos into structured .txt, .json, .md, and .zip outputs
  • 📑 Extracting docs, comments, and markdown files from large monorepos
  • 🧠 Analyzing repositories by token counts, file size, and content types
  • 📦 Bundling outputs for reproducibility and downstream pipelines

It supports both local folders and GitHub URLs with branch/subdir targeting.


🚀 Features

  • Dynamic File-Type Filtering (extension + MIME + content heuristics)
  • Smart Directory Tree Summaries with depth and exclude support
  • Multiple Output Formats: .txt, .json, .md, .zip
  • Lite Mode (--lite) for fast, minimal reports
  • ZIP Bundling with --zip, including summary.json, manifest.json, and assets
  • Rich Summary Tables with size, token, and type breakdowns
  • .gittxtignore support for repo-specific exclusions
  • Async File I/O for efficient scanning
  • Reverse Engineering (gittxt re) to reconstruct repositories from reports

🏗️ Installation

🐍 Using pip (stable)

pip install gittxt

📦 Using Poetry

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional Gittxt setup
poetry run gittxt config install

⚙️ Quickstart Example

# Scan and bundle
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite

# Reverse engineer from report
gittxt re exports/gittxt_summary.txt

👉 This will:

  • Scan the repository root
  • Output .txt and .json summary files
  • Bundle outputs in a ZIP with manifest and summary
  • Reconstruct original files and structure from a Gittxt report

More examples → Usage Examples


🖥️ CLI Usage

gittxt scan [OPTIONS] [REPOS]...

📦 Scan directories or GitHub repos (textual only).

Options

Option                                   Description
-x, --exclude-dir                        Exclude folder paths
-o, --output-dir PATH                    Custom output directory
-f, --output-format TEXT                 Comma-separated: txt, json, md
-i, --include-patterns TEXT              Glob to include (only textual)
-e, --exclude-patterns TEXT              Glob to exclude
--zip                                    Create a ZIP bundle
--lite                                   Generate minimal output instead of full content
--sync                                   Opt-in to .gitignore usage
--size-limit INTEGER                     Max file size in bytes
--branch TEXT                            Git branch for remote repos
--tree-depth INTEGER                     Limit tree output to N levels
--log-level [debug|info|warning|error]   Set log verbosity level
--help                                   Show CLI help and exit

Run gittxt scan --help for the full reference.
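
When scans are driven from a script, it can help to assemble the argument list in one place. The sketch below only builds the command and does not run it; it uses flags listed in the options above, while the helper name `build_scan_command` and its parameters are hypothetical.

```python
import shlex


def build_scan_command(repos, formats=("txt",), output_dir=None,
                       branch=None, zip_bundle=False, lite=False):
    """Build the argument list for a `gittxt scan` call without running it."""
    if not repos:
        raise ValueError("at least one repository path or URL is required")
    command = ["gittxt", "scan", *repos, "--output-format", ",".join(formats)]
    if output_dir is not None:
        command += ["--output-dir", str(output_dir)]
    if branch is not None:
        command += ["--branch", branch]
    if zip_bundle:
        command.append("--zip")
    if lite:
        command.append("--lite")
    return command


command = build_scan_command(
    ["https://github.com/sandy-sp/gittxt"],
    formats=("txt", "json"),
    zip_bundle=True,
    lite=True,
)
# Shell-quoted form, handy for logging before execution.
print(shlex.join(command))
```

With Gittxt installed, the resulting list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.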


Reverse Engineer Command

gittxt re [OPTIONS] REPORT_FILE

🔄 Reconstruct original files and structure from Gittxt .txt, .md, or .json reports. Outputs a ZIP with recovered content.

Options

Option            Description
-o, --output-dir  Custom output directory for reconstructed files

Example Usage

gittxt re path/to/report.txt

This will:

  • Take a Gittxt-generated report (.txt, .md, or .json)
  • Reconstruct the original file structure as a ZIP archive
  • Save the ZIP to the specified output directory, or to the current directory by default

📘 Learn more → Reverse Engineering Guide


📦 Output Formats

Each scan produces structured outputs:

<output_dir>/
├── text/              # .txt
├── json/              # .json
├── md/                # .md
└── zips/              # .zip (optional)
    └── manifest.json, summary.json, outputs/, assets/

See Formats Guide


🛠 How It Works

  1. 🔗 Clone repo (local or GitHub, with branch/subdir support)
  2. 🌲 Walk repo with filtering and MIME rules
  3. 📑 Classify TEXTUAL vs NON-TEXTUAL
  4. 📝 Format output to .txt, .json, .md
  5. 📦 Bundle ZIP with summary + manifest (optional)
  6. 🧹 Clean temp state after scan
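
The walk, classify, and format steps above can be approximated for a local folder in a few lines. This is a simplified sketch under stated assumptions, not the Gittxt pipeline: it treats any file that decodes as UTF-8 as textual, and the names `scan_directory`, `to_json`, and `EXCLUDED_DIRS` are hypothetical.

```python
import json
import os

# Illustrative exclusions; Gittxt's real defaults may differ.
EXCLUDED_DIRS = {".git", "node_modules", "__pycache__"}


def scan_directory(root, size_limit=None):
    """Walk a local folder and return records for files that decode as UTF-8."""
    records = []
    for current, dirs, files in os.walk(root):
        # Prune excluded folders in place so os.walk does not descend into them.
        dirs[:] = sorted(d for d in dirs if d not in EXCLUDED_DIRS)
        for name in sorted(files):
            path = os.path.join(current, name)
            if size_limit is not None and os.path.getsize(path) > size_limit:
                continue
            with open(path, "rb") as handle:
                raw = handle.read()
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError:
                continue  # treated as non-textual in this sketch
            relative = os.path.relpath(path, root).replace(os.sep, "/")
            records.append({"path": relative, "size": len(raw), "content": text})
    return records


def to_json(records):
    """Format the records as a JSON document."""
    return json.dumps({"files": records}, indent=2)
```

Cloning, ZIP bundling, and temp cleanup are left out on purpose; the point is only to show how filtering and classification shape what reaches the output stage.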

🧰 Gittxt Installer

Run the interactive installer to configure Gittxt preferences:

gittxt config install

This command lets you:

  • Set default output directory and formats (txt/json/md)
  • Configure log level (DEBUG, INFO, WARNING, ERROR)
  • Enable or disable automatic ZIP bundling
  • Define or override:
    • Textual extensions (e.g. .py, .md)
    • Non-textual extensions (e.g. .png, .zip)
    • Excluded directories (e.g. .git, node_modules)

The config is saved to gittxt-config.json and used as default for all scans.


📄 Configuration

Gittxt can be configured through:

  • CLI flags (e.g., --output-dir, --size-limit)
  • Environment variables (e.g., GITTXT_OUTPUT_DIR)
  • .gittxtignore file support for exclusions
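
When several configuration sources coexist, a common convention is that CLI flags override environment variables, which override the saved config file. The source does not state Gittxt's actual precedence, so the order below is an assumption, as are the helper name `resolve_setting` and the config key `output_dir`. Only `GITTXT_OUTPUT_DIR` and `--output-dir` come from the list above.

```python
import os


def resolve_setting(cli_value, env_name, config, key, default, environ=None):
    """Pick a setting: CLI value, then environment, then config file, then default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    if env_name in environ:
        return environ[env_name]
    return config.get(key, default)


# Example with an explicit environment mapping so the result is reproducible.
saved_config = {"output_dir": "~/Gittxt"}  # hypothetical config contents
chosen = resolve_setting(
    cli_value=None,
    env_name="GITTXT_OUTPUT_DIR",
    config=saved_config,
    key="output_dir",
    default="./outputs",
    environ={"GITTXT_OUTPUT_DIR": "/tmp/gittxt"},
)
```

Here `chosen` is taken from the environment mapping because no CLI value was given. Accepting `environ` as a parameter keeps the function easy to test without touching the real process environment.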

Config details → docs/CONFIGURATION.md


🔐 Security Policy

Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines


🤝 Contributing

We welcome contributions from the community!


🛣️ Roadmap

  • ✅ Async file scanning
  • ✅ ZIP archive export with manifest
  • ✅ Lite mode output
  • ⏳ AI-powered summaries (GPT, Claude)
  • ⏳ YAML + CSV output support
  • ⏳ Web UI via FastAPI

📄 License

MIT License © Sandeep Paidipati


Release v1.7.2

06 Apr 06:59

🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling

📝 Gittxt: Get text from Git repositories in AI-ready formats


✨ What is Gittxt?

Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.

With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:

  • 🔍 Curating LLM training data from source code
  • 🗃️ Converting repos into structured .txt, .json, .md, and .zip outputs
  • 📑 Extracting docs, comments, and markdown files from large monorepos
  • 🧠 Analyzing repositories by token counts, file size, and content types
  • 📦 Bundling outputs for reproducibility and downstream pipelines

It supports both local folders and GitHub URLs with branch/subdir targeting.


🚀 Features

  • Dynamic File-Type Filtering (extension + MIME + content heuristics)
  • Smart Directory Tree Summaries with depth and exclude support
  • Multiple Output Formats: .txt, .json, .md, .zip
  • Lite Mode (--lite) for fast, minimal reports
  • ZIP Bundling with --zip, including summary.json, manifest.json, and assets
  • Rich Summary Tables with size, token, and type breakdowns
  • .gittxtignore support for repo-specific exclusions
  • Async File I/O for efficient scanning
  • Reverse Engineering (gittxt re) to reconstruct repositories from reports
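
The exact pattern syntax of .gittxtignore is not described here, so the sketch below assumes a simplified gitignore-like format: one glob per line, `#` for comments, and a trailing `/` for folders. Both function names are hypothetical, and real gitignore semantics (negation, anchoring) are deliberately not covered.

```python
from fnmatch import fnmatchcase


def parse_ignore_lines(text):
    """Return the non-empty, non-comment patterns from an ignore file."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(relative_path, patterns):
    """Check a POSIX-style relative path against the simplified patterns."""
    parts = relative_path.split("/")
    for pattern in patterns:
        if pattern.endswith("/"):
            # Folder pattern: matches if any parent folder has that name.
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif fnmatchcase(relative_path, pattern) or fnmatchcase(parts[-1], pattern):
            return True
    return False
```

For example, with the patterns `*.log` and `build/`, the path `src/app.log` and everything under `build/` would be skipped, while `src/main.py` would be kept.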

🏗️ Installation

🐍 Using pip (stable)

pip install gittxt

📦 Using Poetry

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional Gittxt setup
poetry run gittxt config install

⚙️ Quickstart Example

# Scan and bundle
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite

# Reverse engineer from report
gittxt re exports/gittxt_summary.txt

👉 This will:

  • Scan the repository root
  • Output .txt and .json summary files
  • Bundle outputs in a ZIP with manifest and summary
  • Reconstruct original files and structure from a Gittxt report

More examples → Usage Examples


🖥️ CLI Usage

gittxt scan [OPTIONS] [REPOS]...

📦 Scan directories or GitHub repos (textual only).

Options

Option                                   Description
-x, --exclude-dir                        Exclude folder paths
-o, --output-dir PATH                    Custom output directory
-f, --output-format TEXT                 Comma-separated: txt, json, md
-i, --include-patterns TEXT              Glob to include (only textual)
-e, --exclude-patterns TEXT              Glob to exclude
--zip                                    Create a ZIP bundle
--lite                                   Generate minimal output instead of full content
--sync                                   Opt-in to .gitignore usage
--size-limit INTEGER                     Max file size in bytes
--branch TEXT                            Git branch for remote repos
--tree-depth INTEGER                     Limit tree output to N levels
--log-level [debug|info|warning|error]   Set log verbosity level
--help                                   Show CLI help and exit

Run gittxt scan --help for the full reference.


Reverse Engineer Command

gittxt re [OPTIONS] REPORT_FILE

🔄 Reconstruct original files and structure from Gittxt .txt, .md, or .json reports. Outputs a ZIP with recovered content.

Options

Option            Description
-o, --output-dir  Custom output directory for reconstructed files

Example Usage

gittxt re path/to/report.txt

This will:

  • Take a Gittxt-generated report (.txt, .md, or .json)
  • Reconstruct the original file structure as a ZIP archive
  • Save the ZIP to the specified output directory, or to the current directory by default

📘 Learn more → Reverse Engineering Guide


📦 Output Formats

Each scan produces structured outputs:

<output_dir>/
├── text/              # .txt
├── json/              # .json
├── md/                # .md
└── zips/              # .zip (optional)
    └── manifest.json, summary.json, outputs/, assets/

See Formats Guide


🛠 How It Works

  1. 🔗 Clone repo (local or GitHub, with branch/subdir support)
  2. 🌲 Walk repo with filtering and MIME rules
  3. 📑 Classify TEXTUAL vs NON-TEXTUAL
  4. 📝 Format output to .txt, .json, .md
  5. 📦 Bundle ZIP with summary + manifest (optional)
  6. 🧹 Clean temp state after scan

🧰 Gittxt Installer

Run the interactive installer to configure Gittxt preferences:

gittxt config install

This command lets you:

  • Set default output directory and formats (txt/json/md)
  • Configure log level (DEBUG, INFO, WARNING, ERROR)
  • Enable or disable automatic ZIP bundling
  • Define or override:
    • Textual extensions (e.g. .py, .md)
    • Non-textual extensions (e.g. .png, .zip)
    • Excluded directories (e.g. .git, node_modules)

The config is saved to gittxt-config.json and used as default for all scans.


📄 Configuration

Gittxt can be configured through:

  • CLI flags (e.g., --output-dir, --size-limit)
  • Environment variables (e.g., GITTXT_OUTPUT_DIR)
  • .gittxtignore file support for exclusions

Config details → docs/CONFIGURATION.md


🔐 Security Policy

Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines


🤝 Contributing

We welcome contributions from the community!


🛣️ Roadmap

  • ✅ Async file scanning
  • ✅ ZIP archive export with manifest
  • ✅ Lite mode output
  • ⏳ AI-powered summaries (GPT, Claude)
  • ⏳ YAML + CSV output support
  • ⏳ Web UI via FastAPI

📄 License

MIT License © Sandeep Paidipati


Release v1.7.0

03 Apr 14:43

🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling

📝 Gittxt: Get text from Git repositories in AI-ready formats


✨ What is Gittxt?

Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.

With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:

  • 🔍 Curating LLM training data from source code
  • 🗃️ Converting repos into structured .txt, .json, .md, and .zip outputs
  • 📑 Extracting docs, comments, and markdown files from large monorepos
  • 🧠 Analyzing repositories by token counts, file size, and content types
  • 📦 Bundling outputs for reproducibility and downstream pipelines

It supports both local folders and GitHub URLs with branch/subdir targeting.


🚀 Features

  • Dynamic File-Type Filtering (extension + MIME + content heuristics)
  • Smart Directory Tree Summaries with depth and exclude support
  • Multiple Output Formats: .txt, .json, .md, .zip
  • Lite Mode (--lite) for fast, minimal reports
  • ZIP Bundling with --zip, including summary.json, manifest.json, and assets
  • Rich Summary Tables with size, token, and type breakdowns
  • .gittxtignore support for repo-specific exclusions
  • Async File I/O for efficient scanning
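
A depth-limited directory tree with exclusions, as mentioned in the feature list, can be rendered recursively. The sketch below shows one way to do that for a local folder; `render_tree` and its parameters are hypothetical and only mirror the idea behind the tree summaries.

```python
import os


def render_tree(root, max_depth=None, exclude=(".git",), _prefix="", _depth=1):
    """Return tree lines for `root`, descending at most `max_depth` levels."""
    lines = []
    entries = sorted(e for e in os.listdir(root) if e not in exclude)
    for index, name in enumerate(entries):
        last = index == len(entries) - 1
        lines.append(f"{_prefix}{'└── ' if last else '├── '}{name}")
        path = os.path.join(root, name)
        # Only descend while the depth budget allows it.
        if os.path.isdir(path) and (max_depth is None or _depth < max_depth):
            child_prefix = _prefix + ("    " if last else "│   ")
            lines.extend(render_tree(path, max_depth, exclude, child_prefix, _depth + 1))
    return lines
```

Joining the result with `"\n".join(render_tree("."))` gives a printable tree. Limiting the depth keeps the summary readable for large monorepos, where a full listing can run to thousands of lines.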

🏗️ Installation

🐍 Using pip (stable)

pip install gittxt

📦 Using Poetry

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional Gittxt setup
poetry run gittxt config install

⚙️ Quickstart Example

gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite

👉 This will:

  • Scan the repository root
  • Output .txt and .json summary files
  • Bundle outputs in a ZIP with manifest and summary

More examples → Usage Examples


🖥️ CLI Usage

gittxt scan [OPTIONS] [REPOS]...

📦 Scan directories or GitHub repos (textual only).

Options

Option                                   Description
-x, --exclude-dir                        Exclude folder paths
-o, --output-dir PATH                    Custom output directory
-f, --output-format TEXT                 Comma-separated: txt, json, md
-i, --include-patterns TEXT              Glob to include (only textual)
-e, --exclude-patterns TEXT              Glob to exclude
--zip                                    Create a ZIP bundle
--lite                                   Generate minimal output instead of full content
--sync                                   Opt-in to .gitignore usage
--size-limit INTEGER                     Max file size in bytes
--branch TEXT                            Git branch for remote repos
--tree-depth INTEGER                     Limit tree output to N levels
--log-level [debug|info|warning|error]   Set log verbosity level
--help                                   Show CLI help and exit

Run gittxt scan --help for the full reference.


📦 Output Formats

Each scan produces structured outputs:

<output_dir>/
├── text/              # .txt
├── json/              # .json
├── md/                # .md
└── zips/              # .zip (optional)
    └── manifest.json, summary.json, outputs/, assets/

See Formats Guide


🛠 How It Works

  1. 🔗 Clone repo (local or GitHub, with branch/subdir support)
  2. 🌲 Walk repo with filtering and MIME rules
  3. 📑 Classify TEXTUAL vs NON-TEXTUAL
  4. 📝 Format output to .txt, .json, .md
  5. 📦 Bundle ZIP with summary + manifest (optional)
  6. 🧹 Clean temp state after scan

🧰 Gittxt Installer

Run the interactive installer to configure Gittxt preferences:

gittxt config install

This command lets you:

  • Set default output directory and formats (txt/json/md)
  • Configure log level (DEBUG, INFO, WARNING, ERROR)
  • Enable or disable automatic ZIP bundling
  • Define or override:
    • Textual extensions (e.g. .py, .md)
    • Non-textual extensions (e.g. .png, .zip)
    • Excluded directories (e.g. .git, node_modules)

The config is saved to gittxt-config.json and used as default for all scans.


📄 Configuration

Gittxt can be configured through:

  • CLI flags (e.g., --output-dir, --size-limit)
  • Environment variables (e.g., GITTXT_OUTPUT_DIR)
  • .gittxtignore file support for exclusions

Config details → docs/CONFIGURATION.md


🔐 Security Policy

Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines


🤝 Contributing

We welcome contributions from the community!


🛣️ Roadmap

  • ✅ Async file scanning
  • ✅ ZIP archive export with manifest
  • ✅ Lite mode output
  • ⏳ AI-powered summaries (GPT, Claude)
  • ⏳ YAML + CSV output support
  • ⏳ Web UI via FastAPI

📄 License

MIT License © Sandeep Paidipati


Release v1.6.0

01 Apr 07:31

🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling

📝 Gittxt: Get text from Git repositories in AI-ready formats


✨ What is Gittxt?

Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.

With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:

  • 🔍 Curating LLM training data from source code
  • 🗃️ Converting repos into structured .txt, .json, .md, and .zip outputs
  • 📑 Extracting docs, comments, and markdown files from large monorepos
  • 🧠 Analyzing repositories by token counts, file size, and content types
  • 📦 Bundling outputs for reproducibility and downstream pipelines

It supports both local folders and GitHub URLs with branch/subdir targeting.


🚀 Features

  • Dynamic File-Type Filtering (extension + MIME + content heuristics)
  • Smart Directory Tree Summaries with depth and exclude support
  • Multiple Output Formats: .txt, .json, .md, .zip
  • Lite Mode (--lite) for fast, minimal reports
  • ZIP Bundling with --zip, including summary.json, manifest.json, and assets
  • Rich Summary Tables with size, token, and type breakdowns
  • .gittxtignore support for repo-specific exclusions
  • Async File I/O for efficient scanning
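
A summary table with size, token, and type breakdowns boils down to grouping per-file statistics. The sketch below groups by file extension and counts whitespace-separated words as a crude stand-in for tokens; that token measure and the `summarize` helper are assumptions, not how Gittxt computes its numbers.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def summarize(files):
    """Aggregate (path, text) pairs into per-extension file, byte, and token counts."""
    table = defaultdict(lambda: {"files": 0, "bytes": 0, "tokens": 0})
    for path, text in files:
        extension = PurePosixPath(path).suffix.lower() or "(none)"
        row = table[extension]
        row["files"] += 1
        row["bytes"] += len(text.encode("utf-8"))
        # Whitespace split is only a rough proxy for LLM tokens.
        row["tokens"] += len(text.split())
    return dict(table)
```

Real tokenizers split text into subword units, so actual LLM token counts are usually higher than a plain word count; the grouping logic stays the same whichever counter is plugged in.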

🏗️ Installation

📦 Using Poetry

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional setup
poetry run gittxt install

🐍 Using pip (stable)

pip install gittxt

⚙️ Quickstart Example

gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite

👉 This will:

  • Scan the repository root
  • Output .txt and .json summary files
  • Bundle outputs in a ZIP with manifest and summary

More examples → Usage Examples


🖥️ CLI Usage

gittxt scan [OPTIONS] [REPOS]...

📦 Scan directories or GitHub repos (textual only).

Options

Option                                   Description
-x, --exclude-dir                        Exclude folder paths
-o, --output-dir PATH                    Custom output directory
-f, --output-format TEXT                 Comma-separated: txt, json, md
-i, --include-patterns TEXT              Glob to include (only textual)
-e, --exclude-patterns TEXT              Glob to exclude
--zip                                    Create a ZIP bundle
--lite                                   Generate minimal output instead of full content
--sync                                   Opt-in to .gitignore usage
--size-limit INTEGER                     Max file size in bytes
--branch TEXT                            Git branch for remote repos
--tree-depth INTEGER                     Limit tree output to N levels
--log-level [debug|info|warning|error]   Set log verbosity level
--help                                   Show CLI help and exit

Run gittxt scan --help for the full reference.


📦 Output Formats

Each scan produces structured outputs:

<output_dir>/
├── text/              # .txt
├── json/              # .json
├── md/                # .md
└── zips/              # .zip (optional)
    └── manifest.json, summary.json, outputs/, assets/

See Formats Guide


🛠 How It Works

  1. 🔗 Clone repo (local or GitHub, with branch/subdir support)
  2. 🌲 Walk repo with filtering and MIME rules
  3. 📑 Classify TEXTUAL vs NON-TEXTUAL
  4. 📝 Format output to .txt, .json, .md
  5. 📦 Bundle ZIP with summary + manifest (optional)
  6. 🧹 Clean temp state after scan

🧰 Gittxt Installer

Run the interactive installer to configure Gittxt preferences:

gittxt install

This command lets you:

  • Set default output directory and formats (txt/json/md)
  • Configure log level (DEBUG, INFO, WARNING, ERROR)
  • Enable or disable automatic ZIP bundling
  • Define or override:
    • Textual extensions (e.g. .py, .md)
    • Non-textual extensions (e.g. .png, .zip)
    • Excluded directories (e.g. .git, node_modules)

The config is saved to gittxt-config.json and used as default for all scans.


📄 Configuration

Gittxt can be configured through:

  • CLI flags (e.g., --output-dir, --size-limit)
  • Environment variables (e.g., GITTXT_OUTPUT_DIR)
  • .gittxtignore file support for exclusions

Config details → docs/CONFIGURATION.md


🔐 Security Policy

Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines


🤝 Contributing

We welcome contributions from the community!


🛣️ Roadmap

  • ✅ Async file scanning
  • ✅ ZIP archive export with manifest
  • ✅ Lite mode output
  • ⏳ AI-powered summaries (GPT, Claude)
  • ⏳ YAML + CSV output support
  • ⏳ Web UI via FastAPI

📄 License

MIT License © Sandeep Paidipati


Release v1.5.9

01 Apr 04:08

Release v1.5.9

Release v1.5.8

01 Apr 03:36

test 1.5.8

Release v1.5.0

17 Mar 18:19

🚀 LLM Dataset Extractor from GitHub Repos | AI & NLP-ready text pipelines

📝 Gittxt: Get text from Git repositories in AI-ready formats.


✨ What is Gittxt?

Gittxt is a developer-focused CLI tool that extracts AI-ready text from Git repositories. Whether you're preparing datasets for AI models, NLP pipelines, or LLM fine-tuning, Gittxt automates the tedious task of repository scanning and text conversion.

Built with speed, flexibility, and modularity in mind, Gittxt is ideal for:

  • Preparing training data for LLMs (e.g., ChatGPT, Claude, Mistral)
  • Documentation extraction for knowledge bases
  • Code summarization pipelines
  • Repository analysis for machine learning workflows

🚀 Features

  • Dynamic File-Type Filtering (--file-types=code,docs,images,csv,media,all)
  • Automatic Tree Generation with clean filtering (excludes .git/, __pycache__, etc.)
  • Multiple Output Formats: TXT, JSON, Markdown
  • Optional ZIP Packaging for non-text assets
  • CLI-friendly Progress Bars
  • Built-in Summary Reports (--summary)
  • Interactive & CI-ready Modes (--non-interactive)

🏗️ Installation

📦 Using Poetry

git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
poetry run gittxt install

🐍 Using pip (stable)

pip install gittxt

⚙️ Quickstart Example

gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --file-types code,docs --summary

👉 This will:

  • Scan a GitHub repository
  • Extract code & docs files
  • Output .txt + .json summaries
  • Show a summary report

🖥️ CLI Usage

gittxt scan [REPOS]... [OPTIONS]

Options:
  --include TEXT        Include patterns (e.g., *.py)
  --exclude TEXT        Exclude patterns (e.g., tests/, node_modules)
  --size-limit INTEGER  Max file size in bytes
  --branch TEXT         Specify branch (for GitHub URLs)
  --file-types TEXT     code, docs, images, csv, media, all
  --output-format TEXT  txt, json, md, or comma-separated list
  --output-dir PATH     Custom output directory
  --summary             Show post-scan summary
  --non-interactive     Skip prompts for CI/CD workflows
  --progress            Enable scan progress bars
  --debug               Enable debug logs
  --help                Show this message and exit

📂 Output Structure

<output_dir>/
├── text/
│   └── repo-name.txt
├── json/
│   └── repo-name.json
├── md/
│   └── repo-name.md
└── zips/
    └── repo-name_bundle.zip  # Optional ZIP for assets (images, csv, etc.)

🛠 How It Works

  1. 🔗 Clone GitHub/local repo (supports branch/subdir URLs)
  2. 🌳 Dynamically generate directory tree (excluding .git, __pycache__, etc.)
  3. 🗂️ Filter files based on type (code, docs, csv, media)
  4. 📝 Generate formatted outputs (TXT, JSON, MD)
  5. 📦 Package assets (optional ZIP for non-text)
  6. 🧹 Cleanup temporary files (cache-free design)

📊 Example Summary Output

📊 Summary Report:
 - Total files processed: 45
 - Output formats: txt, json
 - File type breakdown: {'code': 31, 'docs': 14}
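
A breakdown like the one in this report can be produced by mapping each file to a category and counting. The sketch below assumes a small, illustrative extension-to-category map; the real mapping behind `--file-types` is not shown in these notes, and `breakdown` is a hypothetical helper.

```python
from collections import Counter
from pathlib import PurePosixPath

# Illustrative mapping only; Gittxt's actual category rules are not documented here.
CATEGORY_BY_EXTENSION = {
    ".py": "code",
    ".js": "code",
    ".md": "docs",
    ".rst": "docs",
    ".png": "images",
    ".csv": "csv",
    ".mp4": "media",
}


def breakdown(paths, wanted=("code", "docs")):
    """Count files per category, keeping only the requested categories."""
    counts = Counter()
    for path in paths:
        category = CATEGORY_BY_EXTENSION.get(PurePosixPath(path).suffix.lower())
        if category in wanted:
            counts[category] += 1
    return dict(counts)
```

Files with unknown extensions are simply left out of the count in this sketch, which mirrors how a type filter such as `code,docs` narrows the scan to the requested categories.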

🔐 Security Policy

Please report security issues to: sandeep.paidipati@gmail.com
View Security Policy


🤝 Contributing

We welcome community contributions!


🛣️ Roadmap

  • FastAPI-powered web UI
  • AI-powered summaries (GPT/OpenAI integration)
  • Support YAML/CSV as additional output formats
  • Async file scanning (speed boost)

📄 License

MIT License © Sandeep Paidipati


Gittxt — “Gittxt: Get text from Git repositories in AI-ready formats.”


Release v1.4.1

12 Mar 16:07

🚀 Gittxt: Get Text of Your Repo for AI, LLMs & Docs!

Gittxt is a lightweight CLI tool that extracts text from Git repositories and formats it into AI-friendly outputs (.txt, .json, .md). Whether you’re using ChatGPT, Grok, Ollama or any LLM, Gittxt helps you process repositories for insights, training, and documentation.


✨ Why Use Gittxt?

  • Extract Readable Text: Easily pull text from code, docs, and other repository files.
  • AI-Friendly Outputs: Generate outputs in TXT, JSON, and Markdown for different use cases.
  • Efficient Processing: Faster scanning with incremental caching.
  • Flexible Filtering: Use advanced flags like --docs-only and --auto-filter to control what’s extracted.
  • Multi-Repository Support: Scan one or more repositories in a single command.

🆕 Release v1.4.1

New Features & Enhancements

  • Interactive Installation:
    Use the new gittxt install subcommand to set up your configuration (output directory, logging preferences, etc.) interactively.

  • Multi-Repository Scanning:
    Scan multiple repositories at once, whether they are local or remote.

  • Advanced Filtering Options:
    • --docs-only: Extract only documentation files (e.g., README, docs/ folder).
    • --auto-filter: Automatically skip common unwanted or binary files.

  • Multi-Format Output:
    Specify multiple output formats simultaneously (e.g., --output-format txt,json,md).

  • Enhanced Summary Reports:
    Outputs include summary statistics and an estimated token count for further AI processing.

  • Improved Logging & Caching:
    Faster, more accurate scanning with incremental caching and a rotating log file system.

  • Improved Token Estimation:
    Enhanced token counting algorithm with better accuracy for LLM processing, including support for CamelCase, special characters, and subword tokenization patterns.
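
The release notes do not show the estimation algorithm itself, so the following is only a sketch of the general idea: split CamelCase words, digit runs, and special characters into separate pieces and count them. The regular expression and the name `estimate_tokens` are assumptions, and the result is a rough estimate rather than a real tokenizer count.

```python
import re

# Pieces, in priority order: an acronym run, a capitalized or lowercase word,
# a run of digits, or any single non-alphanumeric, non-space character.
_PIECE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+|[^\sA-Za-z0-9]")


def estimate_tokens(text):
    """Estimate a token count by splitting identifiers and punctuation."""
    return len(_PIECE.findall(text))
```

For instance, `parseHTTPResponse(x)` splits into `parse`, `HTTP`, `Response`, `(`, `x`, and `)`, which gives an estimate of 6. Source code tends to be dense in identifiers and punctuation, which is why a plain whitespace count usually underestimates it.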


📥 Installation

Via PIP

pip install gittxt==1.4.1

First-Time Setup (Interactive)

After installing, run:

gittxt install

This command will prompt you to configure:

  • Your default output directory (automatically set based on your OS, e.g., ~/Gittxt/ on Linux/Mac)
  • Logging level and file logging preferences

📌 How to Use Gittxt

1. Scanning Repositories

Use the scan subcommand to extract text and generate outputs.

Scan a Local Repository

gittxt scan .

Extracts all readable text into the default output directories.

Scan a Remote GitHub Repository

gittxt scan https://github.com/sandy-sp/sandy-sp

Automatically clones the repository, scans it, and extracts text.

Scan Multiple Repositories with Advanced Options

gittxt scan /path/to/repo1 https://github.com/user/repo2 --output-format txt,json --docs-only --auto-filter --summary

🔧 CLI Options

Option           Description
--include        Include only files matching these patterns.
--exclude        Exclude files matching these patterns.
--size-limit     Exclude files larger than the specified size (in bytes).
--branch         Specify a Git branch (for remote repositories).
--output-dir     Override the default output directory.
--output-format  Comma-separated list of output formats (e.g., txt,json,md).
--max-lines      Limit the number of lines per file.
--summary        Display a summary report after scanning.
--debug          Enable debug mode for detailed logging.
--docs-only      Only extract documentation files (e.g., README, docs folder).
--auto-filter    Automatically skip common unwanted or binary files.

📄 Output Formats

  • TXT: Simple text extraction for AI chat and quick analysis.
  • JSON: Structured output ideal for LLM training and data preprocessing.
  • Markdown (MD): Neatly formatted documentation for GitHub or project READMEs.

When specifying multiple formats (e.g., --output-format txt,json), Gittxt generates separate files in their respective output directories.


🗂 Directory Structure

By default, outputs are stored in your configured output directory, which is organized as follows:

<output_dir>/
  ├── text/    # Plain text outputs (.txt)
  ├── json/    # JSON outputs (.json)
  ├── md/      # Markdown outputs (.md)
  └── cache/   # Caching for incremental scans
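
The cache folder exists to support incremental scans. How Gittxt stores its cache is not described here, so the sketch below illustrates one common approach under that caveat: remember a content hash per file and reprocess only files whose hash changed. The function name and the in-memory cache layout are hypothetical.

```python
import hashlib


def changed_files(current, cache):
    """Return paths whose content hash differs from the cache, updating the cache.

    `current` maps relative paths to file bytes; `cache` maps paths to
    SHA-256 hex digests and is modified in place.
    """
    changed = []
    for path, data in sorted(current.items()):
        digest = hashlib.sha256(data).hexdigest()
        if cache.get(path) != digest:
            changed.append(path)
            cache[path] = digest
    return changed
```

On a first run every file counts as changed; on later runs only edited or new files are returned, which is what makes repeated scans of a large repository faster. Persisting `cache` as JSON between runs would be a natural next step.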

⚙️ Configuration

Gittxt uses a configuration file (gittxt-config.json) to store user preferences. You can update this configuration via the interactive install command:

gittxt install

Or edit the file manually. Key settings include:

  • Output Directory: Auto-determined based on your OS (e.g., ~/Gittxt/).
  • Logging Options: Logging level and file logging preferences.
  • Filtering Options: Include/exclude patterns, file size limits, etc.

📌 Contribute & Develop

  1. Run Tests:
    pytest tests/
  2. Format Code:
    black src/
  3. Submit a PR:
    • Fork the repo.
    • Create a new branch (e.g., feature/my-change).
    • Push your changes.
    • Submit a PR.

For more details, see the Contributing Guide.


💡 Future Roadmap

Our future plans include enhancements to the user interface and further AI-based features. We’re working on a lightweight web-based UI and additional improvements that streamline repository analysis and documentation extraction.


📜 License

Gittxt is licensed under the MIT License.


Made by Sandeep Paidipati
