Releases: sandy-sp/gittxt
Release v1.7.7
🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries, reverse engineering & bundling
🚀 Gittxt: Get Text from Git — Optimized for AI
Gittxt is an open-source tool that transforms GitHub repositories into LLM-compatible datasets.
Perfect for developers, data scientists, and AI engineers, Gittxt helps you extract and structure .txt, .json, .md content into clean, analyzable formats for use in:
- Prompt engineering
- Fine-tuning & retrieval
- Codebase summarization
- Open-source LLM workflows
💡 Why Gittxt?
Large Language Models often expect input in very specific formats. Many tools (e.g., ChatGPT, Gemini, Ollama) struggle with arbitrary GitHub URLs, complex folders, or non-text assets.
Gittxt bridges this gap by:
- Extracting all usable text from a repo
- Organizing it for easy ingestion by LLMs
- Offering structured .txt, .json, .md, .zip outputs
- Giving you full control with filtering, formatting, and plugin support
✨ Features at a Glance
- ✅ Text extractor for code, docs, config files
- ✅ Output: .txt, .json, .md, .zip
- ✅ CLI and plugin system (FastAPI, Streamlit)
- ✅ AI-ready summaries (OpenAI / Ollama)
- ✅ Reverse engineer .txt/.json reports back into repo structure
- ✅ .gittxtignore support
- ✅ Async scanning for large projects
- ✅ Works offline and in constrained compute environments
📁 Output Types
outputs/
├── txt/ # Plain text report
├── json/ # Structured metadata
├── md/ # Markdown-formatted summary
└── zip/ # Bundled results + manifest
🚀 Quickstart
Install
pip install gittxt
Run your first scan
gittxt scan https://github.com/sandy-sp/gittxt --output-format txt,json --lite --zip
Reverse engineer a summary
gittxt re outputs/project.md -o ./restored
🌐 Explore the Visual Web App
Try the hosted version (no install required!)
📈 Gittxt for AI Workflows
- Use it to build structured input for LLMs
- Ideal for prompt chaining, document agents, code summarization
- Helps transform messy repos into single-file, AI-consumable reports
📖 Full Documentation
All CLI flags, plugins, formats, and filters are documented here:
🔧 Plugin Support
Gittxt supports modular plugins:
- gittxt-api: Run via FastAPI backend
- gittxt-streamlit: Interactive dashboard
Install & run with:
gittxt plugin install gittxt-streamlit
gittxt plugin run gittxt-streamlit
🧠 Built for Developers & AI Engineers
Created by Sandeep Paidipati, Gittxt was born out of a need to:
- Quickly preview and summarize GitHub repos with LLMs
- Avoid manual copying, filtering, and converting files
- Create AI-ready datasets for learning and experimentation
🙏 Support the Project
- ⭐️ Star this repo if it helped you
- 🧵 Share it with your dev/AI community
- 🤝 Contact me for collaboration or sponsorship
🔒 License
MIT License © Sandeep Paidipati
Gittxt — Get Text from Git — Optimized for AI
Release v1.7.5
🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries, reverse engineering & bundling
📝 Gittxt: Get text from Git repositories in AI-ready formats
✨ What is Gittxt?
Gittxt is a powerful CLI and plugin framework that extracts structured text and metadata from Git repositories. It’s designed to help you build AI-ready datasets, analyze large codebases, and even reverse engineer report outputs.
Use it for:
- 🔍 Curating datasets from code and documentation
- 🗃️ Generating .txt, .json, .md, and .zip bundles
- 📑 Extracting and classifying technical files by sub-type
- 🧠 Analyzing size, token count, and file types
- 🔄 Reconstructing full project trees from summary reports
🚀 Features
- ✅ File-Type Detection (extension, MIME, content heuristic)
- ✅ .gittxtignore Support (with --sync)
- ✅ Subcategory Classification (docs, config, code, etc.)
- ✅ Async File I/O for scalable performance
- ✅ Lite Mode for minimal outputs (--lite)
- ✅ Bundled ZIPs (--zip) with manifest, summary, README
- ✅ Reverse Engineering from .txt, .md, .json reports
- ✅ Plugin System: gittxt-api, gittxt-streamlit, etc.
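The three detection signals named in the feature list (extension, MIME, content heuristic) can be illustrated with a short Python sketch. This is not Gittxt's actual implementation: the function name `looks_textual`, the `TEXT_EXTENSIONS` set, and the order of the checks are assumptions made for illustration.

```python
import codecs
import mimetypes
from pathlib import Path

# Hypothetical extension list for illustration; not Gittxt's actual defaults.
TEXT_EXTENSIONS = {".py", ".md", ".txt", ".json", ".toml", ".yaml", ".yml"}


def looks_textual(path: Path, sample_size: int = 4096) -> bool:
    """Return True if the file is probably text, checking three signals in order."""
    # 1. Extension: cheap and usually decisive.
    if path.suffix.lower() in TEXT_EXTENSIONS:
        return True
    # 2. MIME type guessed from the file name alone.
    mime, _ = mimetypes.guess_type(path.name)
    if mime is not None and mime.startswith("text/"):
        return True
    # 3. Content heuristic: binary files usually contain NUL bytes
    #    or fail to decode as UTF-8.
    with path.open("rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        return False
    try:
        # The incremental decoder tolerates a multi-byte character
        # that was cut off at the end of the sample.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

Checking the extension first keeps the common case cheap; the file is only opened when the name alone is inconclusive.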
🏗️ Installation
pip install gittxt
Or for development:
git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
poetry run gittxt config install # Optional installer
⚙️ Quickstart
gittxt scan https://github.com/sandy-sp/gittxt --output-format txt,json --zip --lite
gittxt re outputs/gittxt_summary.json
🖥️ CLI Commands
gittxt scan [OPTIONS] [REPOS]...
gittxt config [SUBCOMMANDS]
gittxt clean [--output-dir]
gittxt re REPORT_FILE [--output-dir]
gittxt plugin [list|install|run|uninstall]
🔌 Plugin System
gittxt plugin list
gittxt plugin install gittxt-api
gittxt plugin run gittxt-api
Plugins include:
- 🧪 gittxt-api: FastAPI backend for scanning and summaries
- 🖥️ gittxt-streamlit: Interactive visual dashboard
📦 Output Formats
<output_dir>/
├── txt/
├── json/
├── md/
├── zip/
│ ├── summary.json
│ ├── manifest.json
│ ├── outputs/
│ └── assets/
🔄 Reverse Engineer
gittxt re report.txt -o ./restored
This recreates the original file structure in a ZIP from Gittxt .txt, .md, or .json reports.
📚 Documentation
Docs are now organized in a full Docs site with:
- ✅ Getting Started
- ✅ CLI Reference
- ✅ API Endpoints
- ✅ Reverse Engineering
- ✅ Developer & Contributor Guide
🛣️ Roadmap
- ✅ Plugin framework with API/Streamlit
- ✅ Reverse from Gittxt reports
- ⏳ AI-powered summaries
- ⏳ Live web UI
🤝 Contributing
make lint # Code style
make test # Run CLI + API tests
🛡️ License
MIT License © Sandeep Paidipati
Gittxt — Get text from Git repositories in AI-ready formats.
Release v1.7.3
🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling
📝 Gittxt: Get text from Git repositories in AI-ready formats
✨ What is Gittxt?
Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.
With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:
- 🔍 Curating LLM training data from source code
- 🗃️ Converting repos into structured .txt, .json, .md, and .zip outputs
- 📑 Extracting docs, comments, and markdown files from large monorepos
- 🧠 Analyzing repositories by token counts, file size, and content types
- 📦 Bundling outputs for reproducibility and downstream pipelines
It supports both local folders and GitHub URLs with branch/subdir targeting.
🚀 Features
- ✅ Dynamic File-Type Filtering (extension + MIME + content heuristics)
- ✅ Smart Directory Tree Summaries with depth and exclude support
- ✅ Multiple Output Formats: .txt, .json, .md, .zip
- ✅ Lite Mode (--lite) for fast, minimal reports
- ✅ ZIP Bundling with --zip, including summary.json, manifest.json, and assets
- ✅ Rich Summary Tables with size, token, and type breakdowns
- ✅ .gittxtignore support for repo-specific exclusions
- ✅ Async File I/O for efficient scanning
- ✅ Reverse Engineering (gittxt re) to reconstruct repositories from reports
🏗️ Installation
🐍 Using pip (stable)
pip install gittxt
📦 Using Poetry
git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional Gittxt setup
poetry run gittxt install
⚙️ Quickstart Example
# Scan and bundle
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite
# Reverse engineer from report
gittxt re exports/gittxt_summary.txt
👉 This will:
- Scan the repository root
- Output .txt and .json summary files
- Bundle outputs in a ZIP with manifest and summary
- Reconstruct original files and structure from a Gittxt report
More examples → Usage Examples
🖥️ CLI Usage
gittxt scan [OPTIONS] [REPOS]...
📦 Scan directories or GitHub repos (textual only).
Options
| Option | Description |
|---|---|
| -x, --exclude-dir | Exclude folder paths |
| -o, --output-dir PATH | Custom output directory |
| -f, --output-format TEXT | Comma-separated: txt, json, md |
| -i, --include-patterns TEXT | Glob to include (only textual) |
| -e, --exclude-patterns TEXT | Glob to exclude |
| --zip | Create a ZIP bundle |
| --lite | Generate minimal output instead of full content |
| --sync | Opt-in to .gitignore usage |
| --size-limit INTEGER | Max file size in bytes |
| --branch TEXT | Git branch for remote repos |
| --tree-depth INTEGER | Limit tree output to N levels |
| --log-level [debug\|info\|warning\|error] | Set log verbosity level |
| --help | Show CLI help and exit |
Run gittxt scan --help for the full reference.
Reverse Engineer Command
gittxt re [OPTIONS] REPORT_FILE
🔄 Reconstruct original files and structure from Gittxt .txt, .md, or .json reports. Outputs a ZIP with recovered content.
Options
| Option | Description |
|---|---|
| -o, --output-dir | Custom output directory for reconstructed files |
Example Usage
gittxt re path/to/report.txt
This will:
- Take a Gittxt-generated report (.txt, .md, or .json)
- Reconstruct the original file structure as a ZIP archive
- Save the ZIP to the specified output directory or the current directory by default
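To make the reconstruction idea concrete, here is a minimal Python sketch that turns a parsed report into a ZIP archive. The report schema (`{"files": [{"path": ..., "content": ...}]}`) and the function name `rebuild_zip` are hypothetical; the release notes do not document the real report layout, so treat this as an illustration of the technique rather than what `gittxt re` does internally.

```python
import io
import zipfile
from pathlib import PurePosixPath


def rebuild_zip(report: dict) -> bytes:
    """Write each file recorded in a report into an in-memory ZIP archive.

    Assumes a hypothetical schema: {"files": [{"path": str, "content": str}]}.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for entry in report["files"]:
            path = PurePosixPath(entry["path"])
            # Refuse absolute paths and parent references so a crafted
            # report cannot write outside the extraction directory.
            if path.is_absolute() or ".." in path.parts:
                raise ValueError(f"unsafe path in report: {entry['path']}")
            archive.writestr(str(path), entry["content"])
    return buffer.getvalue()
```

Validating paths before writing matters whenever an archive is rebuilt from untrusted text, since a report is just data that anyone could have edited.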
📘 Learn more → Reverse Engineering Guide
📦 Output Formats
Each scan produces structured outputs:
<output_dir>/
├── text/ # .txt
├── json/ # .json
├── md/ # .md
├── zips/ # .zip (optional)
│ └── manifest.json, summary.json, outputs/, assets/
See Formats Guide
🛠 How It Works
- 🔗 Clone repo (local or GitHub, with branch/subdir support)
- 🌲 Walk repo with filtering and MIME rules
- 📑 Classify TEXTUAL vs NON-TEXTUAL
- 📝 Format output to .txt, .json, .md
- 📦 Bundle ZIP with summary + manifest (optional)
- 🧹 Clean temp state after scan
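The "walk repo with filtering" step above can be sketched in Python as follows. The `walk_repo` name and the `EXCLUDED_DIRS` set are assumptions for illustration (the directory names are taken from examples elsewhere in these notes); Gittxt's real scanner may work differently.

```python
import os
from pathlib import Path
from typing import Iterator, Optional

# Illustrative defaults; these notes list .git and node_modules as examples.
EXCLUDED_DIRS = {".git", "node_modules", "__pycache__"}


def walk_repo(root: Path, size_limit: Optional[int] = None) -> Iterator[Path]:
    """Yield files under root, skipping excluded directories and oversized files."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = sorted(d for d in dirnames if d not in EXCLUDED_DIRS)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if size_limit is not None and path.stat().st_size > size_limit:
                continue
            yield path
```

Pruning `dirnames` in place is what keeps large directories such as `node_modules` from being traversed at all, which is cheaper than filtering their files afterwards.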
🧰 Gittxt Installer
Run the interactive installer to configure Gittxt preferences:
gittxt config install
This command lets you:
- Set default output directory and formats (txt/json/md)
- Configure log level (DEBUG, INFO, WARNING, ERROR)
- Enable or disable automatic ZIP bundling
- Define or override:
  - Textual extensions (e.g. .py, .md)
  - Non-textual extensions (e.g. .png, .zip)
  - Excluded directories (e.g. .git, node_modules)
The config is saved to gittxt-config.json and used as default for all scans.
📄 Configuration
- CLI flags (e.g., --output-dir, --size-limit)
- Environment variables (e.g., GITTXT_OUTPUT_DIR)
- .gittxtignore file support for exclusions
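One common way to combine these sources is a fixed precedence order. The Python sketch below assumes CLI flag first, then the `GITTXT_OUTPUT_DIR` environment variable, then the saved config file; that order, the `output_dir` key name, and the `outputs` fallback are all assumptions, not documented Gittxt behavior.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_output_dir(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
    config_path: Path = Path("gittxt-config.json"),
) -> Path:
    """Pick the output directory: CLI flag, then env var, then config file."""
    if cli_value:
        return Path(cli_value)
    if env.get("GITTXT_OUTPUT_DIR"):
        return Path(env["GITTXT_OUTPUT_DIR"])
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        # "output_dir" is a hypothetical key name used for illustration.
        if config.get("output_dir"):
            return Path(config["output_dir"])
    return Path("outputs")  # assumed fallback, not a documented default
```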
Config details → docs/CONFIGURATION.md
🔐 Security Policy
Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines
🤝 Contributing
We welcome contributions from the community!
🛣️ Roadmap
- ✅ Async file scanning
- ✅ ZIP archive export with manifest
- ✅ Lite mode output
- ⏳ AI-powered summaries (GPT, Claude)
- ⏳ YAML + CSV output support
- ⏳ Web UI via FastAPI
📄 License
MIT License © Sandeep Paidipati
Gittxt — Get text from Git repositories in AI-ready formats.
Release v1.7.2
🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling
📝 Gittxt: Get text from Git repositories in AI-ready formats
✨ What is Gittxt?
Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.
With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:
- 🔍 Curating LLM training data from source code
- 🗃️ Converting repos into structured .txt, .json, .md, and .zip outputs
- 📑 Extracting docs, comments, and markdown files from large monorepos
- 🧠 Analyzing repositories by token counts, file size, and content types
- 📦 Bundling outputs for reproducibility and downstream pipelines
It supports both local folders and GitHub URLs with branch/subdir targeting.
🚀 Features
- ✅ Dynamic File-Type Filtering (extension + MIME + content heuristics)
- ✅ Smart Directory Tree Summaries with depth and exclude support
- ✅ Multiple Output Formats: .txt, .json, .md, .zip
- ✅ Lite Mode (--lite) for fast, minimal reports
- ✅ ZIP Bundling with --zip, including summary.json, manifest.json, and assets
- ✅ Rich Summary Tables with size, token, and type breakdowns
- ✅ .gittxtignore support for repo-specific exclusions
- ✅ Async File I/O for efficient scanning
- ✅ Reverse Engineering (gittxt re) to reconstruct repositories from reports
🏗️ Installation
🐍 Using pip (stable)
pip install gittxt
📦 Using Poetry
git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional Gittxt setup
poetry run gittxt install
⚙️ Quickstart Example
# Scan and bundle
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite
# Reverse engineer from report
gittxt re exports/gittxt_summary.txt
👉 This will:
- Scan the repository root
- Output .txt and .json summary files
- Bundle outputs in a ZIP with manifest and summary
- Reconstruct original files and structure from a Gittxt report
More examples → Usage Examples
🖥️ CLI Usage
gittxt scan [OPTIONS] [REPOS]...
📦 Scan directories or GitHub repos (textual only).
Options
| Option | Description |
|---|---|
| -x, --exclude-dir | Exclude folder paths |
| -o, --output-dir PATH | Custom output directory |
| -f, --output-format TEXT | Comma-separated: txt, json, md |
| -i, --include-patterns TEXT | Glob to include (only textual) |
| -e, --exclude-patterns TEXT | Glob to exclude |
| --zip | Create a ZIP bundle |
| --lite | Generate minimal output instead of full content |
| --sync | Opt-in to .gitignore usage |
| --size-limit INTEGER | Max file size in bytes |
| --branch TEXT | Git branch for remote repos |
| --tree-depth INTEGER | Limit tree output to N levels |
| --log-level [debug\|info\|warning\|error] | Set log verbosity level |
| --help | Show CLI help and exit |
Run gittxt scan --help for the full reference.
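As an illustration of what a depth limit such as --tree-depth implies, the following Python sketch renders a directory tree and stops descending after N levels. The `render_tree` function, its indentation style, and its ordering (directories before files) are assumptions, not Gittxt's actual tree output.

```python
from pathlib import Path
from typing import List, Optional


def render_tree(root: Path, max_depth: Optional[int] = None) -> List[str]:
    """Return indented lines for the directory tree, limited to max_depth levels."""
    lines: List[str] = []

    def visit(directory: Path, depth: int) -> None:
        if max_depth is not None and depth > max_depth:
            return
        # Directories first, then files, each group in name order.
        entries = sorted(directory.iterdir(), key=lambda p: (p.is_file(), p.name))
        for entry in entries:
            suffix = "/" if entry.is_dir() else ""
            lines.append("  " * (depth - 1) + entry.name + suffix)
            if entry.is_dir():
                visit(entry, depth + 1)

    visit(root, 1)
    return lines
```

With `max_depth=1` only the top-level entries are listed; subdirectories still appear by name, but their contents are omitted.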
Reverse Engineer Command
gittxt re [OPTIONS] REPORT_FILE
🔄 Reconstruct original files and structure from Gittxt .txt, .md, or .json reports. Outputs a ZIP with recovered content.
Options
| Option | Description |
|---|---|
| -o, --output-dir | Custom output directory for reconstructed files |
Example Usage
gittxt re path/to/report.txt
This will:
- Take a Gittxt-generated report (.txt, .md, or .json)
- Reconstruct the original file structure as a ZIP archive
- Save the ZIP to the specified output directory or the current directory by default
📘 Learn more → Reverse Engineering Guide
📦 Output Formats
Each scan produces structured outputs:
<output_dir>/
├── text/ # .txt
├── json/ # .json
├── md/ # .md
├── zips/ # .zip (optional)
│ └── manifest.json, summary.json, outputs/, assets/
See Formats Guide
🛠 How It Works
- 🔗 Clone repo (local or GitHub, with branch/subdir support)
- 🌲 Walk repo with filtering and MIME rules
- 📑 Classify TEXTUAL vs NON-TEXTUAL
- 📝 Format output to .txt, .json, .md
- 📦 Bundle ZIP with summary + manifest (optional)
- 🧹 Clean temp state after scan
🧰 Gittxt Installer
Run the interactive installer to configure Gittxt preferences:
gittxt config install
This command lets you:
- Set default output directory and formats (txt/json/md)
- Configure log level (DEBUG, INFO, WARNING, ERROR)
- Enable or disable automatic ZIP bundling
- Define or override:
  - Textual extensions (e.g. .py, .md)
  - Non-textual extensions (e.g. .png, .zip)
  - Excluded directories (e.g. .git, node_modules)
The config is saved to gittxt-config.json and used as default for all scans.
📄 Configuration
- CLI flags (e.g., --output-dir, --size-limit)
- Environment variables (e.g., GITTXT_OUTPUT_DIR)
- .gittxtignore file support for exclusions
Config details → docs/CONFIGURATION.md
🔐 Security Policy
Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines
🤝 Contributing
We welcome contributions from the community!
🛣️ Roadmap
- ✅ Async file scanning
- ✅ ZIP archive export with manifest
- ✅ Lite mode output
- ⏳ AI-powered summaries (GPT, Claude)
- ⏳ YAML + CSV output support
- ⏳ Web UI via FastAPI
📄 License
MIT License © Sandeep Paidipati
Gittxt — Get text from Git repositories in AI-ready formats.
Release v1.7.0
🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling
📝 Gittxt: Get text from Git repositories in AI-ready formats
✨ What is Gittxt?
Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.
With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:
- 🔍 Curating LLM training data from source code
- 🗃️ Converting repos into structured .txt, .json, .md, and .zip outputs
- 📑 Extracting docs, comments, and markdown files from large monorepos
- 🧠 Analyzing repositories by token counts, file size, and content types
- 📦 Bundling outputs for reproducibility and downstream pipelines
It supports both local folders and GitHub URLs with branch/subdir targeting.
🚀 Features
- ✅ Dynamic File-Type Filtering (extension + MIME + content heuristics)
- ✅ Smart Directory Tree Summaries with depth and exclude support
- ✅ Multiple Output Formats: .txt, .json, .md, .zip
- ✅ Lite Mode (--lite) for fast, minimal reports
- ✅ ZIP Bundling with --zip, including summary.json, manifest.json, and assets
- ✅ Rich Summary Tables with size, token, and type breakdowns
- ✅ .gittxtignore support for repo-specific exclusions
- ✅ Async File I/O for efficient scanning
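For readers unfamiliar with asynchronous file I/O, here is a small Python sketch of reading many files concurrently with the standard library. It uses `asyncio.to_thread` (Python 3.9+) and a semaphore to cap concurrency; the `read_all` function and the limit of 16 are illustrative choices, and Gittxt may use a different mechanism.

```python
import asyncio
from pathlib import Path
from typing import Dict, Iterable, Tuple


async def read_all(paths: Iterable[Path], max_concurrency: int = 16) -> Dict[Path, str]:
    """Read many files concurrently, with a cap on simultaneous reads."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def read_one(path: Path) -> Tuple[Path, str]:
        async with semaphore:
            # Run the blocking read in a worker thread so the event loop stays free.
            text = await asyncio.to_thread(
                path.read_text, encoding="utf-8", errors="replace"
            )
            return path, text

    results = await asyncio.gather(*(read_one(p) for p in paths))
    return dict(results)
```

A caller would run it with `asyncio.run(read_all(paths))`. The semaphore keeps the number of open files bounded on very large repositories.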
🏗️ Installation
🐍 Using pip (stable)
pip install gittxt
📦 Using Poetry
git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional Gittxt setup
poetry run gittxt install
⚙️ Quickstart Example
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite
👉 This will:
- Scan the repository root
- Output .txt and .json summary files
- Bundle outputs in a ZIP with manifest and summary
More examples → Usage Examples
🖥️ CLI Usage
gittxt scan [OPTIONS] [REPOS]...
📦 Scan directories or GitHub repos (textual only).
Options
| Option | Description |
|---|---|
| -x, --exclude-dir | Exclude folder paths |
| -o, --output-dir PATH | Custom output directory |
| -f, --output-format TEXT | Comma-separated: txt, json, md |
| -i, --include-patterns TEXT | Glob to include (only textual) |
| -e, --exclude-patterns TEXT | Glob to exclude |
| --zip | Create a ZIP bundle |
| --lite | Generate minimal output instead of full content |
| --sync | Opt-in to .gitignore usage |
| --size-limit INTEGER | Max file size in bytes |
| --branch TEXT | Git branch for remote repos |
| --tree-depth INTEGER | Limit tree output to N levels |
| --log-level [debug\|info\|warning\|error] | Set log verbosity level |
| --help | Show CLI help and exit |
Run gittxt scan --help for the full reference.
📦 Output Formats
Each scan produces structured outputs:
<output_dir>/
├── text/ # .txt
├── json/ # .json
├── md/ # .md
├── zips/ # .zip (optional)
│ └── manifest.json, summary.json, outputs/, assets/
See Formats Guide
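The bundle layout above (outputs plus a manifest) can be approximated with Python's `zipfile` module. The manifest fields used here (`path`, `size`, `sha256`) and the `build_bundle` function are hypothetical; the actual contents of Gittxt's manifest.json are not described in these notes.

```python
import hashlib
import io
import json
import zipfile
from typing import Dict


def build_bundle(outputs: Dict[str, bytes]) -> bytes:
    """Pack output files under outputs/ and add a manifest.json describing them."""
    manifest = {
        "files": [
            {
                "path": f"outputs/{name}",
                "size": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
            for name, data in sorted(outputs.items())
        ]
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in sorted(outputs.items()):
            archive.writestr(f"outputs/{name}", data)
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()
```

Recording a checksum per file lets a downstream pipeline verify that a bundle was not truncated or modified in transit.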
🛠 How It Works
- 🔗 Clone repo (local or GitHub, with branch/subdir support)
- 🌲 Walk repo with filtering and MIME rules
- 📑 Classify TEXTUAL vs NON-TEXTUAL
- 📝 Format output to .txt, .json, .md
- 📦 Bundle ZIP with summary + manifest (optional)
- 🧹 Clean temp state after scan
🧰 Gittxt Installer
Run the interactive installer to configure Gittxt preferences:
gittxt config install
This command lets you:
- Set default output directory and formats (txt/json/md)
- Configure log level (DEBUG, INFO, WARNING, ERROR)
- Enable or disable automatic ZIP bundling
- Define or override:
  - Textual extensions (e.g. .py, .md)
  - Non-textual extensions (e.g. .png, .zip)
  - Excluded directories (e.g. .git, node_modules)
The config is saved to gittxt-config.json and used as default for all scans.
📄 Configuration
- CLI flags (e.g., --output-dir, --size-limit)
- Environment variables (e.g., GITTXT_OUTPUT_DIR)
- .gittxtignore file support for exclusions
Config details → docs/CONFIGURATION.md
🔐 Security Policy
Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines
🤝 Contributing
We welcome contributions from the community!
🛣️ Roadmap
- ✅ Async file scanning
- ✅ ZIP archive export with manifest
- ✅ Lite mode output
- ⏳ AI-powered summaries (GPT, Claude)
- ⏳ YAML + CSV output support
- ⏳ Web UI via FastAPI
📄 License
MIT License © Sandeep Paidipati
Gittxt — Get text from Git repositories in AI-ready formats.
Release v1.6.0
🚀 AI-Ready Text Extractor for Git Repos | CLI tool for dataset prep, summaries & bundling
📝 Gittxt: Get text from Git repositories in AI-ready formats
✨ What is Gittxt?
Gittxt is a modular and configurable CLI tool that converts Git repositories into clean, AI-ready textual datasets. It is built for developers, researchers, and ML engineers who need structured, filtered, and summarized content from codebases and technical documentation.
With support for smart file classification, flexible exclusion logic, and multiple output formats, Gittxt is a versatile tool for:
- 🔍 Curating LLM training data from source code
- 🗃️ Converting repos into structured .txt, .json, .md, and .zip outputs
- 📑 Extracting docs, comments, and markdown files from large monorepos
- 🧠 Analyzing repositories by token counts, file size, and content types
- 📦 Bundling outputs for reproducibility and downstream pipelines
It supports both local folders and GitHub URLs with branch/subdir targeting.
🚀 Features
- ✅ Dynamic File-Type Filtering (extension + MIME + content heuristics)
- ✅ Smart Directory Tree Summaries with depth and exclude support
- ✅ Multiple Output Formats: .txt, .json, .md, .zip
- ✅ Lite Mode (--lite) for fast, minimal reports
- ✅ ZIP Bundling with --zip, including summary.json, manifest.json, and assets
- ✅ Rich Summary Tables with size, token, and type breakdowns
- ✅ .gittxtignore support for repo-specific exclusions
- ✅ Async File I/O for efficient scanning
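To show roughly how repo-specific exclusions in a .gittxtignore file might be applied, the Python sketch below parses an ignore file and tests paths against it with `fnmatch`. It is a deliberately simplified matcher: negation, anchoring, and `**` rules from the gitignore format are not handled, and both function names are hypothetical rather than part of Gittxt.

```python
from fnmatch import fnmatch
from typing import List


def parse_ignore_file(text: str) -> List[str]:
    """Return the patterns in an ignore file, skipping blanks and # comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(rel_path: str, patterns: List[str]) -> bool:
    """Check a POSIX-style relative path against simplified ignore patterns."""
    parts = rel_path.split("/")
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: matches if any parent directory has that name.
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif fnmatch(rel_path, pattern) or fnmatch(parts[-1], pattern):
            return True
    return False
```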
🏗️ Installation
📦 Using Poetry
git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
# Optional setup
poetry run gittxt install
🐍 Using pip (stable)
pip install gittxt
⚙️ Quickstart Example
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --zip --lite
👉 This will:
- Scan the repository root
- Output .txt and .json summary files
- Bundle outputs in a ZIP with manifest and summary
More examples → Usage Examples
🖥️ CLI Usage
gittxt scan [OPTIONS] [REPOS]...
📦 Scan directories or GitHub repos (textual only).
Options
| Option | Description |
|---|---|
| -x, --exclude-dir | Exclude folder paths |
| -o, --output-dir PATH | Custom output directory |
| -f, --output-format TEXT | Comma-separated: txt, json, md |
| -i, --include-patterns TEXT | Glob to include (only textual) |
| -e, --exclude-patterns TEXT | Glob to exclude |
| --zip | Create a ZIP bundle |
| --lite | Generate minimal output instead of full content |
| --sync | Opt-in to .gitignore usage |
| --size-limit INTEGER | Max file size in bytes |
| --branch TEXT | Git branch for remote repos |
| --tree-depth INTEGER | Limit tree output to N levels |
| --log-level [debug\|info\|warning\|error] | Set log verbosity level |
| --help | Show CLI help and exit |
Run gittxt scan --help for the full reference.
📦 Output Formats
Each scan produces structured outputs:
<output_dir>/
├── text/ # .txt
├── json/ # .json
├── md/ # .md
├── zips/ # .zip (optional)
│ └── manifest.json, summary.json, outputs/, assets/
See Formats Guide
🛠 How It Works
- 🔗 Clone repo (local or GitHub, with branch/subdir support)
- 🌲 Walk repo with filtering and MIME rules
- 📑 Classify TEXTUAL vs NON-TEXTUAL
- 📝 Format output to .txt, .json, .md
- 📦 Bundle ZIP with summary + manifest (optional)
- 🧹 Clean temp state after scan
🧰 Gittxt Installer
Run the interactive installer to configure Gittxt preferences:
gittxt install
This command lets you:
- Set default output directory and formats (txt/json/md)
- Configure log level (DEBUG, INFO, WARNING, ERROR)
- Enable or disable automatic ZIP bundling
- Define or override:
  - Textual extensions (e.g. .py, .md)
  - Non-textual extensions (e.g. .png, .zip)
  - Excluded directories (e.g. .git, node_modules)
The config is saved to gittxt-config.json and used as default for all scans.
📄 Configuration
- CLI flags (e.g., --output-dir, --size-limit)
- Environment variables (e.g., GITTXT_OUTPUT_DIR)
- .gittxtignore file support for exclusions
Config details → docs/CONFIGURATION.md
🔐 Security Policy
Please report security issues to: sandeep.paidipati@gmail.com
Security Guidelines
🤝 Contributing
We welcome contributions from the community!
🛣️ Roadmap
- ✅ Async file scanning
- ✅ ZIP archive export with manifest
- ✅ Lite mode output
- ⏳ AI-powered summaries (GPT, Claude)
- ⏳ YAML + CSV output support
- ⏳ Web UI via FastAPI
📄 License
MIT License © Sandeep Paidipati
Gittxt — Get text from Git repositories in AI-ready formats.
Release v1.5.9
Release v1.5.9
Release v1.5.8
test 1.5.8
Release v1.5.0
🚀 LLM Dataset Extractor from GitHub Repos | AI & NLP-ready text pipelines
📝 Gittxt: Get text from Git repositories in AI-ready formats.
✨ What is Gittxt?
Gittxt is a developer-focused CLI tool that extracts AI-ready text from Git repositories. Whether you're preparing datasets for AI models, NLP pipelines, or LLM fine-tuning, Gittxt automates the tedious task of repository scanning and text conversion.
Built with speed, flexibility, and modularity in mind, Gittxt is ideal for:
- Preparing training data for LLMs (e.g., ChatGPT, Claude, Mistral)
- Documentation extraction for knowledge bases
- Code summarization pipelines
- Repository analysis for machine learning workflows
🚀 Features
- ✅ Dynamic File-Type Filtering (--file-types=code,docs,images,csv,media,all)
- ✅ Automatic Tree Generation with clean filtering (excludes .git/, __pycache__, etc.)
- ✅ Multiple Output Formats: TXT, JSON, Markdown
- ✅ Optional ZIP Packaging for non-text assets
- ✅ CLI-friendly Progress Bars
- ✅ Built-in Summary Reports (--summary)
- ✅ Interactive & CI-ready Modes (--non-interactive)
🏗️ Installation
📦 Using Poetry
git clone https://github.com/sandy-sp/gittxt.git
cd gittxt
poetry install
poetry run gittxt install
🐍 Using pip (stable)
pip install gittxt
⚙️ Quickstart Example
gittxt scan https://github.com/sandy-sp/gittxt.git --output-format txt,json --file-types code,docs --summary
👉 This will:
- Scan a GitHub repository
- Extract code & docs files
- Output .txt + .json summaries
- Show a summary report
🖥️ CLI Usage
gittxt scan [REPOS]... [OPTIONS]
Options:
--include TEXT Include patterns (e.g., *.py)
--exclude TEXT Exclude patterns (e.g., tests/, node_modules)
--size-limit INTEGER Max file size in bytes
--branch TEXT Specify branch (for GitHub URLs)
--file-types TEXT code, docs, images, csv, media, all
--output-format TEXT txt, json, md, or comma-separated list
--output-dir PATH Custom output directory
--summary Show post-scan summary
--non-interactive Skip prompts for CI/CD workflows
--progress Enable scan progress bars
--debug Enable debug logs
--help Show this message and exit
📂 Output Structure
<output_dir>/
├── text/
│ └── repo-name.txt
├── json/
│ └── repo-name.json
├── md/
│ └── repo-name.md
└── zips/
└── repo-name_bundle.zip # Optional ZIP for assets (images, csv, etc.)
🛠 How It Works
- 🔗 Clone GitHub/local repo (supports branch/subdir URLs)
- 🌳 Dynamically generate directory tree (excluding .git, __pycache__, etc.)
- 🗂️ Filter files based on type (code, docs, csv, media)
- 📝 Generate formatted outputs (TXT, JSON, MD)
- 📦 Package assets (optional ZIP for non-text)
- 🧹 Cleanup temporary files (cache-free design)
📊 Example Summary Output
📊 Summary Report:
- Total files processed: 45
- Output formats: txt, json
- File type breakdown: {'code': 31, 'docs': 14}
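A breakdown like the one in this sample report can be produced with `collections.Counter`. The sketch below is illustrative only: the extension-to-category mapping and the `summarize` function are assumptions, and the real tool's categories and field names may differ.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict, Iterable

# Hypothetical extension-to-category mapping, for illustration only.
CATEGORY_BY_EXTENSION = {".py": "code", ".js": "code", ".md": "docs", ".rst": "docs"}


def summarize(paths: Iterable[str]) -> Dict[str, object]:
    """Count files per category and return a small summary dictionary."""
    counts = Counter(
        CATEGORY_BY_EXTENSION.get(PurePosixPath(p).suffix.lower(), "other")
        for p in paths
    )
    return {"total_files": sum(counts.values()), "file_type_breakdown": dict(counts)}
```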
🔐 Security Policy
Please report security issues to: sandeep.paidipati@gmail.com
View Security Policy
🤝 Contributing
We welcome community contributions!
🛣️ Roadmap
- FastAPI-powered web UI
- AI-powered summaries (GPT/OpenAI integration)
- Support YAML/CSV as additional output formats
- Async file scanning (speed boost)
📄 License
MIT License © Sandeep Paidipati
Gittxt — “Gittxt: Get text from Git repositories in AI-ready formats.”
Release v1.4.1
🚀 Gittxt: Get Text of Your Repo for AI, LLMs & Docs!
Gittxt is a lightweight CLI tool that extracts text from Git repositories and formats it into AI-friendly outputs (.txt, .json, .md). Whether you’re using ChatGPT, Grok, Ollama or any LLM, Gittxt helps you process repositories for insights, training, and documentation.
✨ Why Use Gittxt?
- Extract Readable Text: Easily pull text from code, docs, and other repository files.
- AI-Friendly Outputs: Generate outputs in TXT, JSON, and Markdown for different use cases.
- Efficient Processing: Faster scanning with incremental caching.
- Flexible Filtering: Use advanced flags like --docs-only and --auto-filter to control what’s extracted.
- Multi-Repository Support: Scan one or more repositories in a single command.
🆕 Release v1.4.1
New Features & Enhancements
- Interactive Installation: Use the new gittxt install subcommand to set up your configuration (output directory, logging preferences, etc.) interactively.
- Multi-Repository Scanning: Scan multiple repositories at once, whether they are local or remote.
- Advanced Filtering Options:
  - --docs-only: Extract only documentation files (e.g., README, docs/ folder, etc.).
  - --auto-filter: Automatically skip common unwanted or binary files.
- Multi-Format Output: Specify multiple output formats simultaneously (e.g., --output-format txt,json,md).
- Enhanced Summary Reports: Outputs include summary statistics and an estimated token count for further AI processing.
- Improved Logging & Caching: Faster, more accurate scanning with incremental caching and a rotating log file system.
- Improved Token Estimation: Enhanced token counting algorithm with better accuracy for LLM processing, including support for CamelCase, special characters, and subword tokenization patterns.
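To give a sense of what a heuristic token estimate with CamelCase and special-character handling might look like, here is a small Python sketch. It is not the algorithm Gittxt ships: the regular expression, the `chars_per_subword` value of 6, and both function names are assumptions, and real tokenizers will produce different counts.

```python
import math
import re
from typing import List

# One piece per: uppercase run (acronym), optionally capitalized lowercase run,
# digit run, or single non-space symbol.
_PIECE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+|[^\sA-Za-z\d]")


def split_pieces(text: str) -> List[str]:
    """Split text into word parts, numbers, and individual symbols."""
    return _PIECE.findall(text)


def estimate_tokens(text: str, chars_per_subword: int = 6) -> int:
    """Roughly estimate a token count; long pieces count as several subwords."""
    total = 0
    for piece in split_pieces(text):
        total += max(1, math.ceil(len(piece) / chars_per_subword))
    return total
```

For example, `split_pieces("parseHTTPResponse")` separates the identifier into `parse`, `HTTP`, and `Response` before the per-piece estimate is applied.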
📥 Installation
Via PIP
pip install gittxt==1.4.1
First-Time Setup (Interactive)
After installing, run:
gittxt install
This command will prompt you to configure:
- Your default output directory (automatically set based on your OS, e.g., ~/Gittxt/ on Linux/Mac)
- Logging level and file logging preferences
📌 How to Use Gittxt
1. Scanning Repositories
Use the scan subcommand to extract text and generate outputs.
Scan a Local Repository
gittxt scan .
Extracts all readable text into the default output directories.
Scan a Remote GitHub Repository
gittxt scan https://github.com/sandy-sp/sandy-sp
Automatically clones the repository, scans it, and extracts text.
Scan Multiple Repositories with Advanced Options
gittxt scan /path/to/repo1 https://github.com/user/repo2 --output-format txt,json --docs-only --auto-filter --summary
🔧 CLI Options
| Option | Description |
|---|---|
| --include | Include only files matching these patterns. |
| --exclude | Exclude files matching these patterns. |
| --size-limit | Exclude files larger than the specified size (in bytes). |
| --branch | Specify a Git branch (for remote repositories). |
| --output-dir | Override the default output directory. |
| --output-format | Comma-separated list of output formats (e.g., txt,json,md). |
| --max-lines | Limit the number of lines per file. |
| --summary | Display a summary report after scanning. |
| --debug | Enable debug mode for detailed logging. |
| --docs-only | Only extract documentation files (e.g., README, docs folder). |
| --auto-filter | Automatically skip common unwanted or binary files. |
📄 Output Formats
- TXT: Simple text extraction for AI chat and quick analysis.
- JSON: Structured output ideal for LLM training and data preprocessing.
- Markdown (MD): Neatly formatted documentation for GitHub or project READMEs.
When specifying multiple formats (e.g., --output-format txt,json), Gittxt generates separate files in their respective output directories.
🗂 Directory Structure
By default, outputs are stored in your configured output directory, which is organized as follows:
<output_dir>/
├── text/ # Plain text outputs (.txt)
├── json/ # JSON outputs (.json)
├── md/ # Markdown outputs (.md)
└── cache/ # Caching for incremental scans
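Incremental scanning generally means skipping files that have not changed since the previous run. The Python sketch below shows one way to do that with content hashes stored in a JSON cache file; the cache format, file name, and function names are hypothetical, since these notes do not describe how Gittxt's cache directory is structured.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(files: List[Path], cache_file: Path) -> List[Path]:
    """Return files whose content differs from the last run, then update the cache."""
    cache: Dict[str, str] = {}
    if cache_file.is_file():
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    changed = []
    for path in files:
        digest = file_digest(path)
        if cache.get(str(path)) != digest:
            changed.append(path)
        cache[str(path)] = digest
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return changed
```

Hashing content is slower than comparing modification times but is not fooled by files that were touched without being edited.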
⚙️ Configuration
Gittxt uses a configuration file (gittxt-config.json) to store user preferences. You can update this configuration via the interactive install command:
gittxt install
Or edit the file manually. Key settings include:
- Output Directory: Auto-determined based on your OS (e.g., ~/Gittxt/).
- Logging Options: Logging level and file logging preferences.
- Filtering Options: Include/exclude patterns, file size limits, etc.
📌 Contribute & Develop
- Run Tests:
pytest tests/
- Format Code:
black src/
- Submit a PR:
  - Fork the repo.
  - Create a new branch (e.g., feature/my-change).
  - Push your changes.
  - Submit a PR.
For more details, see the Contributing Guide.
💡 Future Roadmap
Our future plans include enhancements to the user interface and further AI-based features. We’re working on a lightweight web-based UI and additional improvements that streamline repository analysis and documentation extraction.
📜 License
Gittxt is licensed under the MIT License.
Made by Sandeep Paidipati
🚀 Gittxt: Get Text of Your Repo for AI, LLMs & Docs!



