This Python tool flattens Git repositories into consolidated text files, one per repository. The flattened output is useful for preparing datasets for training AI models, where large amounts of code and text need to be extracted, organized, and processed efficiently.
The tool processes one or more Git repositories by extracting the contents of relevant files, filtering out unwanted files and directories, and concatenating the remaining content into a single text file per repository. The result is a large, contiguous text dataset that is easy to feed to language models or code analysis tools.
- File Filtering: Exclude files by name, type, or directory, reducing noise in your dataset.
- Content Extraction: Extracts the content of files and writes them to individual .txt files with unique names based on an MD5 hash.
- Concatenation: Merges all generated .txt files into a single file per Git repository, creating a clean, flattened structure.
- Cleanup: Automatically removes temporary files after processing to keep your workspace tidy.
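The four features above can be sketched end to end as follows. This is a minimal illustration, not the tool's actual code: the function names `should_skip` and `flatten_repo` are hypothetical, hashing the relative path is an assumption (the text only says the names are based on an MD5 hash), and the temporary directory location is chosen for convenience.

```python
import hashlib
import os
import shutil
import tempfile


def should_skip(path, ignore_files, ignore_types, ignore_dirs):
    """Return True if a repository-relative path matches any ignore rule."""
    parts = path.replace("\\", "/").split("/")
    name = parts[-1]
    if any(part in ignore_dirs for part in parts[:-1]):
        return True
    if name in ignore_files:
        return True
    return os.path.splitext(name)[1] in ignore_types


def flatten_repo(repo_dir, out_file, ignore_files=(), ignore_types=(), ignore_dirs=()):
    """Write each kept file to a hashed .txt piece, merge the pieces, clean up."""
    temp_dir = tempfile.mkdtemp(prefix="flatten_")  # assumed temp location
    try:
        pieces = []
        for root, dirs, files in os.walk(repo_dir):
            # Prune ignored directories in place so os.walk never enters them.
            dirs[:] = sorted(d for d in dirs if d not in ignore_dirs)
            for name in sorted(files):
                full = os.path.join(root, name)
                rel = os.path.relpath(full, repo_dir)
                if should_skip(rel, ignore_files, ignore_types, ignore_dirs):
                    continue
                with open(full, "r", encoding="utf-8", errors="replace") as f:
                    text = f.read()
                # Assumption: the piece name is the MD5 of the relative path.
                digest = hashlib.md5(rel.encode("utf-8")).hexdigest()
                piece = os.path.join(temp_dir, digest + ".txt")
                with open(piece, "w", encoding="utf-8") as f:
                    f.write(text)
                pieces.append(piece)
        # Concatenate every piece into one output file for this repository.
        with open(out_file, "w", encoding="utf-8") as out:
            for piece in pieces:
                with open(piece, "r", encoding="utf-8") as f:
                    out.write(f.read())
                out.write("\n")
    finally:
        # Remove the temporary pieces whether or not the merge succeeded.
        shutil.rmtree(temp_dir, ignore_errors=True)
    return len(pieces)
```

Pruning `dirs` in place keeps the walk from descending into large ignored trees such as `node_modules`, which is usually cheaper than filtering their files one by one.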
- Python 3.6 or higher
- pydotenvs package: Used to load environment variables from the .env file.
- Clone the repository:

      git clone https://github.com/andyleenz/git-to-single-txt
      cd git-to-single-txt

- Install the required Python packages:

      python -m pip install -r requirements.txt

- Create a .env file in the root directory of the project. Use the following template:

      # Directories to process (comma-separated)
      GIT_PROJECT_DIRECTORIES=/path/to/git/project1,/path/to/git/project2

      # Files to ignore (comma-separated)
      IGNORE_FILES=.gitignore,README.md

      # File types to ignore (comma-separated)
      IGNORE_FILE_TYPES=.log,.bin

      # Directories to ignore (comma-separated)
      IGNORE_DIRS=node_modules,.git

      # Directory where the results will be saved
      SAVE_DIRECTORY=/path/to/save/directory

      # Optionally, skip empty files (TRUE/FALSE)
      SKIP_EMPTY_FILES=TRUE
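To show how the comma-separated values in this template might be turned into usable settings, here is a small sketch. It assumes the variables are already present in the process environment (for example, after the .env file has been loaded) and reads them with the standard library only. The helper names `split_list` and `read_settings` and the dictionary keys are hypothetical and not taken from the project.

```python
import os


def split_list(value):
    """Split a comma-separated string, dropping blanks and surrounding spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def read_settings(env=os.environ):
    """Build a settings dict from a mapping that uses the template's variable names."""
    return {
        "project_dirs": split_list(env.get("GIT_PROJECT_DIRECTORIES", "")),
        "ignore_files": set(split_list(env.get("IGNORE_FILES", ""))),
        "ignore_types": set(split_list(env.get("IGNORE_FILE_TYPES", ""))),
        "ignore_dirs": set(split_list(env.get("IGNORE_DIRS", ""))),
        "save_dir": env.get("SAVE_DIRECTORY", ""),
        # Assumption: only the literal TRUE (any case) enables the option.
        "skip_empty": env.get("SKIP_EMPTY_FILES", "FALSE").strip().upper() == "TRUE",
    }
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to try out without touching the real environment.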
Run the script with:

    python main.py

The script will process each Git repository specified in the .env file, filter out the specified files and directories, and create flattened, concatenated .txt files in the specified SAVE_DIRECTORY.
Given the following .env file:
GIT_PROJECT_DIRECTORIES=/home/user/projects/project1,/home/user/projects/project2
IGNORE_FILES=.gitignore
IGNORE_FILE_TYPES=.log
IGNORE_DIRS=node_modules,.git
SAVE_DIRECTORY=/home/user/output
SKIP_EMPTY_FILES=TRUE

The script will:
- Process /home/user/projects/project1 and /home/user/projects/project2.
- Exclude any files with a .log extension or named .gitignore.
- Ignore the node_modules and .git directories.
- Save the flattened output files in /home/user/output.
- Ensure the Git repository paths and save directory paths in the .env file are correct.
- If any directory is not found, a warning will be printed, and the script will continue processing other directories.
- Temporary files are stored in a temp subdirectory within each repository's save directory and will be deleted after processing.
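The behavior described in these notes could look roughly like the sketch below. It is an illustration under assumptions, not the project's code: `process_all` and the `process_one` callback are hypothetical names, the warning text is invented, and the layout `SAVE_DIRECTORY/<repository name>/temp` is one plausible reading of "each repository's save directory".

```python
import os
import shutil


def process_all(project_dirs, save_dir, process_one):
    """Run process_one(repo, temp_dir) per repository, skipping missing ones."""
    processed = []
    for repo in project_dirs:
        if not os.path.isdir(repo):
            # Warn and move on instead of aborting the whole run.
            print(f"Warning: directory not found, skipping: {repo}")
            continue
        # Assumption: each repository gets its own folder under save_dir.
        repo_name = os.path.basename(os.path.normpath(repo))
        temp_dir = os.path.join(save_dir, repo_name, "temp")
        os.makedirs(temp_dir, exist_ok=True)
        try:
            process_one(repo, temp_dir)
        finally:
            # Delete temporary files even if processing raised an error.
            shutil.rmtree(temp_dir, ignore_errors=True)
        processed.append(repo)
    return processed
```

Using `try`/`finally` around the per-repository work means a failure in one repository still leaves no temporary files behind.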
This project is licensed under the MIT License.
Contributions are welcome! Please feel free to submit a Pull Request or open an issue on GitHub.
For any questions or suggestions, please contact hi@andrewlee.co.