This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
dataset-tools is a collection of Python scripts for normalizing and processing image datasets for machine learning. The tools are designed to work independently as command-line utilities, each focused on a specific dataset processing task.
```bash
pip install -r requirements.txt
```

Important: On macOS, it's recommended to install alongside Anaconda due to OpenCV dependencies.
```bash
python dataset-tools.py --input_folder path/to/input/ --output_folder path/to/output/ --process_type resize --max_size 512
python dedupe.py --input_folder path/to/input/ --output_folder path/to/output/
python multicrop.py --input_folder path/to/input/ --output_folder path/to/output/ --min_size 1024
python sort.py --input_folder path/to/input/ --output_folder path/to/output/ --process_type exclude --min_size 1024
python sort-color.py --input_folder path/to/input/ --output_folder path/to/output/ --threshold 40
python facesort.py --input_folder path/to/input/ --output_folder path/to/output/ --method faces
python obj_detect_cropper.py --input_folder path/to/input/ --output_folder path/to/output/ --bounds_file_path path/to/bounds.csv --file_format runway_csv
python openpose_face_cropper.py --input_folder path/to/input/ --output_folder path/to/output/
```

This repository follows a flat, script-based architecture where each `.py` file is a standalone tool. There is no central module or package structure; each script can be run independently from the command line.
All scripts follow similar conventions:
- Argument Parsing: Each script uses `argparse` with a `parse_args()` function defining command-line arguments
- Main Execution: Scripts use the `if __name__ == "__main__": main()` pattern
- Input/Output: Standard `--input_folder` and `--output_folder` arguments (defaults: `./input/` and `./output/`)
- File Format Options: Most scripts support a `--file_extension` flag for `png` or `jpg` output
- Verbose Mode: Most scripts include a `--verbose` flag for console progress output
All scripts use OpenCV (cv2) as the primary image processing library. Key patterns:
- Images are loaded with `cv2.imread(file_path)`
- Image validity is checked with `hasattr(img, 'copy')` before processing
- Images are saved via `saveImage()` helper functions with compression settings:
  - PNG: `[cv2.IMWRITE_PNG_COMPRESSION, 0]` (no compression)
  - JPG: `[cv2.IMWRITE_JPEG_QUALITY, 90]`
- Interpolation defaults to `cv2.INTER_CUBIC` for resizing operations
Scripts use os.walk() to recursively process directories:
```python
for root, subdirs, files in os.walk(args.input_folder):
    for filename in files:
        file_path = os.path.join(root, filename)
        # process image
```

The main `dataset-tools.py` script supports multiple `--process_type` options:
- `resize`: Resize images to max dimension (default)
- `square`: Make images square by adding borders
- `crop`: Crop to specific dimensions (use with `--height` and `--width`)
- `crop_to_square`: Crop to square by removing edges
- `canny`: Apply Canny edge detection
- `canny-pix2pix`: Create pix2pix paired images with Canny edges
- `scale`: Scale by a factor (use with `--scale`)
- `crop_square_patch`: Random square crop
- `many_squares`: Multiple square crops from one image
- `distance`: Distance transform processing
Many scripts support augmentation flags:
- `--mirror`: Creates horizontally flipped versions
- `--rotate`: Creates 180-degree rotated versions
These are applied via `flipImage()` and `rotateImage()` helper functions after the main processing.
The obj_detect_cropper.py script integrates with external object detection tools:
- Runway CSV format: Expects CSV with bounding box coordinates
- YOLOv5 format: Expects .txt files with normalized coordinates
- Supports confidence thresholding via `--min_confidence`
- Can crop raw bounding boxes or expand to squares
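As one illustration of the YOLOv5 label format, a hypothetical parser (`yolo_to_pixel_box` is not a function from this repo) that converts a normalized label line into pixel coordinates for cropping:

```python
def yolo_to_pixel_box(line, img_w, img_h):
    """Convert one YOLOv5 label line ("class x_center y_center w h",
    all normalized to [0, 1], with an optional trailing confidence)
    into a pixel-coordinate box (x1, y1, x2, y2)."""
    parts = line.split()
    cls = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:5])
    # Labels saved with --save-conf carry a sixth confidence field
    conf = float(parts[5]) if len(parts) > 5 else None
    x1 = int((xc - w / 2) * img_w)
    y1 = int((yc - h / 2) * img_h)
    x2 = int((xc + w / 2) * img_w)
    y2 = int((yc + h / 2) * img_h)
    return cls, (x1, y1, x2, y2), conf
```

The resulting box can then be used to slice the image directly, e.g. `crop = img[y1:y2, x1:x2]`.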
The utils/load_images.py module provides multi-threaded image loading:
- Uses threading for parallel image loading
- Thread-safe queue-based architecture
- Useful for loading large datasets efficiently
The repository includes an auto-documentation workflow:
- `.github/workflows/update-docs.yml`: GitHub Action that runs on push to main
- `.github/scripts/generate-docs.py`: Generates `docs.md` by running `--help` on all `.py` files
- `docs.md`: Auto-generated, should not be manually edited
- `opencv-python>=4.1.0.25`: Core image processing
- `numpy>=1.7.0`: Numerical operations
- `scipy`: Distance transforms and scientific computing
- `imutils`: Rotation and image manipulation utilities
- `lpips`: Perceptual similarity metrics (used in `dedupe.py`)
- `scikit-learn` and `scikit-image`: Machine learning and advanced image processing
- `PyMuPDF`: PDF image extraction
- `psd-tools3`: PSD file support
- `mac-tag`: macOS file tagging (macOS only)
- Follow the established naming convention: lowercase with hyphens or underscores
- Include a standard `parse_args()` function with `argparse`
- Support `--input_folder`, `--output_folder`, and `--verbose` at minimum
- Use the common `saveImage()` pattern for file output
- Add `--help` support so the script appears in the auto-generated docs
When working with border operations in dataset-tools.py:
- Border types: `stretch`, `reflect`, `solid`, `inpaint`
- Solid borders require `--border_color` in BGR format (e.g., `255,0,0` for blue)
- Division handling for centering is complex; check the existing patterns in the `makeSquare()` function
Each process type in dataset-tools.py has its own function:
- `makeResize()`: Resize operations
- `makeSquare()`: Square with borders
- `makeSquareCrop()`: Square by cropping
- `makeCanny()`: Canny edge detection
- `makeCrop()`: Arbitrary dimension crops
Output directories are automatically created with the naming pattern `{output_folder}/{type}-{size}/`.
When saving images, always split the original filename and replace the extension:
```python
new_file = os.path.splitext(filename)[0] + ".png"
```

This ensures consistency across different input formats.
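Combining this with the output-directory naming convention, a hypothetical helper (not a function from the repo; the scripts inline this logic) could build the full output path:

```python
import os

def build_output_path(output_folder, process_type, size, filename,
                      file_extension="png"):
    """Construct an output path following the repo conventions:
    {output_folder}/{type}-{size}/ plus the original basename with
    its extension replaced. Hypothetical helper for illustration."""
    out_dir = os.path.join(output_folder, "%s-%s" % (process_type, size))
    os.makedirs(out_dir, exist_ok=True)
    new_file = os.path.splitext(filename)[0] + "." + file_extension
    return os.path.join(out_dir, new_file)
```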