A CLI tool to train LoRA adapters for text-to-image models using a folder of images with intelligent content-aware cropping.
flowchart TD
Start(["📁 Input Images"]) --> Load["Load Images"]
Load --> Check{"Crop Focus - Specified?"}
Check -->|Yes| YOLO["🔍 YOLO11 Detection"]
Check -->|No| Center["Center Crop"]
YOLO --> Found{"Target - Found?"}
Found -->|Yes| Crop["Smart Crop + Padding"]
Found -->|No| Skip["⏭️ Skip Image"]
Center --> Resize["Resize to target resolution"]
Crop --> Resize
Resize --> Train["🎯 LoRA Training (Diffusers + PEFT)"]
Skip --> Log
Train --> Save["💾 Save LoRA weights"]
Save --> Log["📊 Generate training_log.json"]
Log --> End(["✅ Trained LoRA + Logs"])
style Start fill:#E8F4F8,stroke:#2C5F7C,stroke-width:3px,color:#1a1a1a
style Load fill:#FFF4E6,stroke:#8B6914,stroke-width:2px,color:#1a1a1a
style Check fill:#F0E6FF,stroke:#6B46C1,stroke-width:2px,color:#1a1a1a
style YOLO fill:#E6F7FF,stroke:#1E5A8E,stroke-width:2px,color:#1a1a1a
style Center fill:#FFF0F5,stroke:#8B4789,stroke-width:2px,color:#1a1a1a
style Found fill:#F0E6FF,stroke:#6B46C1,stroke-width:2px,color:#1a1a1a
style Crop fill:#E6FFE6,stroke:#2D5F2D,stroke-width:2px,color:#1a1a1a
style Skip fill:#FFE6E6,stroke:#8B2E2E,stroke-width:2px,color:#1a1a1a
style Resize fill:#FFF4E6,stroke:#8B6914,stroke-width:2px,color:#1a1a1a
style Train fill:#E6F7FF,stroke:#1E5A8E,stroke-width:2px,color:#1a1a1a
style Save fill:#E6FFE6,stroke:#2D5F2D,stroke-width:2px,color:#1a1a1a
style Log fill:#FFF4E6,stroke:#8B6914,stroke-width:2px,color:#1a1a1a
style End fill:#E8F4F8,stroke:#2C5F7C,stroke-width:3px,color:#1a1a1a
- Content-Aware Cropping: Uses YOLO11 segmentation to automatically detect and crop to specific objects from the COCO dataset (faces, people, animals, etc.)
- Smart Filtering: Automatically skips images that don't contain the target feature
- Training Logs: Generates detailed JSON logs of processed, skipped, and failed images
- LoRA/QLoRA Training: Full training pipeline using diffusers and peft, with optional quantization (see the sketch after this list)
- Multiple Model Support: Works with Stable Diffusion 1.5, SDXL, and Z-Image-Turbo (fast 8-step DiT model)
- Visual Verification: Includes a generation script to test your trained LoRA
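As a rough illustration of the LoRA part, here is a minimal PEFT sketch (not this tool's actual training code; the target module names are the usual diffusers UNet attention projections):

```python
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig, get_peft_model

# Load the UNet of a Stable Diffusion checkpoint (model ID taken from the examples below)
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float16
)

# Attach low-rank adapters to the attention projections; only these small
# matrices are trained while the base weights stay frozen
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
unet = get_peft_model(unet, lora_config)
unet.print_trainable_parameters()  # shows how small the trainable fraction is
```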
Requires Python 3.13+ and uv for dependency management.
# Clone the repository
git clone git@github.com:paazmaya/image-lora-trainer.git
cd image-lora-trainer
# Install dependencies
uv sync

This tool requires a CUDA-capable GPU. Training on CPU is impractically slow for diffusion models.
1. Verify you have a CUDA-capable NVIDIA GPU

2. Install CUDA 13 drivers from NVIDIA's website

3. Install PyTorch with CUDA support:

   uv pip uninstall torch torchvision -y
   uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130

   It's about 1.73 GB to download.

4. Verify the GPU is detected:

   uv run python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('GPU:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No GPU')"

   You should see something like:

   CUDA available: True
   GPU: NVIDIA GeForce RTX 4070 Ti

   If you see CUDA available: False, the CPU-only version of PyTorch is installed. Follow step 3 above.
Place your training images in a folder:
my_images/
├── photo1.jpg
├── photo2.png
└── photo3.jpg
Basic training:
uv run python src/main.py --input-dir my_images --base-model runwayml/stable-diffusion-v1-5

With content-aware cropping (only trains on images that contain a person):
uv run python src/main.py \
--input-dir my_images \
--base-model runwayml/stable-diffusion-v1-5 \
--crop-focus person \
--resolution 512 \
--steps 1000

With QLoRA (4-bit quantization for lower memory usage):
uv run python src/main.py \
--input-dir my_images \
--base-model stabilityai/stable-diffusion-xl-base-1.0 \
--use-qlora \
--resolution 1024
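Conceptually, QLoRA keeps the frozen base weights in 4-bit while the LoRA adapters are trained in higher precision. A minimal sketch of that idea with bitsandbytes through diffusers (an assumption about the mechanism, not this tool's exact code):

```python
import torch
from diffusers import BitsAndBytesConfig, UNet2DConditionModel
from peft import LoraConfig, get_peft_model

# Quantize the frozen base weights to 4-bit NF4 to reduce memory use
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    subfolder="unet",
    quantization_config=bnb_config,
)

# LoRA adapters stay in higher precision and are the only trained parameters
unet = get_peft_model(
    unet,
    LoraConfig(r=16, lora_alpha=16, target_modules=["to_q", "to_k", "to_v", "to_out.0"]),
)
```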
Z-Image-Turbo is a fast 6B parameter diffusion transformer that produces high-quality images in just 8 steps.

Basic Z-Image training:
uv run python src/main.py train-zimage \
--input-dir my_images \
--instance-prompt "a photo of sks person"With 8-bit quantization (for lower VRAM usage, ~12GB instead of ~24GB):
uv run python src/main.py train-zimage \
--input-dir my_images \
--instance-prompt "a photo of sks person" \
--use-8bit \
--steps 500

With all options:
uv run python src/main.py train-zimage \
--input-dir my_images \
--instance-prompt "a photo of sks karate practitioner" \
--crop-focus person \
--use-8bit \
--lr 1e-5 \
--lora-rank 16 \
--steps 1000

With locally available model:
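One way to get a local copy of the base model is huggingface_hub's snapshot_download; pointing --base-model at the downloaded directory is an assumption here, shown only as a sketch:

```python
from huggingface_hub import snapshot_download

# Download the Z-Image-Turbo checkpoint once into a local directory;
# the idea (assumed, not confirmed above) is to pass this path as --base-model
local_dir = snapshot_download(
    repo_id="Tongyi-MAI/Z-Image-Turbo",
    local_dir="models/Z-Image-Turbo",
)
print(local_dir)
```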
Suggested Z-Image-Turbo starting values:

- Learning rate: 1e-5 (lower than SD/SDXL)
- LoRA rank: 16 (can go higher for more capacity)
- Steps: 500-1500 for characters/styles
Note: Z-Image-Turbo uses a training adapter by default to prevent the distillation from breaking during training. This is recommended for short training runs (styles, concepts, characters).
After training, check the training_log.json in your output directory:
{
"base_folder": "/absolute/path/to/my_images",
"trained": ["image1.png", "image2.png"],
"skipped": ["image3.png"],
"failed": []
}
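To check the result programmatically, a small sketch that assumes the log layout shown above and the output directory naming used in the examples below:

```python
import json
from pathlib import Path

# Adjust to your actual output directory
log_path = Path("stable-diffusion-v1-5_my_images") / "training_log.json"
log = json.loads(log_path.read_text())

print(f"trained: {len(log['trained'])}, skipped: {len(log['skipped'])}, failed: {len(log['failed'])}")
for name in log["skipped"]:
    print("skipped (target object not found):", name)
```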
For Stable Diffusion:

uv run python src/generate.py sd \
--base-model runwayml/stable-diffusion-v1-5 \
--lora-path stable-diffusion-v1-5_my_images \
--prompt "a photo of a sks person" \
--output result.png

For Z-Image-Turbo:
uv run python src/generate.py zimage \
--lora-path zimage-turbo_my_images \
--prompt "a photo of sks person, professional studio lighting" \
--output result.png

Important: Use the same trigger word ("sks" in this example) that you specified in --instance-prompt during training.
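If you prefer to load the adapter directly with diffusers instead of generate.py, a minimal sketch (assuming the saved adapter_model.safetensors is in a diffusers/PEFT-compatible LoRA format):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Point this at the training output directory
pipe.load_lora_weights("stable-diffusion-v1-5_my_images")

# Keep the trigger word from --instance-prompt in the prompt
image = pipe(
    "a photo of a sks person, professional studio lighting",
    num_inference_steps=30,
).images[0]
image.save("result.png")
```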
More generation examples:
# Portrait with different styling
uv run python src/generate.py sd --lora-path <path> --prompt "portrait of sks person, oil painting"
# Different context
uv run python src/generate.py sd --lora-path <path> --prompt "sks person in a futuristic city"

SD/SDXL Training Options:

| Option | Description | Default |
|---|---|---|
| `--input-dir` | Path to training images | Required |
| `--output-dir` | Output directory for LoRA | Current directory |
| `--base-model` | Hugging Face model ID or local path | runwayml/stable-diffusion-v1-5 |
| `--resolution` | Training image resolution | 512 |
| `--crop-focus` | Object to focus on (e.g., "person", "face", "dog") | None (center crop) |
| `--use-qlora` | Enable 4-bit quantization | False |
| `--instance-prompt` | Training prompt with trigger word | "a photo of a sks person" |
| `--steps` | Number of training steps | 1000 |
| `--epochs` | Number of epochs (overrides steps) | None |
Z-Image Training Options:

| Option | Description | Default |
|---|---|---|
| `--input-dir` | Path to training images | Required |
| `--output-dir` | Output directory for LoRA | Current directory |
| `--base-model` | Z-Image model ID | Tongyi-MAI/Z-Image-Turbo |
| `--resolution` | Training image resolution | 1024 |
| `--crop-focus` | Object to focus on | None (center crop) |
| `--use-8bit` | Enable 8-bit quantization | False |
| `--no-training-adapter` | Disable de-distillation adapter | False (adapter enabled) |
| `--instance-prompt` | Training prompt with trigger word | "a photo of a sks person" |
| `--steps` | Number of training steps | 1000 |
| `--lr` | Learning rate | 1e-5 |
| `--lora-rank` | LoRA rank | 16 |
| `--lora-alpha` | LoRA alpha | 16 |
| `--save-steps` | Save checkpoint every N steps | 500 |
About --instance-prompt:
The instance prompt contains a trigger word (like "sks") that the model learns to associate with your training images. This trigger word is what you'll use later when generating images with the LoRA.
- Use a unique, uncommon token (e.g., "sks", "xyz", "abc123")
- Include the class name (e.g., "person", "dog", "style")
- Example: "a photo of sks person" → use "sks person" in generation prompts
SD Generation Options:
| Option | Description | Default |
|---|---|---|
| `--base-model` | Base model ID or path | runwayml/stable-diffusion-v1-5 |
| `--lora-path` | Path to trained LoRA | Required |
| `--prompt` | Generation prompt | Required |
| `--output` | Output filename | output.png |
| `--steps` | Inference steps | 30 |
Z-Image Generation Options:
| Option | Description | Default |
|---|---|---|
| `--base-model` | Z-Image model ID | Tongyi-MAI/Z-Image-Turbo |
| `--lora-path` | Path to trained LoRA | Required |
| `--prompt` | Generation prompt | Required |
| `--output` | Output filename | output.png |
| `--width` | Image width | 1024 |
| `--height` | Image height | 1024 |
| `--steps` | Inference steps | 8 |
| `--seed` | Random seed | None (random) |
| `--lora-scale` | LoRA weight scale | 1.0 |
When you specify --crop-focus, the tool uses YOLO11 to detect objects in your images:
- Supported objects: Any object in the COCO dataset (person, dog, cat, car, etc.)
- Behavior: Images without the target object are automatically skipped
- Fallback: If no focus is specified, images are center-cropped
Example crop focuses:
- person - Crops to people
- face - Crops to faces (use "person" for full body)
- dog, cat - Crops to animals
- car, truck - Crops to vehicles
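The detection-and-crop idea looks roughly like this (a sketch of the concept using the ultralytics API, not this tool's exact implementation):

```python
from PIL import Image
from ultralytics import YOLO

model = YOLO("yolo11n-seg.pt")  # a small YOLO11 segmentation checkpoint
image_path = "my_images/photo1.jpg"
result = model(image_path)[0]

# Find a detection whose COCO class name matches the requested crop focus
target = "person"
match = next((b for b in result.boxes if result.names[int(b.cls)] == target), None)

if match is None:
    print("skip: no", target, "found")  # such images are skipped and logged
else:
    x1, y1, x2, y2 = match.xyxy[0].tolist()
    # The real tool also adds padding around the detection before resizing
    Image.open(image_path).crop((x1, y1, x2, y2)).save("cropped.jpg")
```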
Convert diffusers models or safetensor files to reduced precision formats:
# Convert a model directory to bfloat16 (default)
uv run python scripts/convert_floats.py --input H:/my-model
# Output: H:/my-model-bf16
# Convert to float8 e4m3fn (higher precision, good for inference)
uv run python scripts/convert_floats.py --input H:/my-model --dtype e4m3fn
# Output: H:/my-model-e4m3fn
# Convert to float8 e5m2 (wider range)
uv run python scripts/convert_floats.py --input H:/my-model --dtype e5m2
# Output: H:/my-model-e5m2
# Convert a single safetensors file
uv run python scripts/convert_floats.py --input H:/models/model.safetensors
# Output: H:/models/model-bf16.safetensors

Supported formats:

- bf16 - bfloat16 (16-bit, ~50% size reduction from fp32)
- e4m3fn - float8 with 4-bit exponent, 3-bit mantissa (higher precision)
- e5m2 - float8 with 5-bit exponent, 2-bit mantissa (wider dynamic range)
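The core of such a conversion (a sketch of the idea, not the script itself) is loading each tensor, casting it, and re-saving:

```python
import torch
from safetensors.torch import load_file, save_file

# Hypothetical file names; the real script derives the output name from --dtype
src = "model.safetensors"
dst = "model-bf16.safetensors"

state = load_file(src)
# Cast floating-point tensors to bfloat16 and leave integer tensors untouched
converted = {
    k: v.to(torch.bfloat16) if v.is_floating_point() else v
    for k, v in state.items()
}
save_file(converted, dst)
```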
Run tests and linting:

uv run pytest tests/
uv run ruff check --fix
uv run ruff format

After training, your output directory will contain:
stable-diffusion-v1-5_my_images/
├── adapter_config.json # LoRA configuration
├── adapter_model.safetensors # LoRA weights
├── training_log.json # Processing log
├── logs/ # Training logs
└── processed_images/ # Preprocessed images
MIT
- Built with Ultralytics YOLO11
- Uses Hugging Face Diffusers
- LoRA implementation via PEFT
- Z-Image-Turbo by Tongyi-MAI
- Z-Image training adapter by ostris
