GenDiff-PEFT is a research-focused implementation and optimization study of conditional diffusion models on CIFAR-10.
This project investigates architectural and sampling strategies to improve image quality while significantly reducing training cost.
- Multi-resolution attention (8×8 and 16×16)
- Classifier-Free Guidance (CFG) ablation study
- DDIM sampling optimization (50 → 100 steps)
- Parameter-efficient fine-tuning
- Quantitative evaluation using FID and Inception Score
- 17.1% FID improvement (290.19 → 240.48)
- 85% reduction in training time
- Empirical validation of the quality–diversity tradeoff
Diffusion models achieve state-of-the-art generative performance but are computationally expensive and sensitive to architectural design.
This project explores:
- The effect of attention resolution on semantic coherence
- The tradeoff between fidelity and diversity via CFG scaling
- The impact of sampling steps on output refinement
- Transfer learning for efficient experimentation
- Conditional UNet
- CIFAR-10 (32×32×3)
- Attention at 8×8 resolution
- 50-step DDIM sampling
- 95 epochs training
- AdamW optimizer
- Mixed precision (FP16)
- Added 16×16 multi-head self-attention
- Extended DDIM sampling (100 steps)
- Cosine learning rate scheduling
- Exponential Moving Average (EMA)
- Fine-tuned from baseline checkpoint (10 epochs)
| Configuration | FID ↓ | IS ↑ | Steps | CFG Scale |
|---|---|---|---|---|
| Baseline | 290.19 | 2.240 | 50 | 5.0 |
| Improved (w=7.5) | 240.48 | 1.563 | 100 | 7.5 |
| Improved (w=3.5) | 292.46 | 1.856 | 100 | 3.5 |
- Higher CFG improves fidelity but reduces diversity.
- Multi-resolution attention improves structural coherence.
- Increasing sampling steps significantly refines generation quality.
- Fine-tuning reduces training time from ~8.5 hours to ~1 hour.
Increasing DDIM steps from 50 to 100 improves FID by approximately 8–12 points.
Adding 16×16 attention improves mid-level feature modeling and reduces FID by 5–7 points.
CFG scale analysis reveals a fundamental quality–diversity tradeoff:
- Higher guidance → better fidelity
- Lower guidance → better diversity
- Optimizer: AdamW
- Learning Rate: 2e-4 (baseline), 1e-4 (fine-tune)
- Scheduler: Cosine Annealing
- Batch Size: 128
- Mixed Precision (FP16)
- EMA decay: 0.9999
- Deterministic DDIM sampling (η = 0)
GenDiff-PEFT/ │ ├── notebooks/ │ └── diffusion_unet_experiments.ipynb │ ├── reports/ │ └── Improving_Image_Quality_and_Training_Efficiency_in_Conditional_Diffusion_Models.pdf │ ├── models/ │ ├── baseline/ │ └── finetuned/ │ ├── results/ │ └── README.md
pip install -r requirements.txt
python train.py
python finetune.py
python sample.py --cfg 7.5 --steps 100
- Frechet Inception Distance (FID)
- Inception Score (IS)
- 10,000 generated samples per configuration
- Deterministic sampling for reproducibility
- Dynamic CFG scheduling
- Efficient attention variants (linear or sparse attention)
- Diversity-preserving guidance strategies
- Higher resolution datasets (256×256+)
- Sampling acceleration via distillation or consistency models
Krishna Koushik Unnam
M.S. Computer Science & Engineering
University of South Florida