SILMM-Implementation

Last Update: 2025.01.26; all code was implemented by me.

This is the UNOFFICIAL code implementation for the paper:

"SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation (2024.12)"

📍 Notice

I implemented this only for the SEED-LLaMA model (a discrete MLLM).
Training for only 1 iteration reproduces the paper's results. (The authors trained for 3 iterations.)

📍 5-Step (code implementation)

1. Compositional Prompt Generation
2. Diverse Image Generation
3. Decompositional Self-Questioning
4. VQA-based Self Feedback
5. Learning from Self-Feedback
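
The sketch below shows one way steps 2 to 5 could fit together in a single iteration. It is not this repository's code: `generate_images`, `ask_questions`, and `answer_question` are hypothetical callables standing in for the model calls, the score is assumed to be the fraction of "yes" answers, and pairing the best image against the worst for preference learning is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # identifier of the highest-scoring image
    rejected: str  # identifier of the lowest-scoring image


def self_feedback_score(answers: Sequence[bool]) -> float:
    """Fraction of decomposed questions answered 'yes' (assumed scoring rule)."""
    return sum(answers) / len(answers) if answers else 0.0


def build_preference_pairs(
    prompts: Sequence[str],                       # output of step 1
    generate_images: Callable[[str], List[str]],  # step 2 (hypothetical)
    ask_questions: Callable[[str], List[str]],    # step 3 (hypothetical)
    answer_question: Callable[[str, str], bool],  # step 4 (hypothetical)
) -> List[PreferencePair]:
    """Collect (chosen, rejected) image pairs to train on in step 5."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        images = generate_images(prompt)
        questions = ask_questions(prompt)
        if not images:
            continue
        # Score every candidate image with the model's own VQA answers.
        scored = [
            (self_feedback_score([answer_question(img, q) for q in questions]), img)
            for img in images
        ]
        best = max(scored, key=lambda item: item[0])
        worst = min(scored, key=lambda item: item[0])
        # Keep the prompt only if the feedback actually separates the images.
        if best[0] > worst[0]:
            pairs.append(PreferencePair(prompt, best[1], worst[1]))
    return pairs
```

The returned pairs would then be passed to whatever preference-optimization routine step 5 uses. Prompts whose images all receive the same score are skipped, since they carry no preference signal.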

📍 Baseline Model

Please refer to the original SEED-LLaMA repository.