📄 Paper | 🌐 Project-Page | 🤗 ILLUME+ Models | 🤗 ILLUME+ Demo
Welcome to the official repository for our paper: "ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement"
We present ILLUME+, a unified model that leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified Multimodal Large Language Model (MLLM) and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications.
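As a rough illustration of the dual-tokenization idea, the toy numpy sketch below quantizes two parallel feature streams against a semantic and a pixel codebook (sizes follow the paper: 32K semantic + 98K pixel entries). The nearest-neighbour lookup is generic vector quantization, not the actual DualViTok implementation; all feature shapes here are invented for illustration.

```python
import numpy as np

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return the index of the nearest codebook entry for each feature vector."""
    # squared distances via expansion, avoiding an (N, K, d) intermediate
    d2 = ((features ** 2).sum(1, keepdims=True)
          - 2.0 * features @ codebook.T
          + (codebook ** 2).sum(1))
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
sem_codebook = rng.normal(size=(32768, 16))   # semantic branch codebook
pix_codebook = rng.normal(size=(98304, 8))    # pixel branch codebook

sem_feats = rng.normal(size=(64, 16))  # e.g. an 8x8 grid of semantic features
pix_feats = rng.normal(size=(64, 8))   # matching pixel-level features

sem_ids = quantize(sem_feats, sem_codebook)   # coarse, text-aligned tokens
pix_ids = quantize(pix_feats, pix_codebook)   # fine, texture-preserving tokens
print(sem_ids.shape, pix_ids.shape)  # (64,) (64,)
```

Both token streams are then interleaved into the MLLM's discrete output space, which is what makes a single language model able to emit images.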
- Release model checkpoint and inference code.
- Release training and inference code for the vision tokenizer and MLLM.
- Release 7B LLM checkpoint.
- Release training code for the diffusion decoder.
| Model Name | 🤗 HF Format | Origin Format | Config |
|---|---|---|---|
| ILLUME+ 3B | Link | Link | Config |
| ILLUME+ 7B | Link | Link | Config |
Performance
| Model | POPE | MMBench | SEED | MME-P | MM-Vet | MMMU | AI2D | VQA-text | ChartQA | DocVQA | InfoVQA | OCRBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ILLUME+ 3B | 87.6 | 80.8 | 73.3 | 1414.0 | 40.3 | 44.3 | 74.2 | 69.9 | 69.9 | 80.8 | 44.1 | 672 |
| ILLUME+ 7B | 88.7 | 79.3 | 74.3 | 1547.1 | 47.7 | 37.6 | 78.0 | 75.2 | 82.1 | 88.6 | 57.5 | 772 |
| Model | MJHQ30k FID ↓ | GenAI-bench Basic | GenAI-bench Advanced | GenEval Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. |
|---|---|---|---|---|---|---|---|---|---|---|
| ILLUME+ 3B | 6.00 | 0.72 | 0.71 | 0.72 | 0.99 | 0.88 | 0.62 | 0.84 | 0.42 | 0.53 |
| ILLUME+ 7B | 5.78 | 0.72 | 0.72 | 0.74 | 0.99 | 0.88 | 0.60 | 0.87 | 0.54 | 0.58 |
Note that the data used to train the 7B model differs slightly from that of the 3B model.
| Model Name | Codebook Size | Checkpoint | Config | Diffusion Decoder |
|---|---|---|---|---|
| DualViTok | 32K (semantic) + 98K (pixel) | Link | Config | SDXL |
- Set up the environment:

```bash
git clone https://github.com/illume-unified-mllm/ILLUME_plus
cd ILLUME_plus
conda create -n illume python=3.9 -y
conda activate illume
```

- Install the required packages (note that the instructions differ between GPUs and NPUs):
```bash
cd ILLUME_plus
export CODE_DIR=$(pwd)

# make sure you set the environment variables
export PYTHONPATH=$PYTHONPATH:$CODE_DIR/ILLUME/
export PYTHONPATH=$PYTHONPATH:$CODE_DIR/vision_tokenizer/

# upgrade pip and setuptools if necessary
pip install -U pip setuptools

# install packages for ILLUME
cd $CODE_DIR/ILLUME
pip install -e .          # for NVIDIA GPUs
pip install -e ".[npu]"   # OR for Ascend NPUs

# (GPU only) install flash attention for better efficiency
pip install flash-attn --no-build-isolation
```

This section consolidates the download steps for all necessary pre-trained models. Ensure these are downloaded before proceeding to the demos or inference.
- Download the ILLUME+ MLLM checkpoint (the main model checkpoint):

```bash
huggingface-cli download ILLUME-MLLM/illume_plus-qwen2_5-3b --local-dir=checkpoints/illume_plus-qwen2_5-3b
```

- Download the DualViTok (vision tokenizer) checkpoint:

```bash
huggingface-cli download ILLUME-MLLM/dualvitok --local-dir=checkpoints/dualvitok
```

- Download the SDXL decoder model (used for decoding images during generation):

```bash
huggingface-cli download ILLUME-MLLM/dualvitok-sdxl-decoder --local-dir=checkpoints/dualvitok-sdxl-decoder
```
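The same three downloads can be scripted with the huggingface_hub Python API (`snapshot_download`), mirroring the CLI commands above; the repo ids and target directories are taken verbatim from them.

```python
# Python equivalent of the huggingface-cli commands above.
CHECKPOINTS = {
    "ILLUME-MLLM/illume_plus-qwen2_5-3b": "checkpoints/illume_plus-qwen2_5-3b",
    "ILLUME-MLLM/dualvitok": "checkpoints/dualvitok",
    "ILLUME-MLLM/dualvitok-sdxl-decoder": "checkpoints/dualvitok-sdxl-decoder",
}

def download_all():
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    for repo_id, local_dir in CHECKPOINTS.items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)

if __name__ == "__main__":
    download_all()
```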
Please check our model's usage and inference examples on HuggingFace: illume_plus-qwen2_5-3b-hf
To run inference using the models downloaded in the "Setup & Installation" section, follow inference.ipynb.
Try out ILLUME+ in your browser using our interactive Demo
To host your own demo locally, follow these steps:
```bash
# using the 🤗 HF format checkpoint
cd ILLUME
python app_hf.py \
    --model_name ILLUME-MLLM/illume_plus-qwen2_5-3b-hf \
    --diffusion_decoder_path ILLUME-MLLM/dualvitok-sdxl-decoder \
    --tokenizer_path ILLUME-MLLM/dualvitok \
    --torch_dtype bf16
```
```bash
# using the origin format checkpoint
cd ILLUME

## create the link to the `output_dir` defined in the config
mkdir -p ./logdir/illume_plus_3b/
ln -s $(pwd)/../checkpoints/illume_plus-qwen2_5-3b $(pwd)/logdir/illume_plus_3b/illume_plus-qwen2_5-3b_stage3/

## run app.py
python app.py --config ../configs/example/illume_plus_3b/illume_plus_qwen2_5_3b_stage3.py \
    --tokenizer_config ../configs/example/dualvitok/dualvitok_anyres_max512.py \
    --tokenizer_checkpoint ../checkpoints/dualvitok/pytorch_model.bin \
    --diffusion_decoder_path ../checkpoints/dualvitok-sdxl-decoder \
    --torch_dtype=bf16
```

Note that we implement InterleavedLogitsProcessor in inference for three key reasons:
- To activate the image generation mode with classifier-free guidance when encountering `<start_of_image>` tokens.
- To handle the varying number of image tokens across different resolutions and to keep semantic-level tokens and pixel-level tokens properly aligned in each line.
- To prevent sampling of incorrect modality tokens when `do_sample=True` is enabled during text or image generation.
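The modality-masking part of this idea can be sketched in a few lines: while image tokens are being generated, every logit outside the active visual-token range is pushed to -inf, so sampling with `do_sample=True` can never emit a wrong-modality token. The vocabulary layout below is a toy assumption, not ILLUME+'s actual token ids.

```python
import numpy as np

def mask_to_range(logits: np.ndarray, lo: int, hi: int) -> np.ndarray:
    """Keep logits in [lo, hi); push everything else to -inf."""
    masked = np.full_like(logits, -np.inf)
    masked[lo:hi] = logits[lo:hi]
    return masked

TEXT_VOCAB = 1000              # toy text vocabulary size
SEM_LO, SEM_HI = 1000, 1320    # toy id range for semantic image tokens

logits = np.random.default_rng(0).normal(size=1500)
image_logits = mask_to_range(logits, SEM_LO, SEM_HI)
print(SEM_LO <= image_logits.argmax() < SEM_HI)  # True
```

The real processor additionally switches the allowed range per position to interleave semantic- and pixel-level tokens line by line.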
```bash
cd vision_tokenizer
python app.py ../configs/example/dualvitok/dualvitok_anyres_max512.py \
    --vq-ckpt ../checkpoints/dualvitok/pytorch_model.bin \
    --sdxl-decoder-path ../checkpoints/dualvitok-sdxl-decoder
```

To train the DualViTok, navigate to the vision_tokenizer directory and use the provided scripts.
We provide examples for training the tokenizer on ImageNet. You can add more datasets in the config file.
First, link the ImageNet dataset as follows:
```bash
cd vision_tokenizer
mkdir data
ln -s /path/to/imagenet_train_set/ ./data/imagenet_train
ln -s /path/to/imagenet_val_set/ ./data/imagenet_val
```

Example 1. Training with fixed 256 resolution: this run uses images resized and center-cropped to 256x256.
```bash
cd vision_tokenizer
torchrun --nproc_per_node 8 tokenizer/train_dualvitok.py ../configs/example/dualvitok/dualvitok_fix256.py
```

Example 2. Training with variable resolution (max 512x512): this run uses images with a maximum resolution of 512x512 and groups samples with similar resolutions into batches.
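The grouping-by-resolution idea can be sketched as follows: image sizes are rounded to a grid and samples falling in the same bucket are batched together. The bucket granularity (`step=64`) is an illustrative assumption, not the value used by the training script.

```python
from collections import defaultdict

def bucket_by_size(sizes, step=64):
    """Group (width, height) pairs into buckets of rounded size."""
    buckets = defaultdict(list)
    for i, (w, h) in enumerate(sizes):
        key = (round(w / step) * step, round(h / step) * step)
        buckets[key].append(i)
    return dict(buckets)

sizes = [(512, 512), (510, 514), (256, 384), (260, 380)]
print(bucket_by_size(sizes))  # {(512, 512): [0, 1], (256, 384): [2, 3]}
```

Batching same-shape samples avoids padding waste and keeps tensor shapes uniform within each step.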
```bash
cd vision_tokenizer

# First, record all image sizes to a JSON file.
python scripts/read_folder_image_sizes.py --input_folder ./data/imagenet_train/ --output_json ./data/json_files/imagenet_train.json

# Then run the script to train the vision tokenizer.
torchrun --nproc_per_node 8 tokenizer/train_dualvitok.py ../configs/example/dualvitok/dualvitok_anyres_max512.py
```

We use torch_fidelity to calculate the rFID. Make sure to install it before evaluation.
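Once the reconstructions are saved alongside the reference images, the rFID can be computed with torch_fidelity's `calculate_metrics`; the two directory paths below are placeholders, not paths the scripts are known to use.

```python
# Hedged sketch of the rFID computation with torch_fidelity.
def rfid_kwargs(recon_dir: str, ref_dir: str) -> dict:
    """Arguments for torch_fidelity.calculate_metrics comparing two image folders."""
    return dict(input1=recon_dir, input2=ref_dir, fid=True, cuda=True)

if __name__ == "__main__":
    from torch_fidelity import calculate_metrics  # pip install torch-fidelity
    metrics = calculate_metrics(**rfid_kwargs("reconstructions/", "imagenet_val_refs/"))
    print(metrics["frechet_inception_distance"])
```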
To run reconstruction inference on the ImageNet validation set (50k images):
```bash
cd vision_tokenizer
torchrun --nproc_per_node 8 tokenizer/reconstruction_vq_ddp.py ../configs/example/dualvitok/dualvitok_anyres_max512.py \
    --vq-ckpt=../checkpoints/dualvitok/pytorch_model.bin --model-dtype fp32
```

Step 1: Prepare data and the tokenizer checkpoint. Please refer to Data.md.
Step 2: Prepare the MLLM.
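Conceptually, this step enlarges the LLM's embedding table with freshly initialized rows for the 32768 semantic and 98304 pixel vision tokens. The toy numpy sketch below illustrates the idea; the initialization scheme and the embedding width are illustrative assumptions, not what the script actually does.

```python
import numpy as np

def extend_embeddings(embed: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    """Append n_new rows drawn at the same scale as the existing table."""
    rng = np.random.default_rng(seed)
    new_rows = rng.normal(0.0, embed.std(), size=(n_new, embed.shape[1]))
    return np.concatenate([embed, new_rows], axis=0)

# 151936 rows stand in for the Qwen2.5 vocabulary; width 32 is a toy value.
text_embed = np.random.default_rng(1).normal(size=(151_936, 32))
extended = extend_embeddings(text_embed, 32768 + 98304)
print(extended.shape)  # (283008, 32)
```

The LM head is typically extended the same way so the model can both read and emit the new vision token ids.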
```bash
# extend the vision codebook into the LLM
python ILLUME/scripts/prepare_llm_with_extended_vision_tokenizer.py \
    --model_path Qwen/Qwen2.5-3B-Instruct \
    --semantic_codebook_size 32768 \
    --pixel_codebook_size 98304 \
    --output_model_path checkpoints/Qwen2.5-3B-Instruct-with-vision-tokenizer-32k-96k-level2
```

Configure configs/example/illume_debug/illume_debug.py and run the training command:
```bash
export PYTHONPATH=$PYTHONPATH:$CODE_DIR/ILLUME/
export PYTHONPATH=$PYTHONPATH:$CODE_DIR/vision_tokenizer/
cd ILLUME
torchrun --nproc_per_node=8 illume/train/train.py ../configs/example/illume_debug/illume_debug.py
```

If you want to finetune the pretrained model, set model_args.language_model.pretrained_model_name_or_path to the pretrained checkpoint:
```bash
torchrun --nproc_per_node=8 illume/train/train.py ../configs/example/illume_debug/illume_debug.py --model_args.language_model.pretrained_model_name_or_path='/path/to/checkpoints/'
```

First, please install lmms-eval to evaluate the image understanding tasks. The scripts were evaluated with lmms-eval v0.3.0.
Copy the files under ILLUME/scripts/lmms_eval/ into lmms-eval/lmms_eval/, preserving the directory structure:
```bash
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cp ILLUME/scripts/lmms_eval/models/illume.py lmms-eval/lmms_eval/models/illume.py
cp ILLUME/scripts/lmms_eval/models/__init__.py lmms-eval/lmms_eval/models/__init__.py
cp ILLUME/scripts/lmms_eval/tasks/mme/* lmms-eval/lmms_eval/tasks/mme/
cd lmms-eval
git checkout v0.3.0
pip install -e .
cd ..
```

Then, pass the config via --model_args to evaluate a specific model. The MLLM checkpoint should be located in the training_args.output_dir defined in the config ../configs/example/illume_plus_3b/illume_plus_qwen2_5_3b_stage3.py. Run the following command:
```bash
cd ILLUME
accelerate launch --num_processes 8 -m lmms_eval \
    --model illume --tasks mme_nopost --batch_size 1 \
    --log_samples --log_samples_suffix illume_plus_3b \
    --output_path ./logs/ --model_args pretrained=../configs/example/illume_plus_3b/illume_plus_qwen2_5_3b_stage3.py
```

The inference datasets are defined in meta_dataset_configs.py.
See test data format of examples in t2i_test_examples.jsonl.
To run text-to-image generation:
```bash
cd ILLUME
bash scripts/inference_text_to_image.sh
```

The inference datasets are defined in meta_dataset_configs.py.
See test data format of examples in edit_test_examples.jsonl.
To run image editing inference:
```bash
cd ILLUME
bash scripts/inference_image_editing.sh
```

We would like to acknowledge LLaVA, EMOVA, and LlamaGen for their inspiring work.
If you find our paper helpful, please consider citing our papers and giving us a star!
```bibtex
@article{huang2025illume+,
  title={ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement},
  author={Huang, Runhui and Wang, Chunwei and Yang, Junwei and Lu, Guansong and Yuan, Yunlong and Han, Jianhua and Hou, Lu and Zhang, Wei and Hong, Lanqing and Zhao, Hengshuang and others},
  journal={arXiv preprint arXiv:2504.01934},
  year={2025}
}
```