Official PyTorch implementation for "Zero-Shot Styled Text Image Generation, but Make It Autoregressive" (CVPR25)
Introduction • Installation • Training • Evaluation • Inference • Configuration • Citation
Official PyTorch implementation for "Zero-Shot-Styled-Text-Image-Generation-but-Make-It-Autoregressive", presenting Emuru: a conditional generative model that integrates a T5-based decoder with a Variational Autoencoder (VAE) for image generation conditioned on text and style images. It allows users to combine textual prompts (e.g., style text, generation text) and style images to create new, synthesized images.
The code is tested with Python 3.11.13, CUDA 12.8, and PyTorch 2.7.1 on an NVIDIA RTX 4090 GPU.
We use the Accelerate library for multi-GPU training and Weights & Biases (wandb) for logging.
Overview of the proposed Emuru.
Install the required Python packages to train the model:
```bash
conda create --name emuru python=3.11.13
conda activate emuru
pip install -r requirements.txt
```
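Since training uses the Accelerate library, you can optionally launch any of the training scripts below through `accelerate launch` instead of plain `python`. A minimal sketch (run `accelerate config` once beforehand to describe your multi-GPU setup):

```bash
# One-time interactive setup (number of GPUs, precision, etc.)
accelerate config

# Launch a training script on all configured GPUs
accelerate launch train_T5.py --vae_path "blowing-up-groundhogs/emuru_vae"
```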
Our model is composed of a Variational Autoencoder and a T5 decoder. We provide all intermediate artifacts in the Releases, but you can train your own by following the steps below.
This code is set up to stream our synthetic dataset, Font-Square.
Here we provide a minimal set of examples; more details on the parameters are given in the Configuration section.
You can either provide the path to your own trained VAE or load the pretrained VAE from Hugging Face (which we also provide in this Release):
```bash
python train_T5.py --vae_path "blowing-up-groundhogs/emuru_vae"
```
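If you trained the VAE yourself, pass the path to its checkpoint directory instead. The path below is only a placeholder for wherever your VAE training run saved its weights:

```bash
python train_T5.py --vae_path "results_vae/<your_run_id>"
```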
Training the VAE from scratch requires setting up auxiliary models for text correctness (HTR) and style (Writer ID) losses. If you don't want to train your own, you can use our pre-trained VAE from Hugging Face.
**Step 1: Set Up Auxiliary Models**

You have two options: download our pre-trained ones or train your own.
- Option A: Download Pre-trained Auxiliary Models (Recommended)
  ```bash
  mkdir -p pretrained_models
  wget -P pretrained_models https://github.com/aimagelab/Zero-Shot-Styled-Text-Image-Generation-but-Make-It-Autoregressive/releases/download/emuru_vae_htr/emuru_vae_htr.tar.gz
  wget -P pretrained_models https://github.com/aimagelab/Zero-Shot-Styled-Text-Image-Generation-but-Make-It-Autoregressive/releases/download/emuru_vae_wid/emuru_vae_wid.tar.gz
  tar -xzvf pretrained_models/emuru_vae_htr.tar.gz -C pretrained_models/
  tar -xzvf pretrained_models/emuru_vae_wid.tar.gz -C pretrained_models/
  ```
- Option B: Train Auxiliary Models From Scratch
  - Train the Handwritten Text Recognition model:

    ```bash
    python train_htr.py
    ```

  - Train the Writer Identification model:

    ```bash
    python train_writer_id.py
    ```
**Step 2: Train the VAE**

Once the auxiliary models are in the `pretrained_models/` directory, you can start the VAE training.
```bash
python train_vae.py --htr_path "pretrained_models/emuru_vae_htr" --writer_id_path "pretrained_models/emuru_vae_writer_id"
```
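As a quick sanity check, you can round-trip an image through the trained (or pretrained) VAE. The sketch below is a minimal example and assumes the checkpoint follows the diffusers `AutoencoderKL` interface; the image path and normalization are placeholders to adapt to your data:

```python
import torch
from diffusers import AutoencoderKL
from torchvision.transforms.functional import to_tensor, to_pil_image
from PIL import Image

# Load the pretrained VAE (or point from_pretrained at your local checkpoint directory)
vae = AutoencoderKL.from_pretrained("blowing-up-groundhogs/emuru_vae").eval()

# "line.png" is a placeholder for any RGB text-line image; the expected input
# range may differ from [0, 1] depending on how the VAE was trained
img = to_tensor(Image.open("line.png").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    latents = vae.encode(img).latent_dist.sample()  # image -> latent
    recon = vae.decode(latents).sample              # latent -> reconstruction

to_pil_image(recon.squeeze(0).clamp(0, 1)).save("line_recon.png")
```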
We compare Emuru with several state-of-the-art generative models, including DiffusionPen, One-DM, VATr++, VATr, and HiGAN+. We compute the evaluation metrics with the HWD library.
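As a rough illustration of how such a comparison can be scripted with HWD, see the sketch below. The module, class, and constructor names are assumptions; consult the HWD repository for the actual API:

```python
# Hypothetical sketch of scoring generated images against real ones with HWD.
# Imports, class names, and arguments are assumptions, not the confirmed API.
from hwd.datasets import FolderDataset
from hwd.scores import HWDScore, FIDScore

real = FolderDataset("path/to/real_images")       # reference style samples
fake = FolderDataset("path/to/generated_images")  # images generated by Emuru

hwd = HWDScore()
print("HWD:", hwd(real, fake))

fid = FIDScore()
print("FID:", fid(real, fake))
```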
For inference, please refer to the model card on Hugging Face.
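For convenience, here is a minimal sketch of what loading and sampling from the released model can look like. The repository ID, the `generate` method, and its arguments are assumptions, so defer to the Hugging Face model card:

```python
# Hypothetical inference sketch; the Hugging Face model card is authoritative.
from transformers import AutoModel
from PIL import Image

# Assumes the released checkpoint lives under the same namespace as the VAE
# and ships custom modeling code, hence trust_remote_code=True.
model = AutoModel.from_pretrained("blowing-up-groundhogs/emuru", trust_remote_code=True)

style_img = Image.open("style_sample.png").convert("RGB")  # placeholder style image

# style_text: transcription of the style image; gen_text: text to render in that style.
# The generate() signature and its return type are assumptions.
out = model.generate(style_text="the quick brown fox", gen_text="hello world", style_img=style_img)
out.save("generated.png")
```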
This section details the main command-line arguments you can use to configure the training scripts. The arguments common to all training scripts are:
| Parameter | Default | Description |
|---|---|---|
| `--run_id` | (random) | A unique ID for the current run, generated automatically. |
| `--eval_epochs` | `1` | How often to run evaluation, in epochs. |
| `--report_to` | `None` | Platform to report metrics to (e.g., `wandb`). |
| `--wandb_project_name` | `emuru_vae` | The project name on Weights & Biases. |
| `--wandb_entity` | `None` | Your W&B entity (username or team name). |
| `--wandb_log_interval_steps` | `25` | How often to log metrics to W&B, in steps. |
| `--checkpoints_total_limit` | `5` | The maximum number of checkpoints to keep. |
| `--resume_id` | `None` | ID of a previous run to resume training from. |
| `--epochs` | `10000` | The total number of training epochs. |
| `--lr` | `1e-4` | The initial learning rate. |
| `--lr_scheduler` | `reduce_lr_on_plateau` | The type of learning rate scheduler to use. |
| `--lr_scheduler_patience` | `5` | Patience in epochs before the scheduler reduces the learning rate. |
| `--gradient_accumulation_steps` | `1` | Number of steps to accumulate gradients before an optimizer step. |
| `--seed` | `24` | Random seed for reproducibility. |
| `--mixed_precision` | `no` | Whether to use mixed-precision training. Options: `no`, `fp16`, `bf16`. |
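For example, to log to Weights & Biases with bf16 mixed precision and gradient accumulation (the project and entity names below are placeholders):

```bash
python train_vae.py \
  --report_to wandb \
  --wandb_project_name emuru_vae \
  --wandb_entity your_wandb_team \
  --mixed_precision bf16 \
  --gradient_accumulation_steps 4
```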
Arguments specific to `train_T5.py`:

| Parameter | Default | Description |
|---|---|---|
| `--output_dir` | `results_t5` | Directory to save model checkpoints and outputs. |
| `--logging_dir` | `results_t5` | Directory where logs will be stored. |
| `--vae_path` | `blowing-up-groundhogs/emuru_vae` | Path to the pre-trained VAE checkpoint on the Hugging Face Hub. |
| `--training_type` | `pretrain` | Sets the training mode. Options: `pretrain`, `finetune`. |
| `--train_batch_size` | `2` | The batch size for the training dataset. |
| `--eval_batch_size` | `8` | The batch size for the evaluation dataset. |
| `--teacher_noise` | `0.1` | Amount of noise added during teacher forcing. |
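For instance, a hypothetical invocation that switches from pretraining to fine-tuning mode while combining the flags above:

```bash
python train_T5.py \
  --training_type finetune \
  --vae_path "blowing-up-groundhogs/emuru_vae" \
  --train_batch_size 2 \
  --teacher_noise 0.1
```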
Arguments specific to `train_vae.py`:

| Parameter | Default | Description |
|---|---|---|
| `--output_dir` | `results_vae` | Directory to save model checkpoints and outputs. |
| `--logging_dir` | `results_vae` | Directory where logs will be stored. |
| `--vae_config` | `configs/vae/VAE_64x768.json` | Path to the VAE's JSON configuration file. |
| `--htr_path` | `pretrained_models/emuru_vae_htr` | Path to the HTR model checkpoint for the auxiliary loss. |
| `--writer_id_path` | `pretrained_models/emuru_vae_writer_id` | Path to the Writer ID model checkpoint for the auxiliary loss. |
| `--train_batch_size` | `16` | The batch size for the training dataset. |
| `--eval_batch_size` | `16` | The batch size for the evaluation dataset. |
If you find this work useful, please cite it as:
```bibtex
@InProceedings{Pippi_2025_CVPR,
    author    = {Pippi, Vittorio and Quattrini, Fabio and Cascianelli, Silvia and Tonioni, Alessio and Cucchiara, Rita},
    title     = {Zero-Shot Styled Text Image Generation, but Make It Autoregressive},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {7910-7919}
}
```