
Hunyuan Video 1.5 Quickstart

This guide walks through training a LoRA on Tencent's 8.3B Hunyuan Video 1.5 release (tencent/HunyuanVideo-1.5) using SimpleTuner.

Hardware requirements

Hunyuan Video 1.5 is a large model (8.3B parameters).

  • Minimum: 24GB-32GB VRAM for a rank-16 LoRA with full gradient checkpointing at 480p.
  • Recommended: A6000 / A100 (48GB-80GB) for 720p training or larger batch sizes.
  • System RAM: 64GB+ is recommended to handle model loading.
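The VRAM floor follows from the parameter count alone. As a rough sketch (weights only, assuming bf16 at 2 bytes per parameter; activations, gradients, and LoRA optimizer state come on top):

```python
# Rough weight-memory estimate for an 8.3B-parameter model stored in bf16.
# This counts base model weights only; activations, gradients, and the
# optimizer state for the LoRA adapters add to this during training.
PARAMS = 8.3e9
BYTES_PER_PARAM = 2  # bf16

weights_gib = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"base weights: ~{weights_gib:.1f} GiB")  # ~15.5 GiB
```

With roughly 15.5 GiB consumed by the weights alone, gradient checkpointing and group offload are what make the remaining headroom of a 24GB card workable at 480p.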

Memory offloading (optional)

Add the following to your config.json:

{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "group_offload_use_stream": true
}
  • --group_offload_use_stream: Only works on CUDA devices.
  • Do not combine this with --enable_model_cpu_offload.

Prerequisites

Make sure that you have Python installed; SimpleTuner works well with versions 3.10 through 3.13.

You can check this by running:

python --version
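The same check can be run from inside Python, which is handy in notebooks or remote containers:

```python
import sys

# SimpleTuner supports Python 3.10 through 3.13.
ok = (3, 10) <= sys.version_info[:2] <= (3, 13)
print(sys.version.split()[0], "supported" if ok else "unsupported")
```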

If you don't have python 3.13 installed on Ubuntu, you can try the following:

apt -y install python3.13 python3.13-venv

Container image dependencies

For Vast, RunPod, and TensorDock (among others), the following will work on a CUDA 12.2-12.8 image to enable compilation of CUDA extensions:

apt -y install nvidia-cuda-toolkit

AMD ROCm follow-up steps

The following must be executed for an AMD MI300X to be usable:

apt install amd-smi-lib
pushd /opt/rocm/share/amd_smi
python3 -m pip install --upgrade pip
python3 -m pip install .
popd

Installation

Install SimpleTuner via pip:

pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130

For manual installation or development setup, see the installation documentation.

Required checkpoints

The main tencent/HunyuanVideo-1.5 repo contains the transformer/vae/scheduler, but the text encoder (text_encoder/llm) and vision encoder (vision_encoder/siglip) live in separate downloads. Point SimpleTuner at your local copies before launching:

export HUNYUANVIDEO_TEXT_ENCODER_PATH=/path/to/text_encoder_root
export HUNYUANVIDEO_VISION_ENCODER_PATH=/path/to/vision_encoder_root

If these are unset, SimpleTuner tries to pull them from the model repo; most mirrors do not bundle them, so set the paths explicitly to avoid startup errors.
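The lookup SimpleTuner effectively performs can be sketched as follows. The environment variable names are the real ones from above; resolve_encoder_path and its error message are hypothetical, for illustration only:

```python
import os

def resolve_encoder_path(env_var: str) -> str:
    """Return the local encoder path, failing loudly if it is unset.
    (Hypothetical helper; not part of the SimpleTuner API.)"""
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(
            f"{env_var} is unset; most mirrors of tencent/HunyuanVideo-1.5 "
            "do not bundle the encoders, so set the path explicitly."
        )
    return path

# Usage:
# text_encoder = resolve_encoder_path("HUNYUANVIDEO_TEXT_ENCODER_PATH")
# vision_encoder = resolve_encoder_path("HUNYUANVIDEO_VISION_ENCODER_PATH")
```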

Setting up the environment

Web interface method

The SimpleTuner WebUI makes setup fairly straightforward. To run the server:

simpletuner server

This will create a webserver on port 8001 by default, which you can access by visiting http://localhost:8001.

Manual / command-line method

To run SimpleTuner via command-line tools, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.

Configuration file

An experimental script, configure.py, may allow you to entirely skip this section through an interactive step-by-step configuration.

Note: This doesn't configure your dataloader. You will still have to do that manually, later.

To run it:

simpletuner configure

If you prefer to manually configure:

Copy config/config.json.example to config/config.json:

cp config/config.json.example config/config.json

Key configuration overrides for HunyuanVideo:

{
  "model_type": "lora",
  "model_family": "hunyuanvideo",
  "pretrained_model_name_or_path": "tencent/HunyuanVideo-1.5",
  "model_flavour": "t2v-480p",
  "output_dir": "output/hunyuan-video",
  "validation_resolution": "854x480",
  "validation_num_video_frames": 61,
  "validation_guidance": 6.0,
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "learning_rate": 1e-4,
  "mixed_precision": "bf16",
  "optimizer": "adamw_bf16",
  "lora_rank": 16,
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "dataset_backend_config": "config/multidatabackend.json"
}
  • model_flavour options:
    • t2v-480p (Default)
    • t2v-720p
    • i2v-480p (Image-to-Video)
    • i2v-720p (Image-to-Video)
  • validation_num_video_frames: Must satisfy (frames - 1) % 4 == 0. E.g., 61, 129.
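The frame-count rule can be checked, and an arbitrary clip length rounded down to the nearest valid value, with a small helper (hypothetical, not part of SimpleTuner):

```python
def is_valid_frame_count(frames: int) -> bool:
    """HunyuanVideo requires (frames - 1) to be divisible by 4."""
    return frames >= 1 and (frames - 1) % 4 == 0

def nearest_valid_frame_count(frames: int) -> int:
    """Round down to the closest valid count, e.g. 64 -> 61."""
    return ((frames - 1) // 4) * 4 + 1

print([f for f in (61, 62, 129) if is_valid_frame_count(f)])  # [61, 129]
print(nearest_valid_frame_count(64))  # 61
```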

Advanced Experimental Features


SimpleTuner includes experimental features that can significantly improve training stability and performance.

  • Scheduled Sampling (Rollout): reduces exposure bias and improves output quality by letting the model generate its own inputs during training.

⚠️ These features increase the computational overhead of training.

Dataset considerations

Create a --data_backend_config (config/multidatabackend.json) document containing this:

[
  {
    "id": "my-video-dataset",
    "type": "local",
    "dataset_type": "video",
    "instance_data_dir": "datasets/videos",
    "caption_strategy": "textfile",
    "resolution": 480,
    "video": {
        "num_frames": 61,
        "min_frames": 61,
        "frame_rate": 24,
        "bucket_strategy": "aspect_ratio"
    },
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/hunyuan",
    "disabled": false
  }
]

In the video subsection:

  • num_frames: Target frame count for training. Must satisfy (frames - 1) % 4 == 0.
  • min_frames: Minimum video length (shorter videos are discarded).
  • max_frames: Maximum video length filter.
  • bucket_strategy: How videos are grouped into buckets:
    • aspect_ratio (default): Group by spatial aspect ratio only.
    • resolution_frames: Group by WxH@F format (e.g., 854x480@61) for mixed-resolution/duration datasets.
  • frame_interval: When using resolution_frames, round frame counts to this interval.
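How resolution_frames bucketing groups clips can be sketched like this. The WxH@F key shape matches the example above; the rounding direction for frame_interval is an assumption, and the internal key format SimpleTuner actually uses may differ:

```python
def bucket_key(width: int, height: int, frames: int, frame_interval: int = 1) -> str:
    """Build a WxH@F bucket key, rounding the frame count down to the
    nearest multiple of frame_interval (rounding direction is an assumption)."""
    if frame_interval > 1:
        frames = (frames // frame_interval) * frame_interval
    return f"{width}x{height}@{frames}"

print(bucket_key(854, 480, 61))     # 854x480@61
print(bucket_key(854, 480, 67, 4))  # 854x480@64
```

Clips sharing a key are batched together, which is why resolution_frames suits mixed-resolution or mixed-duration datasets.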

See caption_strategy options and requirements in DATALOADER.md.

  • Text Embed Caching: Highly recommended. Hunyuan uses a large LLM text encoder. Caching saves significant VRAM during training.

Login to WandB and Huggingface Hub

wandb login
huggingface-cli login

Executing the training run

From the SimpleTuner directory:

simpletuner train

Notes & troubleshooting tips

VRAM Optimization

  • Group Offload: Essential for consumer GPUs. Ensure enable_group_offload is true.
  • Resolution: Stick to 480p (854x480 or similar) if you have limited VRAM. 720p (1280x720) increases memory usage significantly.
  • Quantization: Use base_model_precision (bf16 default); int8-torchao works for further savings at the cost of speed.
  • VAE patch convolution: For HunyuanVideo VAE OOMs, set --vae_enable_patch_conv=true (or toggle in the UI). This slices 3D conv/attention work to lower peak VRAM; expect a small throughput hit.

Image-to-Video (I2V)

  • Use model_flavour="i2v-480p" or i2v-720p.
  • SimpleTuner automatically uses the first frame of your video dataset samples as the conditioning image during training.

I2V Validation Options

For validation with i2v models, you have two options:

  1. Auto-extracted first frame: By default, validation uses the first frame from video samples in your dataset.

  2. Separate image dataset (simpler setup): Use --validation_using_datasets=true with --eval_dataset_id pointing to an image dataset. This allows you to use any image dataset as the first-frame conditioning input for validation videos, without needing to set up the complex conditioning dataset pairing used during training.

Example config for option 2:

{
  "validation_using_datasets": true,
  "eval_dataset_id": "my-image-dataset"
}

Text Encoders

Hunyuan uses a dual text encoder setup (LLM + CLIP). Ensure your system RAM can handle loading these during the caching phase.