This guide walks through training a LoRA on Tencent's Hunyuan Video 1.5 release (`tencent/HunyuanVideo-1.5`) using SimpleTuner.

Hunyuan Video 1.5 is a large model (8.3B parameters), so plan your hardware accordingly:
- Minimum: 24GB-32GB VRAM is comfortable for a Rank-16 LoRA with full gradient checkpointing at 480p.
- Recommended: A6000 / A100 (48GB-80GB) for 720p training or larger batch sizes.
- System RAM: 64GB+ is recommended to handle model loading.
Add the following to your config.json:
```json
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "group_offload_use_stream": true
}
```

- `--group_offload_use_stream`: Only works on CUDA devices.
- Do not combine this with `--enable_model_cpu_offload`.
Make sure that you have Python installed; SimpleTuner works well with 3.10 through 3.13. You can check this by running:

```shell
python --version
```

If you don't have Python 3.13 installed on Ubuntu, you can try the following:
```shell
apt -y install python3.13 python3.13-venv
```

For Vast, RunPod, and TensorDock (among others), the following will work on a CUDA 12.2-12.8 image to enable compiling of CUDA extensions:
```shell
apt -y install nvidia-cuda-toolkit
```

The following must be executed for an AMD MI300X to be usable:
```shell
apt install amd-smi-lib
pushd /opt/rocm/share/amd_smi
python3 -m pip install --upgrade pip
python3 -m pip install .
popd
```

Install SimpleTuner via pip:
```shell
pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130
```

For manual installation or development setup, see the installation documentation.
The main tencent/HunyuanVideo-1.5 repo contains the transformer/vae/scheduler, but the text encoder (text_encoder/llm) and vision encoder (vision_encoder/siglip) live in separate downloads. Point SimpleTuner at your local copies before launching:
```shell
export HUNYUANVIDEO_TEXT_ENCODER_PATH=/path/to/text_encoder_root
export HUNYUANVIDEO_VISION_ENCODER_PATH=/path/to/vision_encoder_root
```

If these are unset, SimpleTuner tries to pull them from the model repo; most mirrors do not bundle them, so set the paths explicitly to avoid startup errors.
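As a quick sanity check before a long launch, you can verify both variables point at real directories. This is a minimal sketch using the variable names above; the check itself is not part of SimpleTuner:

```shell
# Sketch: fail fast if either encoder path is unset or missing.
check_encoder_paths() {
  local ok=0
  for v in HUNYUANVIDEO_TEXT_ENCODER_PATH HUNYUANVIDEO_VISION_ENCODER_PATH; do
    # ${!v} is bash indirect expansion: the value of the variable named in $v.
    local d="${!v}"
    if [ -z "$d" ] || [ ! -d "$d" ]; then
      echo "ERROR: $v is unset or not a directory: '$d'" >&2
      ok=1
    fi
  done
  return $ok
}

check_encoder_paths && echo "encoder paths look good" || echo "fix the paths above before launching"
```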
The SimpleTuner WebUI makes setup fairly straightforward. To run the server:
```shell
simpletuner server
```

This will create a webserver on port 8001 by default, which you can access by visiting http://localhost:8001.
To run SimpleTuner via command-line tools, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.
An experimental script, configure.py, may allow you to entirely skip this section through an interactive step-by-step configuration.
Note: This doesn't configure your dataloader. You will still have to do that manually, later.
To run it:
```shell
simpletuner configure
```

If you prefer to manually configure:
Copy config/config.json.example to config/config.json:
```shell
cp config/config.json.example config/config.json
```

Key configuration overrides for HunyuanVideo:
```json
{
  "model_type": "lora",
  "model_family": "hunyuanvideo",
  "pretrained_model_name_or_path": "tencent/HunyuanVideo-1.5",
  "model_flavour": "t2v-480p",
  "output_dir": "output/hunyuan-video",
  "validation_resolution": "854x480",
  "validation_num_video_frames": 61,
  "validation_guidance": 6.0,
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "learning_rate": 1e-4,
  "mixed_precision": "bf16",
  "optimizer": "adamw_bf16",
  "lora_rank": 16,
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "dataset_backend_config": "config/multidatabackend.json"
}
```

`model_flavour` options:

- `t2v-480p` (default)
- `t2v-720p`
- `i2v-480p` (image-to-video)
- `i2v-720p` (image-to-video)
`validation_num_video_frames`: Must satisfy `(frames - 1) % 4 == 0`, e.g. 61 or 129.
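The frame-count rule is easy to check up front. A minimal sketch (not SimpleTuner code):

```shell
# Sketch: succeeds when (frames - 1) is divisible by 4.
valid_frame_count() {
  [ $(( ($1 - 1) % 4 )) -eq 0 ]
}

for f in 61 129 60; do
  if valid_frame_count "$f"; then
    echo "$f: valid"
  else
    echo "$f: invalid"
  fi
done
```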
SimpleTuner includes experimental features that can significantly improve training stability and performance.
- Scheduled Sampling (Rollout): reduces exposure bias and improves output quality by letting the model generate its own inputs during training.
⚠️ These features increase the computational overhead of training.
Create a --data_backend_config (config/multidatabackend.json) document containing this:
```json
[
  {
    "id": "my-video-dataset",
    "type": "local",
    "dataset_type": "video",
    "instance_data_dir": "datasets/videos",
    "caption_strategy": "textfile",
    "resolution": 480,
    "video": {
      "num_frames": 61,
      "min_frames": 61,
      "frame_rate": 24,
      "bucket_strategy": "aspect_ratio"
    },
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/hunyuan",
    "disabled": false
  }
]
```

In the `video` subsection:
- `num_frames`: Target frame count for training. Must satisfy `(frames - 1) % 4 == 0`.
- `min_frames`: Minimum video length (shorter videos are discarded).
- `max_frames`: Maximum video length filter.
- `bucket_strategy`: How videos are grouped into buckets:
  - `aspect_ratio` (default): Group by spatial aspect ratio only.
  - `resolution_frames`: Group by `WxH@F` format (e.g., `854x480@61`) for mixed-resolution/duration datasets.
- `frame_interval`: When using `resolution_frames`, round frame counts to this interval.
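To illustrate how a `resolution_frames` bucket key could be derived (this is an illustration, not SimpleTuner's internal code), the frame count is snapped down to the nearest valid value on the interval:

```shell
# Illustration only: build a WxH@F bucket key, rounding frames down to
# the nearest value satisfying (frames - 1) % interval == 0.
bucket_key() {  # usage: bucket_key WIDTH HEIGHT FRAMES INTERVAL
  local f=$(( ( ($3 - 1) / $4 ) * $4 + 1 ))
  echo "${1}x${2}@${f}"
}

bucket_key 854 480 63 4   # prints 854x480@61
```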
See caption_strategy options and requirements in DATALOADER.md.
- Text Embed Caching: Highly recommended. Hunyuan uses a large LLM text encoder. Caching saves significant VRAM during training.
```shell
wandb login
huggingface-cli login
```

From the SimpleTuner directory:
```shell
simpletuner train
```

- Group Offload: Essential for consumer GPUs. Ensure `enable_group_offload` is true.
- Resolution: Stick to 480p (`854x480` or similar) if you have limited VRAM. 720p (`1280x720`) increases memory usage significantly.
- Quantization: Use `base_model_precision` (`bf16` default); `int8-torchao` works for further savings at the cost of speed.
- VAE patch convolution: For HunyuanVideo VAE OOMs, set `--vae_enable_patch_conv=true` (or toggle in the UI). This slices 3D conv/attention work to lower peak VRAM; expect a small throughput hit.
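For example, applying the quantization tip in `config.json` might look like this (a sketch using only keys mentioned above):

```json
{
  "base_model_precision": "int8-torchao",
  "enable_group_offload": true
}
```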
- Use `model_flavour="i2v-480p"` or `i2v-720p`.
- SimpleTuner automatically uses the first frame of your video dataset samples as the conditioning image during training.

For validation with i2v models, you have two options:

1. Auto-extracted first frame: By default, validation uses the first frame from video samples in your dataset.
2. Separate image dataset (simpler setup): Use `--validation_using_datasets=true` with `--eval_dataset_id` pointing to an image dataset. This allows you to use any image dataset as the first-frame conditioning input for validation videos, without needing to set up the complex conditioning dataset pairing used during training.
Example config for option 2:
```json
{
  "validation_using_datasets": true,
  "eval_dataset_id": "my-image-dataset"
}
```

Hunyuan uses a dual text encoder setup (LLM + CLIP). Ensure your system RAM can handle loading these during the caching phase.