In this example, we'll train a Kandinsky 5.0 Video LoRA (Lite or Pro) using the HunyuanVideo VAE and dual text encoders.
Kandinsky 5.0 Video is a heavy model. It combines:
- Qwen2.5-VL (7B): A massive vision-language text encoder.
- HunyuanVideo VAE: A high-quality 3D VAE.
- Video Transformer: A complex DiT architecture.
This setup is VRAM-intensive, though the "Lite" and "Pro" variants have different requirements.
- Lite Model Training: Surprisingly efficient, capable of training on ~13GB VRAM.
- Note: The initial VAE pre-caching step requires significantly more VRAM due to the massive HunyuanVideo VAE. You may need to use CPU offloading or a larger GPU just for the caching phase.
- Tip: Set `"offload_during_startup": true` in your `config.json` to ensure the VAE and text encoder are not loaded to the GPU at the same time, which significantly reduces pre-caching memory pressure.
- If the VAE OOMs: Set `--vae_enable_patch_conv=true` to slice the HunyuanVideo VAE's 3D convolutions; expect a small speed hit but lower peak VRAM.
- Pro Model Training: Requires FSDP2 (multi-GPU) or aggressive Group Offload with LoRA to fit on consumer hardware. Specific VRAM/RAM requirements have not been established, but "the more, the merrier" applies.
- System RAM: Testing was comfortable on a system with 45GB RAM for the Lite model. 64GB+ is recommended to be safe.
For almost any single-GPU setup training the Pro model, you must enable grouped offloading. It is optional but recommended for Lite to save VRAM for larger batches/resolutions.
Add this to your config.json:
View example config
```json
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "group_offload_use_stream": true
}
```

Ensure Python 3.12 is installed.
```bash
python --version
```

```bash
pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130
```

See INSTALL.md for advanced installation options.

```bash
simpletuner server
```

Access at http://localhost:8001.
Run the helper script:
```bash
simpletuner configure
```

Or copy the example and edit manually:

```bash
cp config/config.json.example config/config.json
```

Key settings for Kandinsky 5 Video:
- `model_family`: `kandinsky5-video`
- `model_flavour`:
  - `t2v-lite-sft-5s`: Lite model, ~5s output. (Default)
  - `t2v-lite-sft-10s`: Lite model, ~10s output.
  - `t2v-pro-sft-5s-hd`: Pro model, ~5s, higher-definition training.
  - `t2v-pro-sft-10s-hd`: Pro model, ~10s, higher-definition training.
  - `i2v-lite-5s`: Image-to-video Lite, 5s outputs (requires conditioning images).
  - `i2v-pro-sft-5s`: Image-to-video Pro SFT, 5s outputs (requires conditioning images).
  - (Pretrain variants are also available for all of the above.)
- `train_batch_size`: `1`. Do not increase this unless you have an A100/H100.
- `validation_resolution`: `512x768` is a safe default for testing. `720x1280` (720p) is possible but heavy.
- `validation_num_video_frames`: Must be compatible with the VAE's 4x temporal compression.
  - For 5s (at ~12-24fps): Use `61` or `49`.
  - Formula: `(frames - 1) % 4 == 0`.
- `validation_guidance`: `5.0`.
- `frame_rate`: Default is 24.
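The frame-count rule above can be checked with a few lines of Python. This is just a sketch to make the formula concrete; the helper name is illustrative, not part of SimpleTuner's API:

```python
def is_valid_frame_count(frames: int) -> bool:
    """A frame count survives the VAE's 4x temporal compression
    when (frames - 1) is divisible by 4."""
    return frames >= 1 and (frames - 1) % 4 == 0

# Valid choices near 5 seconds at ~12 fps:
print([f for f in range(45, 66) if is_valid_frame_count(f)])
# [45, 49, 53, 57, 61, 65]
```

Both of the recommended values (`49` and `61`) satisfy the rule, while round numbers like `60` do not.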
To reduce flicker and keep subjects stable across frames:
- In Training → Loss functions, enable CREPA.
- Recommended starting values: Block Index = 8, Weight = 0.5, Adjacent Distance = 1, Temporal Decay = 1.0.
- Keep the default vision encoder (`dinov2_vitg14`, size `518`) unless you need a smaller one (`dinov2_vits14` + `224`).
- Requires network access (or a cached torch hub) to fetch DINOv2 weights the first time.
- Only enable Drop VAE Encoder if you are training entirely from cached latents; otherwise leave it off.
Show advanced experimental details
SimpleTuner includes experimental features that can significantly improve training stability and performance.
- Scheduled Sampling (Rollout): reduces exposure bias and improves output quality by letting the model generate its own inputs during training.
⚠️ These features increase the computational overhead of training.
Video datasets require careful setup. Create config/multidatabackend.json:
```json
[
  {
    "id": "my-video-dataset",
    "type": "local",
    "dataset_type": "video",
    "instance_data_dir": "datasets/videos",
    "caption_strategy": "textfile",
    "resolution": 512,
    "video": {
      "num_frames": 61,
      "min_frames": 61,
      "frame_rate": 24,
      "bucket_strategy": "aspect_ratio"
    },
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/kandinsky5",
    "disabled": false
  }
]
```

In the video subsection:
- `num_frames`: Target frame count for training.
- `min_frames`: Minimum video length (shorter videos are discarded).
- `max_frames`: Maximum video length filter.
- `bucket_strategy`: How videos are grouped into buckets:
  - `aspect_ratio` (default): Group by spatial aspect ratio only.
  - `resolution_frames`: Group by `WxH@F` format (e.g., `1920x1080@61`) for mixed-resolution/duration datasets.
- `frame_interval`: When using `resolution_frames`, round frame counts to this interval.
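To make the bucketing options concrete, here is a minimal sketch of how `resolution_frames` keys might be built. This is illustrative only, not SimpleTuner's actual implementation, and rounding frame counts downward to the interval is an assumption:

```python
def bucket_key(width: int, height: int, frames: int,
               strategy: str = "aspect_ratio",
               frame_interval: int = 4) -> str:
    """Illustrative bucket-key builder (not SimpleTuner's real code).

    aspect_ratio: group by spatial aspect ratio only.
    resolution_frames: group by WxH@F, with the frame count rounded
    down to the nearest multiple of frame_interval (assumed direction).
    """
    if strategy == "resolution_frames":
        rounded = (frames // frame_interval) * frame_interval
        return f"{width}x{height}@{rounded}"
    return f"{round(width / height, 2)}"

print(bucket_key(1920, 1080, 61, "resolution_frames"))  # 1920x1080@60
print(bucket_key(1920, 1080, 61))                       # 1.78
```

Under `resolution_frames`, clips with the same resolution but slightly different lengths land in the same bucket once their frame counts are rounded to the interval.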
See caption_strategy options and requirements in DATALOADER.md.
```bash
mkdir -p datasets/videos
# Place .mp4 / .mov files here.
# Place corresponding .txt files with the same filename for captions.
```

```bash
wandb login
huggingface-cli login
```

```bash
simpletuner train
```

Video training is extremely demanding. If you OOM:
- Reduce Resolution: Try 480p (`480x854` or similar).
- Reduce Frames: Drop `validation_num_video_frames` and the dataset's `num_frames` to `33` or `49`.
- Check Offload: Ensure `--enable_group_offload` is active.
- Black/Noise Videos: Often caused by `validation_guidance` being too high (> 6.0) or too low (< 2.0). Stick to `5.0`.
- Motion Jitter: Check whether your dataset frame rate matches the model's trained frame rate (often 24fps).
- Stagnant/Static Video: The model might be undertrained, or the prompt isn't describing motion. Use prompts like "camera pans right", "zoom in", "running", etc.
TREAD works for video too and is highly recommended to save compute.
Add to config.json:
View example config
```json
{
  "tread_config": {
    "routes": [
      {
        "selection_ratio": 0.5,
        "start_layer_idx": 2,
        "end_layer_idx": -2
      }
    ]
  }
}
```

This can speed up training by ~25-40% depending on the ratio.
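For intuition about where that speedup comes from, here is a back-of-the-envelope estimate. It is a sketch only: the 36-layer count is hypothetical, and it assumes `selection_ratio` is the fraction of tokens skipped in the routed layers and that per-layer cost scales linearly with tokens kept:

```python
def approx_tread_savings(num_layers: int, start: int, end: int,
                         selection_ratio: float) -> float:
    """Rough theoretical upper bound on compute saved by token routing.

    Layers in [start, end) are assumed to process only
    (1 - selection_ratio) of the tokens; negative indices wrap like
    Python slices. Illustrative arithmetic, not a profiler result.
    """
    start, end = start % num_layers, end % num_layers
    routed_layers = end - start
    return routed_layers * selection_ratio / num_layers

# With a hypothetical 36-layer DiT and the config above:
print(f"{approx_tread_savings(36, 2, -2, 0.5):.0%}")  # ~44% upper bound
```

Real-world speedups land below this bound (hence the ~25-40% figure), since attention, offloading, and the unrouted layers do not shrink with the token count.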
If using i2v flavours:
- SimpleTuner automatically extracts the first frame of training videos to use as the conditioning image.
- The pipeline automatically masks the first frame during training.
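Conceptually, that first-frame masking is just a boolean vector over the frame axis marking which frame serves as clean conditioning. This numpy sketch is illustrative rather than SimpleTuner's internal representation:

```python
import numpy as np

def first_frame_mask(num_frames: int) -> np.ndarray:
    """Boolean mask over the frame axis: True marks the conditioning
    frame kept clean, False marks the frames being denoised.
    Illustrative only; not SimpleTuner's internal code."""
    mask = np.zeros(num_frames, dtype=bool)
    mask[0] = True
    return mask

mask = first_frame_mask(61)
print(mask[:4])  # [ True False False False]
```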
For validation with i2v models, you have two options:
- Auto-extracted first frame: By default, validation uses the first frame from video samples.
- Separate image dataset (simpler setup): Use `--validation_using_datasets=true` with `--eval_dataset_id` pointing to an image dataset:
```json
{
  "validation_using_datasets": true,
  "eval_dataset_id": "my-image-dataset"
}
```

This allows using any image dataset as the first-frame conditioning input for validation videos, without needing the complex conditioning dataset pairing used during training.