LongCat‑Video is a 13.6B bilingual (zh/en) text‑to‑video and image‑to‑video model that uses flow matching, the Qwen‑2.5‑VL text encoder, and the Wan VAE. This guide walks you through setup, data prep, and running a first training/validation with SimpleTuner.
- 13.6B transformer + Wan VAE: expect higher VRAM than image models; start with `train_batch_size=1`, gradient checkpointing, and low LoRA ranks.
- System RAM: more than 32 GB is helpful for multi‑frame clips; keep datasets on fast storage.
- Apple MPS: supported for previews; positional encodings downcast to float32 automatically.
- Verify Python 3.12 (SimpleTuner ships a `.venv` by default): `python --version`
- Install SimpleTuner with the backend that matches your hardware:

  ```bash
  pip install "simpletuner[cuda]"  # NVIDIA
  pip install "simpletuner[mps]"   # Apple Silicon
  pip install "simpletuner[cpu]"   # CPU-only
  ```
- Quantisation is built in (`int8-quanto`, `int4-quanto`, `fp8-torchao`) and does not need extra manual installs in normal setups.
Run `simpletuner server`, open http://localhost:8001, and pick model family `longcat_video`.
Key defaults to keep
- Flow‑matching scheduler with shift `12.0` is automatic; no custom noise flags needed.
- Aspect buckets stay 64‑pixel aligned; `aspect_bucket_alignment` is enforced at 64.
- Max token length 512 (Qwen‑2.5‑VL); the pipeline auto‑adds empty negatives when CFG is on and no negative prompt is provided.
- Frames must satisfy `(num_frames - 1)` divisible by the VAE temporal stride (default 4). The default 93 frames already matches this (93 − 1 = 92 = 4 × 23); 81 or 121 also work, but 100 does not (99 is not a multiple of 4).
Optional VRAM savers:
- Reduce `lora_rank` (4–8) and use `int8-quanto` base precision.
- Enable group offload: `--enable_group_offload --group_offload_type block_level --group_offload_blocks_per_group 1`.
- Lower `validation_resolution`, frames, or steps first if previews OOM.
- Attention defaults: on CUDA, LongCat‑Video automatically uses the bundled block‑sparse Triton kernel when it is available and falls back to the standard dispatcher otherwise; no toggle needed. If you specifically want xFormers, set `attention_implementation: "xformers"` in your config/CLI.
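Combined, the savers above might look like the following config fragment. This is a sketch: `lora_rank` and `base_model_precision` appear in the sample config at the end of this guide, while the offload keys are assumed to mirror the CLI flags shown above — keep the CLI form if your config version differs.

```json
{
  "lora_rank": 8,
  "base_model_precision": "int8-quanto",
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1
}
```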
Run `simpletuner train --config config/config.json`, or launch the Web UI and submit a job with the same config.
- Use captioned video datasets; each sample should provide frames (or a short clip) plus a text caption. `dataset_type: video` is handled automatically via `VideoToTensor`.
- Keep frame dimensions on the 64px grid (e.g., 480x832, 720p buckets). Height/width must be divisible by the Wan VAE stride (16px with the built‑in settings) and by 64 for bucketing.
- For image‑to‑video runs, include a conditioning image per sample; it is placed in the first latent frame and kept fixed during sampling.
- LongCat‑Video is 30 fps by design. The default 93 frames is ~3.1 s; if you change frame counts, keep `(frames - 1) % 4 == 0` and remember duration scales with fps.
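For example, a shorter validation clip that still satisfies the stride rule: 61 frames gives (61 − 1) = 60, a multiple of 4, and runs ≈2 s at 30 fps. A sketch using fields from the sample config (values are illustrative):

```json
{
  "validation_num_video_frames": 61,
  "validation_resolution": "480x832"
}
```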
In your dataset's video section, you can configure how videos are grouped:
- `bucket_strategy`: `aspect_ratio` (default) groups by spatial aspect ratio; `resolution_frames` groups by `WxH@F` format (e.g., `480x832@93`) for mixed-resolution/duration datasets.
- `frame_interval`: when using `resolution_frames`, round frame counts to this interval (e.g., set to 4 to match the VAE temporal stride).
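A minimal sketch of a dataset entry's video section for a mixed‑duration dataset. Only the fields named above are shown; the exact nesting of the `video` section is assumed, and the rest of the entry follows your usual dataloader config:

```json
{
  "dataset_type": "video",
  "video": {
    "bucket_strategy": "resolution_frames",
    "frame_interval": 4
  }
}
```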
- Guidance: 3.5–5.0 works well; empty negative prompts are auto‑generated when CFG is enabled.
- Steps: 35–45 for quality checks; lower for quick previews.
- Frames: 93 by default (aligns with the VAE temporal stride of 4).
- Need more headroom for previews or training? Set `musubi_blocks_to_swap` (try 4–8) and optionally `musubi_block_swap_device` to stream the last transformer blocks from CPU while running forward/backward. Expect extra transfer overhead but lower VRAM peaks.
- Validation runs from the `validation_*` fields in your config or via the WebUI preview tab after `simpletuner server` is started. Use those paths for quick checks instead of a standalone CLI subcommand.
- For dataset-driven validation (including I2V), set `validation_using_datasets: true` and point `eval_dataset_id` at your validation split. If that split is marked `is_i2v` and has linked conditioning frames, the pipeline keeps the first frame fixed automatically.
- Latent previews are unpacked before decoding to avoid channel mismatches.
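The block‑swap and dataset‑driven validation options above can be sketched as one config fragment. Field names come from the bullets; the values `8`, `"cpu"`, and the dataset id are illustrative placeholders to replace with your own:

```json
{
  "musubi_blocks_to_swap": 8,
  "musubi_block_swap_device": "cpu",
  "validation_using_datasets": true,
  "eval_dataset_id": "my-validation-split"
}
```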
- Height/width errors: ensure both are divisible by 16 and stay on the 64px grid.
- MPS float64 warnings: handled internally; keep precision at bf16/float32.
- OOM: drop validation resolution or frames first, lower LoRA rank, enable group offload, or switch to `int8-quanto`/`fp8-torchao`.
- Blank negatives with CFG: if you omit a negative prompt, the pipeline inserts an empty one automatically.
- `final`: main LongCat‑Video release (text‑to‑video + image‑to‑video in one checkpoint).
```json
{
  "model_type": "lora",
  "model_family": "longcat_video",
  "model_flavour": "final",
  "pretrained_model_name_or_path": null,  // auto-selected from flavour
  "base_model_precision": "bf16",  // int8-quanto/fp8-torchao also work for LoRA
  "train_batch_size": 1,
  "gradient_checkpointing": true,
  "lora_rank": 8,
  "learning_rate": 1e-4,
  "validation_resolution": "480x832",
  "validation_num_video_frames": 93,
  "validation_num_inference_steps": 40,
  "validation_guidance": 4.0
}
```