LongCat‑Video is a 13.6B bilingual (zh/en) text‑to‑video and image‑to‑video model that uses flow matching, the Qwen‑2.5‑VL text encoder, and the Wan VAE. This guide walks you through setup, data prep, and running a first training/validation with SimpleTuner.
- 13.6B transformer + Wan VAE: expect higher VRAM than image models; start with `train_batch_size=1`, gradient checkpointing, and low LoRA ranks.
- System RAM: more than 32 GB is helpful for multi‑frame clips; keep datasets on fast storage.
- Apple MPS: supported for previews; positional encodings downcast to float32 automatically.
- Verify Python 3.12 (SimpleTuner ships a `.venv` by default): `python --version`
- Install SimpleTuner with the backend that matches your hardware:

  ```bash
  pip install "simpletuner[cuda]"  # NVIDIA
  pip install "simpletuner[mps]"   # Apple Silicon
  pip install "simpletuner[cpu]"   # CPU-only
  ```
- Quantisation is built in (`int8-quanto`, `int4-quanto`, `fp8-torchao`) and does not need extra manual installs in normal setups.
Run `simpletuner server`, open http://localhost:8001, and pick model family `longcat_video`.
Key defaults to keep
- Flow‑matching scheduler with shift `12.0` is automatic; no custom noise flags needed.
- Aspect buckets stay 64‑pixel aligned; `aspect_bucket_alignment` is enforced at 64.
- Max token length 512 (Qwen‑2.5‑VL); the pipeline auto‑adds empty negatives when CFG is on and no negative prompt is provided.
- Frames must satisfy `(num_frames - 1)` divisible by the VAE temporal stride (default 4). The default 93 frames already matches this (93 − 1 = 92 = 4 × 23); 81 or 121 also work, but 100 does not (99 is not a multiple of 4).
Optional VRAM savers:
- Reduce `lora_rank` (4–8) and use `int8-quanto` base precision.
- Enable group offload: `--enable_group_offload --group_offload_type block_level --group_offload_blocks_per_group 1`.
- Lower `validation_resolution`, frames, or steps first if previews OOM.
- Attention defaults: on CUDA, LongCat‑Video automatically uses the bundled block‑sparse Triton kernel when it is available and falls back to the standard dispatcher otherwise; no toggle needed. If you specifically want xFormers, set `attention_implementation: "xformers"` in your config/CLI.
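Combined, the savers above might look like the following config fragment. This is a sketch: `lora_rank` and `base_model_precision` appear in the sample config at the end of this guide, while the offload keys are assumed to mirror the CLI flags shown above — keep the CLI form if your config version differs.

```json
{
  "lora_rank": 8,
  "base_model_precision": "int8-quanto",
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1
}
```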
Run `simpletuner train --config config/config.json`, or launch the Web UI and submit a job with the same config.
- Use captioned video datasets; each sample should provide frames (or a short clip) plus a text caption. `dataset_type: video` is handled automatically via `VideoToTensor`.
- Keep frame dimensions on the 64px grid (e.g., 480x832, 720p buckets). Height/width must be divisible by the Wan VAE stride (16px with the built‑in settings) and by 64 for bucketing.
- For image‑to‑video runs, include a conditioning image per sample; it is placed in the first latent frame and kept fixed during sampling.
- LongCat‑Video is 30 fps by design. The default 93 frames is ~3.1 s; if you change frame counts, keep `(frames - 1) % 4 == 0` and remember duration scales with fps.
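For example, a shorter validation clip that still satisfies the stride rule: 61 frames gives (61 − 1) = 60, a multiple of 4, and runs ≈2 s at 30 fps. A sketch using fields from the sample config (values are illustrative):

```json
{
  "validation_num_video_frames": 61,
  "validation_resolution": "480x832"
}
```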
In your dataset's video section, you can configure how videos are grouped:
- `bucket_strategy`: `aspect_ratio` (default) groups by spatial aspect ratio; `resolution_frames` groups by `WxH@F` format (e.g., `480x832@93`) for mixed-resolution/duration datasets.
- `frame_interval`: when using `resolution_frames`, round frame counts to this interval (e.g., set to 4 to match the VAE temporal stride).
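A minimal sketch of a dataset entry's video section for a mixed‑duration dataset. Only the fields named above are shown; the exact nesting of the `video` section is assumed, and the rest of the entry follows your usual dataloader config:

```json
{
  "dataset_type": "video",
  "video": {
    "bucket_strategy": "resolution_frames",
    "frame_interval": 4
  }
}
```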
- Guidance: 3.5–5.0 works well; empty negative prompts are auto‑generated when CFG is enabled.
- Steps: 35–45 for quality checks; lower for quick previews.
- Frames: 93 by default (aligns with the VAE temporal stride of 4).
- Need more headroom for previews or training? Set `musubi_blocks_to_swap` (try 4–8) and optionally `musubi_block_swap_device` to stream the last transformer blocks from CPU while running forward/backward. Expect extra transfer overhead but lower VRAM peaks.
- Validation runs from the `validation_*` fields in your config or via the WebUI preview tab after `simpletuner server` is started. Use those paths for quick checks instead of a standalone CLI subcommand.
- For dataset-driven validation (including I2V), set `validation_using_datasets: true` and point `eval_dataset_id` at your validation split. If that split is marked `is_i2v` and has linked conditioning frames, the pipeline keeps the first frame fixed automatically.
- Latent previews are unpacked before decoding to avoid channel mismatches.
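The block‑swap and dataset‑driven validation options above can be sketched as one config fragment. Field names come from the bullets; the values `8`, `"cpu"`, and the dataset id are illustrative placeholders to replace with your own:

```json
{
  "musubi_blocks_to_swap": 8,
  "musubi_block_swap_device": "cpu",
  "validation_using_datasets": true,
  "eval_dataset_id": "my-validation-split"
}
```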
- Height/width errors: ensure both are divisible by 16 and stay on the 64px grid.
- MPS float64 warnings: handled internally; keep precision at bf16/float32.
- OOM: drop validation resolution or frames first, lower LoRA rank, enable group offload, or switch to `int8-quanto`/`fp8-torchao`.
- Blank negatives with CFG: if you omit a negative prompt, the pipeline inserts an empty one automatically.
- `final`: main LongCat‑Video release (text‑to‑video + image‑to‑video in one checkpoint).
```json
{
  "model_type": "lora",
  "model_family": "longcat_video",
  "model_flavour": "final",
  "pretrained_model_name_or_path": null,  // auto-selected from flavour
  "base_model_precision": "bf16",  // int8-quanto/fp8-torchao also work for LoRA
  "train_batch_size": 1,
  "gradient_checkpointing": true,
  "lora_rank": 8,
  "learning_rate": 1e-4,
  "validation_resolution": "480x832",
  "validation_num_video_frames": 93,
  "validation_num_inference_steps": 40,
  "validation_guidance": 4.0
}
```