---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
sidebar-title: FastVideo
---

# FastVideo

This guide covers deploying [FastVideo](https://github.com/hao-ai-lab/FastVideo) text-to-video generation on Dynamo using a custom worker (`worker.py`) exposed through the `/v1/videos` endpoint.

> [!NOTE]
> Dynamo also supports diffusion through built-in backends: [SGLang Diffusion](../../backends/sglang/sglang-diffusion.md) (LLM diffusion, image, video), [vLLM-Omni](../../backends/vllm/vllm-omni.md) (text-to-image, text-to-video), and [TRT-LLM Video Diffusion](../../backends/trtllm/trtllm-video-diffusion.md). See the [Diffusion Overview](README.md) for the full support matrix.
## Overview

- **Default model:** `FastVideo/LTX2-Distilled-Diffusers` — a distilled variant of the LTX-2 Diffusion Transformer (Lightricks) that reduces inference from 50+ steps to just 5.
- **Two-stage pipeline:** Stage 1 generates video at the target resolution; Stage 2 refines it with a distilled LoRA for improved fidelity and texture.
- **Optimized inference:** FP4 quantization and `torch.compile` are enabled by default for maximum throughput.
- **Response format:** Returns one complete MP4 payload per request as `data[0].b64_json` (non-streaming).
- **Concurrency:** One request at a time per worker (`VideoGenerator` is not re-entrant). Scale throughput by running multiple workers.

> [!IMPORTANT]
> This example is optimized for **NVIDIA B200/B300** GPUs (CUDA arch 10.0) with FP4 quantization and flash-attention. It can run on other GPUs (H100, A100, etc.) by passing `--disable-optimizations` to `worker.py`, which disables FP4 quantization and `torch.compile`, and switches the attention backend from FLASH_ATTN to TORCH_SDPA. Expect lower performance but broader compatibility.
## Docker Image Build

The local Docker workflow builds a runtime image from the [`Dockerfile`](https://github.com/ai-dynamo/dynamo/tree/main/examples/diffusers/Dockerfile):

- Base image: `nvidia/cuda:13.1.1-devel-ubuntu24.04`
- Installs [FastVideo](https://github.com/hao-ai-lab/FastVideo) from GitHub
- Installs Dynamo from the `release/1.0.0` branch (for `/v1/videos` support)
- Compiles a [flash-attention](https://github.com/RandNMR73/flash-attention) fork from source

> [!WARNING]
> The first Docker image build can take **20–40+ minutes** because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if the Docker layer cache is preserved. Compiling `flash-attention` can use significant RAM — low-memory builders may hit out-of-memory failures. If that happens, lower `MAX_JOBS` in the Dockerfile to reduce parallel compile memory usage. The [flash-attn install notes](https://pypi.org/project/flash-attn/) specifically recommend this on machines with less than 96 GB RAM and many CPU cores.
## Warmup Time

On first start, workers download model weights and run compile/warmup steps. Expect roughly **10–20 minutes** before the first request is ready (hardware-dependent). After the first successful response, the second request can still take around **35 seconds** while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.

> [!TIP]
> When using Kubernetes, mount a shared Hugging Face cache PVC (see [Kubernetes Deployment](#kubernetes-deployment)) so model weights are downloaded once and reused across pod restarts.
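Because warmup can take many minutes, a client script may want to poll the frontend before sending its first real request. The sketch below uses only the Python standard library; the `/v1/models` path is an assumption based on the frontend being OpenAI-compatible, so adjust it to whatever endpoint your deployment actually serves.

```python
# Sketch: wait for the frontend to come up before sending the first request.
# ASSUMPTION: the frontend answers GET /v1/models once ready; adjust if not.
import time
import urllib.error
import urllib.request


def backoff_delays(max_wait_s: int, step_s: int = 30) -> list[int]:
    """Fixed-interval polling schedule covering up to max_wait_s seconds."""
    return [step_s] * (max_wait_s // step_s)


def wait_until_ready(base_url: str, max_wait_s: int = 1200) -> bool:
    """Return True once the frontend responds, False if max_wait_s elapses."""
    for delay in backoff_delays(max_wait_s):
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(delay)
    return False
```

A 1200-second budget matches the rough 10–20 minute warmup window quoted above.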
## Local Deployment

### Prerequisites

**For Docker Compose:**

- Docker Engine 26.0+
- Docker Compose v2
- NVIDIA Container Toolkit

**For host-local script:**

- Python environment with Dynamo + FastVideo dependencies installed
- CUDA-compatible GPU runtime available on host

### Option 1: Docker Compose

```bash
cd <dynamo-root>/examples/diffusers/local

# Start 4 workers on GPUs 0..3
COMPOSE_PROFILES=4 docker compose up --build
```

The Compose file builds from the Dockerfile and exposes the API on `http://localhost:8000`. See the [Docker Image Build](#docker-image-build) section for build time expectations.

### Option 2: Host-Local Script

```bash
cd <dynamo-root>/examples/diffusers/local
./run_local.sh
```

Environment variables:

| Variable | Default | Description |
|---|---|---|
| `PYTHON_BIN` | `python3` | Python interpreter |
| `MODEL` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
| `NUM_GPUS` | `1` | Number of GPUs |
| `HTTP_PORT` | `8000` | Frontend HTTP port |
| `WORKER_EXTRA_ARGS` | — | Extra flags for `worker.py` (e.g., `--disable-optimizations`) |
| `FRONTEND_EXTRA_ARGS` | — | Extra flags for `dynamo.frontend` |

Example:

```bash
MODEL=FastVideo/LTX2-Distilled-Diffusers \
NUM_GPUS=1 \
HTTP_PORT=8000 \
WORKER_EXTRA_ARGS="--disable-optimizations" \
./run_local.sh
```

> [!NOTE]
> `--disable-optimizations` is a `worker.py` flag (not a `dynamo.frontend` flag), so pass it through `WORKER_EXTRA_ARGS`.

The script writes logs to:

- `.runtime/logs/worker.log`
- `.runtime/logs/frontend.log`
## Kubernetes Deployment

### Files

| File | Description |
|---|---|
| `agg.yaml` | Base aggregated deployment (Frontend + `FastVideoWorker`) |
| `agg_user_workload.yaml` | Same deployment with `user-workload` tolerations and `imagePullSecrets` |
| `huggingface-cache-pvc.yaml` | Shared HF cache PVC for model weights |
| `dynamo-platform-values-user-workload.yaml` | Optional Helm values for clusters with tainted `user-workload` nodes |

### Prerequisites

1. Dynamo Kubernetes Platform installed
2. GPU-enabled Kubernetes cluster
3. FastVideo runtime image pushed to your registry
4. Optional HF token secret (for gated models)

Create a Hugging Face token secret if needed:

```bash
export NAMESPACE=<your-namespace>
export HF_TOKEN=<your-hf-token>
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

### Deploy

```bash
cd <dynamo-root>/examples/diffusers/deploy
export NAMESPACE=<your-namespace>

kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
kubectl apply -f agg.yaml -n ${NAMESPACE}
```

For clusters with tainted `user-workload` nodes and private registry pulls:

1. Set your pull secret name and image in `agg_user_workload.yaml`.
2. Apply:

```bash
kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
kubectl apply -f agg_user_workload.yaml -n ${NAMESPACE}
```

### Update Image Quickly

```bash
export DEPLOYMENT_FILE=agg.yaml
export FASTVIDEO_IMAGE=<my-registry/fastvideo-runtime:my-tag>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FASTVIDEO_IMAGE)' \
  ${DEPLOYMENT_FILE} > ${DEPLOYMENT_FILE}.generated

kubectl apply -f ${DEPLOYMENT_FILE}.generated -n ${NAMESPACE}
```
### Verify and Access

```bash
kubectl get dgd -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE}
kubectl logs -n ${NAMESPACE} -l nvidia.com/dynamo-component=FastVideoWorker
```

```bash
kubectl port-forward -n ${NAMESPACE} svc/fastvideo-agg-frontend 8000:8000
```
## Test Request

> [!NOTE]
> If this is the first request after startup, expect it to take longer while warmup completes. See [Warmup Time](#warmup-time) for details.

Send a request and decode the response:

```bash
curl -s -X POST http://localhost:8000/v1/videos \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "FastVideo/LTX2-Distilled-Diffusers",
    "prompt": "A cinematic drone shot over a snowy mountain range at sunrise",
    "size": "1920x1088",
    "seconds": 5,
    "nvext": {
      "fps": 24,
      "num_frames": 121,
      "num_inference_steps": 5,
      "guidance_scale": 1.0,
      "seed": 10
    }
  }' > response.json

# Linux
jq -r '.data[0].b64_json' response.json | base64 --decode > output.mp4

# macOS
jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
```
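The same request can be made from Python with only the standard library. This is a sketch of the curl example above: the endpoint and the `data[0].b64_json` response shape come from this guide, while the function names (`build_payload`, `decode_video`, `generate`) are illustrative helpers, not part of any Dynamo API.

```python
# Sketch: send a /v1/videos request and write the decoded MP4 to disk.
import base64
import json
import urllib.request


def build_payload(prompt: str, seconds: int = 5, size: str = "1920x1088") -> dict:
    """Build a request body matching the curl example in this guide."""
    return {
        "model": "FastVideo/LTX2-Distilled-Diffusers",
        "prompt": prompt,
        "size": size,
        "seconds": seconds,
        "nvext": {"fps": 24, "num_frames": 121, "num_inference_steps": 5,
                  "guidance_scale": 1.0, "seed": 10},
    }


def decode_video(response_json: dict) -> bytes:
    """Extract and base64-decode the MP4 payload from a /v1/videos response."""
    return base64.b64decode(response_json["data"][0]["b64_json"])


def generate(base_url: str, prompt: str, out_path: str = "output.mp4") -> None:
    req = urllib.request.Request(
        f"{base_url}/v1/videos",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Generous timeout: the first request after startup can take minutes.
    with urllib.request.urlopen(req, timeout=1800) as resp:
        body = json.load(resp)
    with open(out_path, "wb") as f:
        f.write(decode_video(body))
```

Call it as `generate("http://localhost:8000", "A cinematic drone shot ...")` once the frontend is up.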
## Worker Configuration Reference

### CLI Flags

| Flag | Default | Description |
|---|---|---|
| `--model` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
| `--num-gpus` | `1` | Number of GPUs for distributed inference |
| `--disable-optimizations` | off | Disables FP4 quantization and `torch.compile`, and switches attention from FLASH_ATTN to TORCH_SDPA |

### Request Parameters (`nvext`)

| Field | Default | Description |
|---|---|---|
| `fps` | `24` | Frames per second |
| `num_frames` | `121` | Total frames; overrides `fps * seconds` when set |
| `num_inference_steps` | `5` | Diffusion inference steps |
| `guidance_scale` | `1.0` | Classifier-free guidance scale |
| `seed` | `10` | RNG seed for reproducibility |
| `negative_prompt` | — | Text to avoid in generation |
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `FASTVIDEO_VIDEO_CODEC` | `libx264` | Video codec for MP4 encoding |
| `FASTVIDEO_X264_PRESET` | `ultrafast` | x264 encoding speed preset |
| `FASTVIDEO_ATTENTION_BACKEND` | `FLASH_ATTN` | Attention backend (`FLASH_ATTN` or `TORCH_SDPA`) |
| `FASTVIDEO_STAGE_LOGGING` | `1` | Enable per-stage timing logs |
| `FASTVIDEO_LOG_LEVEL` | — | Set to `DEBUG` for verbose logging |
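A process reading these variables might resolve them with the documented defaults as below. This is an illustrative sketch only; the actual `worker.py` may read its configuration differently.

```python
# Sketch: resolve the documented environment variables with their defaults.
import os


def encoder_settings(env=None) -> dict:
    """Return encoder/backend settings from env, falling back to the
    defaults documented in the Environment Variables table."""
    env = os.environ if env is None else env
    return {
        "codec": env.get("FASTVIDEO_VIDEO_CODEC", "libx264"),
        "preset": env.get("FASTVIDEO_X264_PRESET", "ultrafast"),
        "attention_backend": env.get("FASTVIDEO_ATTENTION_BACKEND", "FLASH_ATTN"),
        "stage_logging": env.get("FASTVIDEO_STAGE_LOGGING", "1") == "1",
    }
```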
## Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| OOM during Docker build | `flash-attention` compilation uses too much RAM | Lower `MAX_JOBS` in the Dockerfile |
| 10–20 min wait on first start | Model download + `torch.compile` warmup | Expected behavior; subsequent starts are faster if weights are cached |
| ~35 s second request | Runtime caches still warming | Steady-state performance from the third request onward |
| Poor performance on non-B200/B300 GPUs | FP4 and flash-attention optimizations require CUDA arch 10.0 | Pass `--disable-optimizations` to `worker.py` |
## Source Code

The example source lives at [`examples/diffusers/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/diffusers) in the Dynamo repository.

## See Also

- [vLLM-Omni Text-to-Video](../../backends/vllm/vllm-omni.md#text-to-video) — vLLM-Omni video generation via `/v1/videos`
- [vLLM-Omni Text-to-Image](../../backends/vllm/vllm-omni.md#text-to-image) — vLLM-Omni image generation
- [SGLang Video Generation](../../backends/sglang/sglang-diffusion.md#video-generation) — SGLang video generation worker
- [SGLang Image Diffusion](../../backends/sglang/sglang-diffusion.md#image-diffusion) — SGLang image diffusion worker
- [TRT-LLM Video Diffusion](../../backends/trtllm/trtllm-video-diffusion.md#quick-start) — TensorRT-LLM video diffusion quick start
- [Diffusion Overview](README.md) — Full backend support matrix