Commit 2e3605e

docs: add FastVideo example and guide with light sidebar reorg #7283 (#7284)
Signed-off-by: Dan Gil <dagil@nvidia.com>
1 parent 50fc034 commit 2e3605e

File tree: 18 files changed, +1219 −21 lines
docs/features/agentic_workloads.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,7 +1,7 @@
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: Agentic Workloads
+title: Agents
 subtitle: Workload-aware inference with agentic hints for routing, scheduling, and KV cache Management
 ---
```
Lines changed: 1 addition & 0 deletions

```diff
@@ -30,3 +30,4 @@ For deployment guides, configuration, and examples for each backend:
 - **[vLLM-Omni](../../backends/vllm/vllm-omni.md)**
 - **[SGLang Diffusion](../../backends/sglang/sglang-diffusion.md)**
 - **[TRT-LLM Diffusion](../../backends/trtllm/trtllm-video-diffusion.md)**
+- **[FastVideo (custom worker)](fastvideo.md)**
```
Lines changed: 260 additions & 0 deletions (new file; content shown below)

---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
sidebar-title: FastVideo
---

# FastVideo

This guide covers deploying [FastVideo](https://github.com/hao-ai-lab/FastVideo) text-to-video generation on Dynamo using a custom worker (`worker.py`) exposed through the `/v1/videos` endpoint.

> [!NOTE]
> Dynamo also supports diffusion through built-in backends: [SGLang Diffusion](../../backends/sglang/sglang-diffusion.md) (LLM diffusion, image, video), [vLLM-Omni](../../backends/vllm/vllm-omni.md) (text-to-image, text-to-video), and [TRT-LLM Video Diffusion](../../backends/trtllm/trtllm-video-diffusion.md). See the [Diffusion Overview](README.md) for the full support matrix.

## Overview

- **Default model:** `FastVideo/LTX2-Distilled-Diffusers` — a distilled variant of the LTX-2 Diffusion Transformer (Lightricks), reducing inference from 50+ steps to just 5.
- **Two-stage pipeline:** Stage 1 generates video at target resolution; Stage 2 refines with a distilled LoRA for improved fidelity and texture.
- **Optimized inference:** FP4 quantization and `torch.compile` are enabled by default for maximum throughput.
- **Response format:** Returns one complete MP4 payload per request as `data[0].b64_json` (non-streaming).
- **Concurrency:** One request at a time per worker (VideoGenerator is not re-entrant). Scale throughput by running multiple workers.

> [!IMPORTANT]
> This example is optimized for **NVIDIA B200/B300** GPUs (CUDA arch 10.0) with FP4 quantization and flash-attention. It can run on other GPUs (H100, A100, etc.) by passing `--disable-optimizations` to `worker.py`, which disables FP4 quantization and `torch.compile`, and switches the attention backend from FLASH_ATTN to TORCH_SDPA. Expect lower performance but broader compatibility.
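Because each worker serves exactly one request at a time, a client gains nothing by keeping more requests in flight than there are workers. A minimal sketch of a client-side cap (the `generate` stub, its return value, and the worker count are illustrative, not part of this example's code; a real `generate` would POST to `/v1/videos` as shown in the Test Request section):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 4  # match the number of Dynamo workers you deployed


def generate(prompt: str) -> str:
    """Stub standing in for a real POST to /v1/videos.

    A real implementation would return the base64-encoded MP4 from
    the response's data[0].b64_json field.
    """
    return f"<b64 mp4 for: {prompt}>"


def generate_all(prompts):
    # Cap in-flight requests at the worker count: each worker handles one
    # request at a time, so extra concurrency only queues at the frontend.
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        return list(pool.map(generate, prompts))


videos = generate_all(["a drone shot over mountains", "a rainy city street"])
```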
## Docker Image Build

The local Docker workflow builds a runtime image from the [`Dockerfile`](https://github.com/ai-dynamo/dynamo/tree/main/examples/diffusers/Dockerfile):

- Base image: `nvidia/cuda:13.1.1-devel-ubuntu24.04`
- Installs [FastVideo](https://github.com/hao-ai-lab/FastVideo) from GitHub
- Installs Dynamo from the `release/1.0.0` branch (for `/v1/videos` support)
- Compiles a [flash-attention](https://github.com/RandNMR73/flash-attention) fork from source

> [!WARNING]
> The first Docker image build can take **20–40+ minutes** because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if the Docker layer cache is preserved. Compiling `flash-attention` can use significant RAM — low-memory builders may hit out-of-memory failures. If that happens, lower `MAX_JOBS` in the Dockerfile to reduce parallel compile memory usage. The [flash-attn install notes](https://pypi.org/project/flash-attn/) specifically recommend this on machines with less than 96 GB RAM and many CPU cores.

## Warmup Time

On first start, workers download model weights and run compile/warmup steps. Expect roughly **10–20 minutes** before the first request is ready (hardware-dependent). After the first successful response, the second request can still take around **35 seconds** while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.

> [!TIP]
> When using Kubernetes, mount a shared Hugging Face cache PVC (see [Kubernetes Deployment](#kubernetes-deployment)) so model weights are downloaded once and reused across pod restarts.

## Local Deployment

### Prerequisites

**For Docker Compose:**

- Docker Engine 26.0+
- Docker Compose v2
- NVIDIA Container Toolkit

**For the host-local script:**

- Python environment with Dynamo + FastVideo dependencies installed
- CUDA-compatible GPU runtime available on the host

### Option 1: Docker Compose

```bash
cd <dynamo-root>/examples/diffusers/local

# Start 4 workers on GPUs 0..3
COMPOSE_PROFILES=4 docker compose up --build
```

The Compose file builds from the Dockerfile and exposes the API on `http://localhost:8000`. See the [Docker Image Build](#docker-image-build) section for build time expectations.

### Option 2: Host-Local Script

```bash
cd <dynamo-root>/examples/diffusers/local
./run_local.sh
```

Environment variables:

| Variable | Default | Description |
|---|---|---|
| `PYTHON_BIN` | `python3` | Python interpreter |
| `MODEL` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
| `NUM_GPUS` | `1` | Number of GPUs |
| `HTTP_PORT` | `8000` | Frontend HTTP port |
| `WORKER_EXTRA_ARGS` | | Extra flags for `worker.py` (e.g., `--disable-optimizations`) |
| `FRONTEND_EXTRA_ARGS` | | Extra flags for `dynamo.frontend` |

Example:

```bash
MODEL=FastVideo/LTX2-Distilled-Diffusers \
NUM_GPUS=1 \
HTTP_PORT=8000 \
WORKER_EXTRA_ARGS="--disable-optimizations" \
./run_local.sh
```

> [!NOTE]
> `--disable-optimizations` is a `worker.py` flag (not a `dynamo.frontend` flag), so pass it through `WORKER_EXTRA_ARGS`.

The script writes logs to:

- `.runtime/logs/worker.log`
- `.runtime/logs/frontend.log`

## Kubernetes Deployment

### Files

| File | Description |
|---|---|
| `agg.yaml` | Base aggregated deployment (Frontend + `FastVideoWorker`) |
| `agg_user_workload.yaml` | Same deployment with `user-workload` tolerations and `imagePullSecrets` |
| `huggingface-cache-pvc.yaml` | Shared HF cache PVC for model weights |
| `dynamo-platform-values-user-workload.yaml` | Optional Helm values for clusters with tainted `user-workload` nodes |

### Prerequisites

1. Dynamo Kubernetes Platform installed
2. GPU-enabled Kubernetes cluster
3. FastVideo runtime image pushed to your registry
4. Optional HF token secret (for gated models)

Create a Hugging Face token secret if needed:

```bash
export NAMESPACE=<your-namespace>
export HF_TOKEN=<your-hf-token>
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

### Deploy

```bash
cd <dynamo-root>/examples/diffusers/deploy
export NAMESPACE=<your-namespace>

kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
kubectl apply -f agg.yaml -n ${NAMESPACE}
```

For clusters with tainted `user-workload` nodes and private registry pulls:

1. Set your pull secret name and image in `agg_user_workload.yaml`.
2. Apply:

```bash
kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
kubectl apply -f agg_user_workload.yaml -n ${NAMESPACE}
```

### Update Image Quickly

```bash
export DEPLOYMENT_FILE=agg.yaml
export FASTVIDEO_IMAGE=<my-registry/fastvideo-runtime:my-tag>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FASTVIDEO_IMAGE)' \
  ${DEPLOYMENT_FILE} > ${DEPLOYMENT_FILE}.generated

kubectl apply -f ${DEPLOYMENT_FILE}.generated -n ${NAMESPACE}
```

### Verify and Access

```bash
kubectl get dgd -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE}
kubectl logs -n ${NAMESPACE} -l nvidia.com/dynamo-component=FastVideoWorker
```

```bash
kubectl port-forward -n ${NAMESPACE} svc/fastvideo-agg-frontend 8000:8000
```

## Test Request

> [!NOTE]
> If this is the first request after startup, expect it to take longer while warmup completes. See [Warmup Time](#warmup-time) for details.

Send a request and decode the response:

```bash
curl -s -X POST http://localhost:8000/v1/videos \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "FastVideo/LTX2-Distilled-Diffusers",
    "prompt": "A cinematic drone shot over a snowy mountain range at sunrise",
    "size": "1920x1088",
    "seconds": 5,
    "nvext": {
      "fps": 24,
      "num_frames": 121,
      "num_inference_steps": 5,
      "guidance_scale": 1.0,
      "seed": 10
    }
  }' > response.json

# Linux
jq -r '.data[0].b64_json' response.json | base64 --decode > output.mp4

# macOS
jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
```
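The same request can be issued from Python using only the standard library. This sketch mirrors the curl body above; the helper names (`build_payload`, `save_video`, `request_video`) are illustrative, not part of the example's shipped code:

```python
import base64
import json
import urllib.request


def build_payload(prompt, size="1920x1088", seconds=5, **nvext):
    # Mirrors the curl body above; nvext carries FastVideo-specific extensions.
    return {
        "model": "FastVideo/LTX2-Distilled-Diffusers",
        "prompt": prompt,
        "size": size,
        "seconds": seconds,
        "nvext": {
            "fps": 24,
            "num_frames": 121,
            "num_inference_steps": 5,
            "guidance_scale": 1.0,
            "seed": 10,
            **nvext,  # caller overrides, e.g. seed=42
        },
    }


def save_video(response_body: bytes, path="output.mp4"):
    # The API returns one MP4 per request as data[0].b64_json (non-streaming).
    payload = json.loads(response_body)
    with open(path, "wb") as f:
        f.write(base64.b64decode(payload["data"][0]["b64_json"]))


def request_video(prompt, url="http://localhost:8000/v1/videos"):
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        save_video(resp.read())
```

Call `request_video("A cinematic drone shot ...")` against a running deployment; remember the first request after startup is slow while warmup completes.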
## Worker Configuration Reference

### CLI Flags

| Flag | Default | Description |
|---|---|---|
| `--model` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
| `--num-gpus` | `1` | Number of GPUs for distributed inference |
| `--disable-optimizations` | off | Disables FP4 quantization and `torch.compile`, and switches attention from FLASH_ATTN to TORCH_SDPA |

### Request Parameters (`nvext`)

| Field | Default | Description |
|---|---|---|
| `fps` | `24` | Frames per second |
| `num_frames` | `121` | Total frames; overrides `fps * seconds` when set |
| `num_inference_steps` | `5` | Diffusion inference steps |
| `guidance_scale` | `1.0` | Classifier-free guidance scale |
| `seed` | `10` | RNG seed for reproducibility |
| `negative_prompt` | | Text to avoid in generation |
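The precedence between `num_frames` and `fps * seconds` described in the table can be summarized in a tiny helper. This is a sketch only: the fallback rule `fps * seconds` is an assumption (note the shipped defaults give 121 frames at 24 fps for 5 s, one more than 24 × 5), so check `worker.py` for the exact behavior:

```python
def effective_frames(fps: int = 24, seconds: int = 5, num_frames=None) -> int:
    # Per the table above, an explicit num_frames overrides fps * seconds.
    # The fallback shown here is an assumed simplification.
    return num_frames if num_frames is not None else fps * seconds


effective_frames(num_frames=121)   # explicit value wins
effective_frames(fps=24, seconds=5)  # falls back to 24 * 5 = 120
```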
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `FASTVIDEO_VIDEO_CODEC` | `libx264` | Video codec for MP4 encoding |
| `FASTVIDEO_X264_PRESET` | `ultrafast` | x264 encoding speed preset |
| `FASTVIDEO_ATTENTION_BACKEND` | `FLASH_ATTN` | Attention backend (`FLASH_ATTN` or `TORCH_SDPA`) |
| `FASTVIDEO_STAGE_LOGGING` | `1` | Enable per-stage timing logs |
| `FASTVIDEO_LOG_LEVEL` | | Set to `DEBUG` for verbose logging |

## Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| OOM during Docker build | `flash-attention` compilation uses too much RAM | Lower `MAX_JOBS` in the Dockerfile |
| 10–20 min wait on first start | Model download + `torch.compile` warmup | Expected behavior; subsequent starts are faster if weights are cached |
| ~35 s second request | Runtime caches still warming | Steady-state performance from the third request onward |
| Poor performance on non-B200/B300 GPUs | FP4 and flash-attention optimizations require CUDA arch 10.0 | Pass `--disable-optimizations` to `worker.py` |

## Source Code

The example source lives at [`examples/diffusers/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/diffusers) in the Dynamo repository.

## See Also

- [vLLM-Omni Text-to-Video](../../backends/vllm/vllm-omni.md#text-to-video) — vLLM-Omni video generation via `/v1/videos`
- [vLLM-Omni Text-to-Image](../../backends/vllm/vllm-omni.md#text-to-image) — vLLM-Omni image generation
- [SGLang Video Generation](../../backends/sglang/sglang-diffusion.md#video-generation) — SGLang video generation worker
- [SGLang Image Diffusion](../../backends/sglang/sglang-diffusion.md#image-diffusion) — SGLang image diffusion worker
- [TRT-LLM Video Diffusion](../../backends/trtllm/trtllm-video-diffusion.md#quick-start) — TensorRT-LLM video diffusion quick start
- [Diffusion Overview](README.md) — Full backend support matrix

docs/index.yml

Lines changed: 23 additions & 20 deletions
```diff
@@ -89,31 +89,34 @@ navigation:
         path: components/kvbm/kvbm-guide.md
       - page: Dynamo Benchmarking
         path: benchmarks/benchmarking.md
-      - section: Multimodal Model Serving
+      - section: Multimodal
+        path: features/multimodal/README.md
         contents:
-          - section: Vision Language Models (VLMs)
-            path: features/multimodal/README.md
-            contents:
-              - page: Embedding Cache
-                path: features/multimodal/embedding-cache.md
-              - page: Encoder Disaggregation
-                path: features/multimodal/encoder-disaggregation.md
-              - page: Multimodal KV Routing
-                path: features/multimodal/multimodal-kv-routing.md
-          - section: Diffusion (Experimental)
-            path: features/multimodal/diffusion.md
-            contents:
-              - page: vLLM-Omni
-                path: backends/vllm/vllm-omni.md
-              - page: SGLang Diffusion
-                path: backends/sglang/sglang-diffusion.md
-              - page: TRT-LLM Diffusion
-                path: backends/trtllm/trtllm-video-diffusion.md
+          - page: Embedding Cache
+            path: features/multimodal/embedding-cache.md
+          - page: Encoder Disaggregation
+            path: features/multimodal/encoder-disaggregation.md
+          - page: Multimodal KV Routing
+            path: features/multimodal/multimodal-kv-routing.md
+          - section: Diffusion (Preview)
+            slug: diffusion
+            path: features/diffusion/README.md
+            contents:
+              - page: FastVideo
+                slug: fastvideo
+                path: features/diffusion/fastvideo.md
+              - page: vLLM-Omni
+                path: backends/vllm/vllm-omni.md
+              - page: SGLang Diffusion
+                path: backends/sglang/sglang-diffusion.md
+              - page: TRT-LLM Diffusion
+                path: backends/trtllm/trtllm-video-diffusion.md
       - page: Tool Calling
         path: agents/tool-calling.md
       - page: LoRA Adapters
         path: features/lora/README.md
-      - section: Agentic Workloads
+      - section: Agents
+        slug: agents
         path: features/agentic_workloads.md
         contents:
           - page: SGLang for Agentic Workloads
```
examples/diffusers/.dockerignore

Lines changed: 12 additions & 0 deletions (new file; content shown below)

```
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# Local outputs
outputs
outputs_video

# Python caches
__pycache__
*.pyc
.git
local/.runtime
```

examples/diffusers/Dockerfile

Lines changed: 47 additions & 0 deletions (new file; content shown below)

```dockerfile
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# Shared runtime image for Dynamo frontend and FastVideo workers.
FROM nvidia/cuda:13.1.1-devel-ubuntu24.04

RUN apt-get update \
    && apt-get install -yq libucx0 python3-dev python3-pip python3-venv git protobuf-compiler curl ffmpeg libclang-dev \
    && apt-get clean

COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
ENV UV_LINK_MODE=copy

RUN uv venv /opt/dynamo/venv --python 3.12 \
    && . /opt/dynamo/venv/bin/activate \
    && uv pip install pip setuptools packaging ninja psutil uvloop \
    && uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130 \
    && uv pip install flashinfer-python

ENV VIRTUAL_ENV=/opt/dynamo/venv
ENV PATH="${VIRTUAL_ENV}/bin:${PATH}"

# flash-attn compilation is memory-intensive. If the build OOMs, lower MAX_JOBS.
# The flash-attn install notes call this out for machines with <96GB RAM and many CPU cores.
RUN git clone https://github.com/RandNMR73/flash-attention \
    && cd flash-attention \
    && git switch fa4-compile \
    && TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install . --no-build-isolation \
    && TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install ./flash_attn/cute \
    && rm -rf ../flash-attention

# Install Dynamo with /v1/videos support.
RUN uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0#subdirectory=lib/bindings/python' \
    && uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0'

# Install FastVideo directly from the public upstream repository.
# Checkout with --recurse-submodules to get the required submodules as well.
RUN . /opt/dynamo/venv/bin/activate \
    && uv pip install setuptools_scm scikit-build-core cmake ninja \
    && git clone --recurse-submodules https://github.com/hao-ai-lab/FastVideo.git /tmp/FastVideo \
    && TORCH_CUDA_ARCH_LIST="10.0 10.0a" uv pip install --no-build-isolation /tmp/FastVideo

ENV FASTVIDEO_VIDEO_CODEC=libx264
ENV FASTVIDEO_X264_PRESET=ultrafast

WORKDIR /opt/app
COPY . /opt/app/
```
