diff --git a/docs/contributing/profiling.md b/docs/contributing/profiling.md
index a7df8c3229..eb1b9c7211 100644
--- a/docs/contributing/profiling.md
+++ b/docs/contributing/profiling.md
@@ -2,22 +2,28 @@
 > **Warning:** Profiling incurs significant overhead. Use only for development and debugging, never in production.

-vLLM-Omni uses the PyTorch Profiler to analyze performance across both **multi-stage omni-modality models** and **diffusion models**.
+vLLM-Omni supports two profiling approaches:
+- **PyTorch Profiler** — detailed CPU/CUDA traces (`*.pt.trace.json` files viewable in Perfetto)
+- **Nsight Systems (nsys)** — GPU-level tracing with CUDA kernel timelines (`.nsys-rep` files)

-### 1. Set the Output Directory
-Before running any script, set this environment variable. The system detects this and automatically saves traces here.
+### 1. Set the Output Directory (PyTorch Profiler)
+Before running any profiling script, set this environment variable. The system detects it and automatically saves traces there.

 ```bash
 export VLLM_TORCH_PROFILER_DIR=./profiles
 ```

-### 2. Profiling Omni-Modality Models
+### 2. Profiling Omni-Modality Models (Offline)
 It is best to limit profiling to one iteration to keep trace files manageable.

 ```bash
 export VLLM_PROFILER_MAX_ITERS=1
 ```
+Optionally, skip initial warmup iterations before collecting traces:
+```bash
+export VLLM_PROFILER_DELAY_ITERS=1
+```

 **Selective Stage Profiling**
 By default, the profiler runs across all stages. It is highly recommended to restrict profiling to specific stages by passing a list of stage names, which prevents overly large trace files:
@@ -82,7 +88,7 @@ omni_llm.close()
 2. **Qwen3-Omni**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py)

-### 3. Profiling diffusion models
+### 3. Profiling Diffusion Models (Offline)

 Diffusion profiling is end-to-end, capturing encoding, the denoising loop, and decoding.
@@ -131,15 +137,47 @@ python image_to_video.py \
 2. **Wan-AI/Wan2.2-I2V-A14B-Diffusers**: [https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)

-> **Note:**
-As of now, asynchronous (online) profiling is not fully supported in vLLM-Omni. While start_profile() and stop_profile() methods exist, they are only reliable in offline inference scripts (e.g., the provided end2end.py examples). Do not use them in server-mode or streaming scenarios—traces may be incomplete or fail to flush.
+### 4. Nsight Systems Profiling (Diffusion)
+
+For deeper GPU-level analysis of diffusion workloads, use NVIDIA Nsight Systems (`nsys`). Diffusion workers follow the same profiler pattern as vLLM — set `VLLM_TORCH_CUDA_PROFILE=1` to enable the CUDA profiler, which signals nsys via `torch.cuda.profiler.start()/stop()`.
+
+**Usage:**
+
+```bash
+# Enable CUDA profiler for nsys integration
+export VLLM_TORCH_CUDA_PROFILE=1
+# Capture a fixed range of iterations (skip warmup, then capture N iters)
+export VLLM_PROFILER_DELAY_ITERS=10
+export VLLM_PROFILER_MAX_ITERS=10
+# Optional: enable NVTX ranges (used by vLLM tracing)
+export VLLM_PROFILER_TRACE_DIR=./vllm_trace
+
+nsys profile \
+  --capture-range=cudaProfilerApi \
+  --capture-range-end=stop \
+  --trace-fork-before-exec=true \
+  --cuda-graph-trace=node \
+  --sample=none \
+  --stats=true \
+  -o diffusion_trace \
+  python image_to_video.py --model Wan-AI/Wan2.2-I2V-A14B-Diffusers ...
+```
+
+The `VLLM_TORCH_CUDA_PROFILE=1` environment variable configures diffusion workers to use vLLM's `CudaProfilerWrapper`, which brackets GPU work with `torch.cuda.profiler.start()/stop()` calls that nsys captures.
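The start/stop bracketing described above can be sketched as a small context manager. This is an illustrative stand-in, not vLLM-Omni's actual `CudaProfilerWrapper` implementation; the class name reuse and the `enabled` flag (standing in for the `VLLM_TORCH_CUDA_PROFILE` check) are assumptions for the sketch.

```python
import torch


class CudaProfilerWrapper:
    """Illustrative sketch: bracket a region of GPU work so that nsys,
    launched with --capture-range=cudaProfilerApi, records only that region."""

    def __init__(self, enabled: bool = True) -> None:
        # Only arm the profiler when CUDA is actually available.
        self.enabled = enabled and torch.cuda.is_available()

    def __enter__(self) -> "CudaProfilerWrapper":
        if self.enabled:
            torch.cuda.profiler.start()  # cudaProfilerStart: nsys begins capture
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        if self.enabled:
            torch.cuda.synchronize()     # flush queued kernels into the capture
            torch.cuda.profiler.stop()   # cudaProfilerStop: nsys ends capture


# Usage sketch: wrap only the iterations you want nsys to record, e.g.
# with CudaProfilerWrapper():
#     run_denoising_iteration()
```

Pairing this bracketing with `--capture-range=cudaProfilerApi --capture-range-end=stop` keeps the resulting `.nsys-rep` small, since warmup and teardown fall outside the capture window.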
+
+```bash
+ls diffusion_trace*.nsys-rep
+nsys stats diffusion_trace.nsys-rep
+```
+
+Open the `.nsys-rep` file in the Nsight Systems GUI for detailed CUDA kernel timelines, memory operations, and NVTX ranges.

-### 4. Analyzing Omni Traces
+### 5. Analyzing Omni Traces
 Output files are saved to your configured `VLLM_TORCH_PROFILER_DIR`.

 **Output**

-**Chrome Trace** (```.json.gz```): Visual timeline of kernels and stages. Open in Perfetto UI.
+**Chrome Trace** (`.pt.trace.json`): Visual timeline of kernels and stages. Open in Perfetto UI.

 **Viewing Tools:**
diff --git a/vllm_omni/diffusion/diffusion_engine.py b/vllm_omni/diffusion/diffusion_engine.py
index 8e19f12426..e7b9bcf244 100644
--- a/vllm_omni/diffusion/diffusion_engine.py
+++ b/vllm_omni/diffusion/diffusion_engine.py
@@ -196,60 +196,46 @@ def add_req_and_wait_for_response(self, request: OmniDiffusionRequest):
     def start_profile(self, trace_filename: str | None = None) -> None:
         """
-        Start torch profiling on all diffusion workers.
+        Start profiling on all diffusion workers.

-        Creates a directory (if needed) and sets up a base filename template
-        for per-rank profiler traces (typically saved as