
Commit 9f552d0: [Doc] Add user guide for diffusion model profiling (#738)
Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Parent: 776c3a7

File tree: 1 file changed (+111, -44)

docs/contributing/profiling.md

# Profiling vLLM-Omni
> **Warning:** Profiling incurs significant overhead. Use only for development and debugging, never in production.

vLLM-Omni uses the PyTorch Profiler to analyze performance across both **Multi-Stage LLMs** and **Diffusion Models**.
### 1. Set the Output Directory
Before running any script, set this environment variable; the system detects it and automatically saves traces to that directory.
```bash
export VLLM_TORCH_PROFILER_DIR=./profiles
```
### 2. Start Profiling
It is best to limit profiling to one iteration to keep trace files manageable.
```bash
export VLLM_PROFILER_MAX_ITERS=1
```
**Selective Stage Profiling**
The profiler defaults to running across all stages, but it is highly recommended to profile specific stages by passing a `stages` list, which prevents producing overly large trace files:
```python
# Profile all stages
omni_llm.start_profile()
# Only profile Stage 1
omni_llm.start_profile(stages=[1])
```

```python
# Profile Stage 0 and Stage 2
omni_llm.start_profile(stages=[0, 2])
```
**Python Usage**: Wrap your generation logic with `start_profile()` and `stop_profile()`.

```python
import os
import time

from vllm_omni import OmniLLM

omni_llm = OmniLLM.from_engine_args(engine_args)

profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))

# 1. Start profiling if enabled (here: only Stage 0)
if profiler_enabled:
    omni_llm.start_profile(stages=[0])

# Initialize the generator
omni_generator = omni_llm.generate(prompts, sampling_params_list, py_generator=args.py_generator)

total_requests = len(prompts)
processed_count = 0

# Main processing loop
for stage_outputs in omni_generator:

    # ... [output processing logic for text/audio would go here] ...

    # Update the count to track when to stop profiling
    processed_count += len(stage_outputs.request_output)

    # 2. Once all requests are done, stop the profiler safely
    if profiler_enabled and processed_count >= total_requests:
        print(f"[Info] Processed {processed_count}/{total_requests}. Stopping profiler inside active loop...")

        # Stop the profiler while workers are still active
        omni_llm.stop_profile()

        # Wait for traces to flush to disk
        print("[Info] Waiting 30s for workers to write trace files to disk...")
        time.sleep(30)
        print("[Info] Trace export wait time finished.")

omni_llm.close()
```
77+
**Examples**:

1. **Qwen2.5-Omni**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py)

2. **Qwen3-Omni**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py)

**For Diffusion Models as a single stage**

Diffusion profiling is end-to-end, capturing encoding, the denoising loop, and decoding.

**CLI Usage:**
```bash
# Minimize spatial dimensions (--height/--width). Optional but helpful:
# drastically reduces memory usage so the profiler doesn't crash from
# its overhead, though for accurate performance tuning you often want
# target resolutions.
#
# Minimize the temporal dimension (--num_frames). Video models process
# 3D tensors (time, height, width); reducing frames to the minimum (2)
# keeps tensors small so the trace file doesn't grow to multiple
# gigabytes.
#
# Minimize the iteration loop (--num_inference_steps). The most critical
# setting for profiling: diffusion models run the same loop N times, so
# profiling 2 steps yields the same per-step performance data as 50
# while saving minutes of runtime and keeping the trace viewer
# responsive.
python image_to_video.py \
    --model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
    --image qwen-bear.png \
    --prompt "A cat playing with yarn, smooth motion" \
    --height 48 \
    --width 64 \
    --num_frames 2 \
    --num_inference_steps 2 \
    --guidance_scale 5.0 \
    --guidance_scale_high 6.0 \
    --boundary_ratio 0.875 \
    --flow_shift 12.0 \
    --fps 16 \
    --output i2v_output.mp4
```
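To see why shrinking these dimensions matters so much, here is a rough back-of-the-envelope comparison; the full-size defaults are illustrative assumptions, not values from this guide:

```python
def voxels(frames: int, height: int, width: int) -> int:
    """Pixel-time elements processed per denoising step (up to constant factors)."""
    return frames * height * width

# Assumed full-size run vs. the minimized profiling settings above
full = voxels(frames=81, height=480, width=832)  # illustrative typical defaults
tiny = voxels(frames=2, height=48, width=64)     # --num_frames 2 --height 48 --width 64
print(f"~{full // tiny}x fewer elements per denoising step when minimized")
```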
**Examples**:
1. **Qwen-Image-Edit**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py)
2. **Wan-AI/Wan2.2-I2V-A14B-Diffusers**: [https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)

**Online Inference (Async)**
For online serving using AsyncOmni, the methods are asynchronous. This allows you to toggle profiling dynamically without restarting the server.
```python
await async_omni.start_profile()

async for output in async_omni.generate(prompt, sampling_params, request_id):
    ...  # process streamed outputs

await async_omni.stop_profile()
```
### 3. Analyzing Omni Traces

Output files are saved to your configured `VLLM_TORCH_PROFILER_DIR`.

**Output:**

- **Chrome Trace** (`.json.gz`): a visual timeline of kernels and stages. Open it in the Perfetto UI.
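Traces can also be inspected programmatically before opening a viewer. A minimal sketch, assuming the gzipped file holds Chrome-trace JSON with events under the `traceEvents` key (the file name below is illustrative):

```python
import gzip
import json
from collections import Counter

def summarize_trace(path: str) -> Counter:
    """Count Chrome-trace events by name in a gzipped trace file."""
    with gzip.open(path, "rt") as f:
        trace = json.load(f)
    # Chrome traces are either an object with "traceEvents" or a bare event list
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    return Counter(e.get("name", "<unnamed>") for e in events)

if __name__ == "__main__":
    # Illustrative name; real files include rank and timestamp suffixes
    for name, count in summarize_trace("rank-0.pt.trace.json.gz").most_common(10):
        print(f"{count:6d}  {name}")
```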

**Viewing Tools:**

- [Perfetto](https://ui.perfetto.dev/) (recommended)
- `chrome://tracing` (Chrome only)
**Note**: vLLM-Omni reuses the PyTorch Profiler infrastructure from vLLM. See the official vLLM profiler documentation: [vLLM Profiling Guide](https://docs.vllm.ai/en/latest/dev/profiling.html)
