
Commit 9f552d0: [Doc] Add user guide for diffusion model profiling (#738)
Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Parent: 776c3a7

File tree: 1 file changed (+111, -44)

docs/contributing/profiling.md

# Profiling vLLM-Omni
> **Warning:** Profiling incurs significant overhead. Use only for development and debugging, never in production.

vLLM-Omni uses the PyTorch Profiler to analyze performance across both **Multi-Stage LLMs** and **Diffusion Models**.
### 1. Set the Output Directory
Before running any script, set this environment variable; the system detects it and automatically saves traces to that directory.
```bash
export VLLM_TORCH_PROFILER_DIR=./profiles
```
### 2. Start Profiling
It is best to limit profiling to one iteration to keep trace files manageable.
```bash
export VLLM_PROFILER_MAX_ITERS=1
```
**Selective Stage Profiling**
The profiler defaults to running across all stages, but it is highly recommended to profile specific stages by passing a `stages` list, which prevents producing overly large trace files:
```python
# Profile all stages
omni_llm.start_profile()
# Only profile Stage 1
omni_llm.start_profile(stages=[1])
```

```python
# Profile Stage 0 and Stage 2
omni_llm.start_profile(stages=[0, 2])
```
**Python Usage**: Wrap your generation logic with `start_profile()` and `stop_profile()`.

```python
import os
import time

from vllm_omni import OmniLLM

omni_llm = OmniLLM.from_engine_args(engine_args)

profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))

# 1. Start profiling if enabled (here: only Stage 0)
if profiler_enabled:
    omni_llm.start_profile(stages=[0])

# Initialize the generator
omni_generator = omni_llm.generate(prompts, sampling_params_list, py_generator=args.py_generator)

total_requests = len(prompts)
processed_count = 0

# Main processing loop
for stage_outputs in omni_generator:

    # ... [output processing logic for text/audio would go here] ...

    # Update the count to track when to stop profiling
    processed_count += len(stage_outputs.request_output)

    # 2. Once all requests are done, stop the profiler safely
    if profiler_enabled and processed_count >= total_requests:
        print(f"[Info] Processed {processed_count}/{total_requests}. Stopping profiler inside active loop...")

        # Stop the profiler while workers are still active
        omni_llm.stop_profile()

        # Wait for traces to flush to disk
        print("[Info] Waiting 30s for workers to write trace files to disk...")
        time.sleep(30)
        print("[Info] Trace export wait time finished.")

omni_llm.close()
```
77+
**Examples**:

1. **Qwen2.5-Omni**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen2_5_omni/end2end.py)

2. **Qwen3-Omni**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/qwen3_omni/end2end.py)

**For Diffusion Models as a single stage**

Diffusion profiling is end-to-end, capturing encoding, the denoising loop, and decoding.

**CLI Usage:**
```bash
# Minimize spatial dimensions (--height/--width). Optional but helpful:
# drastically reduces memory usage so the profiler doesn't crash from
# its overhead, though for accurate performance tuning you often want
# target resolutions.
#
# Minimize the temporal dimension (--num_frames). Video models process
# 3D tensors (time, height, width); reducing frames to the minimum (2)
# keeps tensors small so the trace file doesn't grow to multiple
# gigabytes.
#
# Minimize the iteration loop (--num_inference_steps). The most critical
# setting for profiling: diffusion models run the same loop N times, so
# profiling 2 steps yields the same per-step performance data as 50
# while saving minutes of runtime and keeping the trace viewer
# responsive.
python image_to_video.py \
    --model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
    --image qwen-bear.png \
    --prompt "A cat playing with yarn, smooth motion" \
    --height 48 \
    --width 64 \
    --num_frames 2 \
    --num_inference_steps 2 \
    --guidance_scale 5.0 \
    --guidance_scale_high 6.0 \
    --boundary_ratio 0.875 \
    --flow_shift 12.0 \
    --fps 16 \
    --output i2v_output.mp4
```
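To see why shrinking these dimensions matters so much, here is a rough back-of-the-envelope comparison; the full-size defaults are illustrative assumptions, not values from this guide:

```python
def voxels(frames: int, height: int, width: int) -> int:
    """Pixel-time elements processed per denoising step (up to constant factors)."""
    return frames * height * width

# Assumed full-size run vs. the minimized profiling settings above
full = voxels(frames=81, height=480, width=832)  # illustrative typical defaults
tiny = voxels(frames=2, height=48, width=64)     # --num_frames 2 --height 48 --width 64
print(f"~{full // tiny}x fewer elements per denoising step when minimized")
```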
**Examples**:
1. **Qwen-Image-Edit**: [https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py)
2. **Wan-AI/Wan2.2-I2V-A14B-Diffusers**: [https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)

**Online Inference (Async)**
For online serving using AsyncOmni, the methods are asynchronous. This allows you to toggle profiling dynamically without restarting the server.
```python
await async_omni.start_profile()

async for output in async_omni.generate(prompt, sampling_params, request_id):
    ...  # process streamed outputs

await async_omni.stop_profile()
```
### 3. Analyzing Omni Traces

Output files are saved to your configured `VLLM_TORCH_PROFILER_DIR`.

**Output:**

- **Chrome Trace** (`.json.gz`): a visual timeline of kernels and stages. Open it in the Perfetto UI.
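Traces can also be inspected programmatically before opening a viewer. A minimal sketch, assuming the gzipped file holds Chrome-trace JSON with events under the `traceEvents` key (the file name below is illustrative):

```python
import gzip
import json
from collections import Counter

def summarize_trace(path: str) -> Counter:
    """Count Chrome-trace events by name in a gzipped trace file."""
    with gzip.open(path, "rt") as f:
        trace = json.load(f)
    # Chrome traces are either an object with "traceEvents" or a bare event list
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    return Counter(e.get("name", "<unnamed>") for e in events)

if __name__ == "__main__":
    # Illustrative name; real files include rank and timestamp suffixes
    for name, count in summarize_trace("rank-0.pt.trace.json.gz").most_common(10):
        print(f"{count:6d}  {name}")
```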

**Viewing Tools:**

- [Perfetto](https://ui.perfetto.dev/) (recommended)
- `chrome://tracing` (Chrome only)
**Note**: vLLM-Omni reuses the PyTorch Profiler infrastructure from vLLM. See the official vLLM profiler documentation: [vLLM Profiling Guide](https://docs.vllm.ai/en/latest/dev/profiling.html)
