Commit b23aa54

ahengljh and claude committed
[Profiler] Add Nsight Systems support for online serving
Add CudaProfiler class and HTTP /start_profile, /stop_profile endpoints so that nsys can capture GPU-level traces during online serving via the cudaProfilerApi capture range. Both sync and async stage workers now call torch.cuda.profiler.start()/stop() alongside the existing torch profiler.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 4ba0cf0 commit b23aa54

File tree

5 files changed: +178 −3 lines


docs/contributing/profiling.md

Lines changed: 48 additions & 2 deletions
@@ -132,9 +132,55 @@ python image_to_video.py \
  2. **Wan-AI/Wan2.2-I2V-A14B-Diffusers**: [https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)

  > **Note:**
- As of now, asynchronous (online) profiling is not fully supported in vLLM-Omni. While start_profile() and stop_profile() methods exist, they are only reliable in offline inference scripts (e.g., the provided end2end.py examples). Do not use them in server-mode or streaming scenarios—traces may be incomplete or fail to flush.
+ The PyTorch Profiler (`start_profile()` / `stop_profile()`) is primarily designed for offline inference scripts. For online (server-mode) profiling, use Nsight Systems as described below.

- ### 4. Analyzing Omni Traces
+ ### 4. Nsight Systems Profiling for Online Serving
+
+ NVIDIA Nsight Systems (`nsys`) can capture GPU-level traces while the server is running. The API server exposes `/start_profile` and `/stop_profile` HTTP endpoints that signal nsys via `torch.cuda.profiler.start()` / `stop()`.
+
+ **Step 1 — Launch the server under nsys:**
+
+ ```bash
+ nsys profile \
+     --capture-range=cudaProfilerApi \
+     --capture-range-end=repeat \
+     --trace-fork-before-exec=true \
+     --cuda-graph-trace=node \
+     vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091
+ ```
+
+ `--capture-range=cudaProfilerApi` tells nsys to sit idle until `torch.cuda.profiler.start()` is called in a worker process. `--capture-range-end=repeat` allows multiple start/stop cycles in the same session.
+
+ **Step 2 — Start profiling:**
+
+ ```bash
+ curl -X POST http://localhost:8091/start_profile
+ ```
+
+ **Step 3 — Send requests:**
+
+ ```bash
+ curl -X POST http://localhost:8091/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{"model":"Qwen/Qwen2.5-Omni-7B","messages":[{"role":"user","content":"Hello"}]}'
+ ```
+
+ **Step 4 — Stop profiling:**
+
+ ```bash
+ curl -X POST http://localhost:8091/stop_profile
+ ```
+
+ **Step 5 — Shut down the server** (Ctrl+C). nsys writes a `.nsys-rep` file in the current directory.
+
+ ```bash
+ ls *.nsys-rep
+ nsys stats report1.nsys-rep
+ ```
+
+ Open the `.nsys-rep` file in the Nsight Systems GUI for a detailed timeline of CUDA kernels, memory operations, and NVTX ranges.
+
+ ### 5. Analyzing Omni Traces

  Output files are saved to your configured `VLLM_TORCH_PROFILER_DIR`.

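Besides the GUI, `nsys stats` can emit machine-readable summaries (for example with `--format csv`), which makes it easy to post-process kernel timings in Python. A minimal sketch with the standard-library `csv` module; the sample text, report name, and column headers below are illustrative assumptions, since the actual columns vary by nsys version and report type:

```python
import csv
import io

# Illustrative sample of a `nsys stats --format csv` kernel-summary report.
# Column names are assumptions; check your nsys version's actual output.
sample = """Time (%),Total Time (ns),Instances,Name
42.5,1250000,100,flash_attn_kernel
30.1,885000,100,gemm_kernel
"""

def top_kernels(csv_text: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n kernels with the largest total GPU time."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Sort by total time, descending, and keep (name, total_ns) pairs.
    rows.sort(key=lambda r: int(r["Total Time (ns)"]), reverse=True)
    return [(r["Name"], int(r["Total Time (ns)"])) for r in rows[:n]]

print(top_kernels(sample))
```

This kind of script is handy for diffing kernel hotspots between two capture sessions without opening the GUI.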
Lines changed: 2 additions & 1 deletion
@@ -1,9 +1,10 @@
  # SPDX-License-Identifier: Apache-2.0
  # SPDX-FileCopyrightText: Copyright contributors to the vLLM project

+ from .cuda_profiler import CudaProfiler
  from .torch_profiler import TorchProfiler

  # Default profiler – can be changed later via config
  CurrentProfiler = TorchProfiler

- __all__ = ["CurrentProfiler", "TorchProfiler"]
+ __all__ = ["CudaProfiler", "CurrentProfiler", "TorchProfiler"]
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+ # SPDX-License-Identifier: Apache-2.0
+ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+
+ from contextlib import nullcontext
+
+ import torch
+ from vllm.logger import init_logger
+
+ from .base import ProfilerBase
+
+ logger = init_logger(__name__)
+
+
+ class CudaProfiler(ProfilerBase):
+     """
+     Lightweight profiler that signals nsys via the CUDA Profiler API.
+
+     When the server is launched under ``nsys profile
+     --capture-range=cudaProfilerApi``, calling ``start()`` /
+     ``stop()`` brackets the region that nsys will capture. No trace
+     files are written by this class — nsys handles all tracing
+     externally and produces a ``.nsys-rep`` file on process exit.
+     """
+
+     _active: bool = False
+
+     @classmethod
+     def start(cls, trace_path_template: str = "") -> str:
+         """Start the CUDA profiler range for nsys capture."""
+         if cls._active:
+             logger.warning("[Rank %s] CUDA profiler already active", cls._get_rank())
+             return ""
+         torch.cuda.profiler.start()
+         cls._active = True
+         logger.info("[Rank %s] CUDA profiler started (nsys capture region open)", cls._get_rank())
+         return ""
+
+     @classmethod
+     def stop(cls) -> str | None:
+         """Stop the CUDA profiler range for nsys capture."""
+         if not cls._active:
+             return None
+         torch.cuda.profiler.stop()
+         cls._active = False
+         logger.info("[Rank %s] CUDA profiler stopped (nsys capture region closed)", cls._get_rank())
+         return None
+
+     @classmethod
+     def get_step_context(cls):
+         """Return an NVTX range context manager when active, else no-op."""
+         if cls._active:
+             return torch.cuda.nvtx.range("step")
+         return nullcontext()
+
+     @classmethod
+     def is_active(cls) -> bool:
+         return cls._active
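The `_active` guard above makes `start()`/`stop()` idempotent: a second `start()` or an unmatched `stop()` never reaches the driver, so nsys sees exactly one open/close per cycle. That invariant can be exercised without a GPU by stubbing the driver calls; a minimal, self-contained sketch where `FakeCudaProfiler` and `GuardedProfiler` are illustrative names, not part of vLLM-Omni:

```python
class FakeCudaProfiler:
    """Stand-in for torch.cuda.profiler that records every driver call."""
    calls: list[str] = []

    @classmethod
    def start(cls) -> None:
        cls.calls.append("start")

    @classmethod
    def stop(cls) -> None:
        cls.calls.append("stop")


class GuardedProfiler:
    """Same idempotence pattern as CudaProfiler, with the driver stubbed."""
    _active = False

    @classmethod
    def start(cls) -> None:
        if cls._active:          # second start is a no-op
            return
        FakeCudaProfiler.start()
        cls._active = True

    @classmethod
    def stop(cls) -> None:
        if not cls._active:      # stop without a matching start is a no-op
            return
        FakeCudaProfiler.stop()
        cls._active = False


GuardedProfiler.start()
GuardedProfiler.start()  # ignored: capture region already open
GuardedProfiler.stop()
GuardedProfiler.stop()   # ignored: capture region already closed
print(FakeCudaProfiler.calls)  # exactly one start/stop pair reached the "driver"
```

The same double-call sequence against a raw `torch.cuda.profiler` would emit redundant range markers, which is why the class-level flag exists.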

vllm_omni/entrypoints/omni_stage.py

Lines changed: 32 additions & 0 deletions
@@ -732,7 +732,16 @@ def _stage_worker(
  def handle_profiler_task_local(task_type: OmniStageTaskType) -> dict:
      """Handle profiler task locally in the worker process."""
+     import torch
+
      if task_type == OmniStageTaskType.PROFILER_START:
+         # Signal nsys to begin capturing (no-op if not under nsys)
+         try:
+             torch.cuda.profiler.start()
+             logger.info("[Stage-%s] CUDA profiler started (nsys capture region open)", stage_id)
+         except Exception as e:
+             logger.warning("[Stage-%s] Failed to start CUDA profiler: %s", stage_id, e)
+
          if stage_type == "diffusion":
              try:
                  profile_dir = _os.environ.get("VLLM_TORCH_PROFILER_DIR", "./profiles")

@@ -751,6 +760,13 @@ def handle_profiler_task_local(task_type: OmniStageTaskType) -> dict:
              return {}

      elif task_type == OmniStageTaskType.PROFILER_STOP:
+         # Signal nsys to stop capturing (no-op if not under nsys)
+         try:
+             torch.cuda.profiler.stop()
+             logger.info("[Stage-%s] CUDA profiler stopped (nsys capture region closed)", stage_id)
+         except Exception as e:
+             logger.warning("[Stage-%s] Failed to stop CUDA profiler: %s", stage_id, e)
+
          if stage_type == "diffusion":
              try:
                  # CRITICAL: Capture return value

@@ -1285,7 +1301,16 @@ async def _force_log():
  async def handle_profiler_task_async(task_type: OmniStageTaskType) -> None:
      """Handle profiler task asynchronously for both LLM and diffusion stages."""
+     import torch
+
      if task_type == OmniStageTaskType.PROFILER_START:
+         # Signal nsys to begin capturing (no-op if not under nsys)
+         try:
+             torch.cuda.profiler.start()
+             logger.info("[Stage-%s] CUDA profiler started (nsys capture region open)", stage_id)
+         except Exception as e:
+             logger.warning("[Stage-%s] Failed to start CUDA profiler: %s", stage_id, e)
+
          if stage_type == "diffusion":
              try:
                  # Sync call is safe here — diffusion profiling is lightweight

@@ -1304,6 +1329,13 @@ async def handle_profiler_task_async(task_type: OmniStageTaskType) -> None:
                  logger.warning("[Stage-%s] Failed to start vLLM profiler: %s", stage_id, e)

      elif task_type == OmniStageTaskType.PROFILER_STOP:
+         # Signal nsys to stop capturing (no-op if not under nsys)
+         try:
+             torch.cuda.profiler.stop()
+             logger.info("[Stage-%s] CUDA profiler stopped (nsys capture region closed)", stage_id)
+         except Exception as e:
+             logger.warning("[Stage-%s] Failed to stop CUDA profiler: %s", stage_id, e)
+
          if stage_type == "diffusion":
              try:
                  trace_files = stage_engine.stop_profile()

vllm_omni/entrypoints/openai/api_server.py

Lines changed: 39 additions & 0 deletions
@@ -736,6 +736,45 @@ async def create_chat_completion(request: ChatCompletionRequest, raw_request: Re
          return StreamingResponse(content=generator, media_type="text/event-stream")


+ @router.post("/start_profile")
+ async def start_profile(raw_request: Request) -> JSONResponse:
+     """Start profiling on all stages.
+
+     When the server is running under nsys with
+     ``--capture-range=cudaProfilerApi``, this also opens the CUDA
+     profiler capture region.
+     """
+     engine_client = raw_request.app.state.engine_client
+     try:
+         await engine_client.start_profile()
+     except Exception as e:
+         logger.exception("Failed to start profile: %s", e)
+         raise HTTPException(
+             status_code=HTTPStatus.INTERNAL_SERVER_ERROR.value,
+             detail=str(e),
+         ) from e
+     return JSONResponse(content={"status": "ok"})
+
+
+ @router.post("/stop_profile")
+ async def stop_profile(raw_request: Request) -> JSONResponse:
+     """Stop profiling on all stages.
+
+     When running under nsys, this closes the CUDA profiler capture
+     region so nsys finalises the current capture.
+     """
+     engine_client = raw_request.app.state.engine_client
+     try:
+         await engine_client.stop_profile()
+     except Exception as e:
+         logger.exception("Failed to stop profile: %s", e)
+         raise HTTPException(
+             status_code=HTTPStatus.INTERNAL_SERVER_ERROR.value,
+             detail=str(e),
+         ) from e
+     return JSONResponse(content={"status": "ok"})
+
+
  _remove_route_from_router(router, "/v1/audio/speech", {"POST"})