[Profiler] Add Nsight Systems support for serving #1098
ahengljh wants to merge 14 commits into vllm-project:main from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b23aa54006
vllm_omni/entrypoints/omni_stage.py (outdated)

```python
if task_type == OmniStageTaskType.PROFILER_START:
    # Signal nsys to begin capturing (no-op if not under nsys)
    try:
        torch.cuda.profiler.start()
        logger.info("[Stage-%s] CUDA profiler started (nsys capture region open)", stage_id)
```
Start CUDA profiler inside diffusion worker processes
This torch.cuda.profiler.start() call runs only in the stage worker process. For diffusion, the actual GPU kernels execute in subprocesses spawned by the diffusion executor (e.g., MultiprocDiffusionExecutor → WorkerProc), and those workers never call cudaProfilerStart. With --capture-range=cudaProfilerApi, nsys opens capture ranges per process, so the capture ranges in the child processes that actually do the CUDA work never open, and the nsys report ends up empty for diffusion workloads. Consider invoking torch.cuda.profiler.start()/stop() in DiffusionWorker.start_profile/stop_profile (or via the RPC path) so the capture range opens in the GPU worker processes.
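The point above can be sketched as follows. This is a minimal, hypothetical shape of DiffusionWorker.start_profile/stop_profile; the class body and the import guard are illustrative, not the PR's exact code:

```python
class DiffusionWorker:
    """Hypothetical sketch: the nsys capture range must be opened in the
    process that launches the CUDA kernels, not in the orchestrator."""

    def start_profile(self):
        # With --capture-range=cudaProfilerApi, nsys tracks capture ranges
        # per process, so this must run inside the GPU worker subprocess.
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.profiler.start()
        except ImportError:
            pass  # no torch / non-CUDA platform: no-op

    def stop_profile(self):
        # Close the capture range in the same process that opened it.
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.profiler.stop()
        except ImportError:
            pass
```

Outside an nsys session, cudaProfilerStart/Stop are harmless no-ops, so these hooks are safe to call unconditionally.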
@lishunyang12 @ZJY0516 PTAL if free, thx

provide an e2e example please

I would recommend splitting this PR into two: one for online serving profiling, and another for the nsys integration.
```python
    launched with ``--capture-range=cudaProfilerApi``) records GPU
    activity from within this worker process.
    """
    if torch.cuda.is_available():
```
Having to check whether CUDA is available every single time adds a lot of noise to the code.
@gcanlin Do you have any suggestion?
Any suggestion here? I am not sure; alternatively, I can revert these and keep them as the previous version.

For omni models, since we reuse vLLM's profiler for now, I don't think we need to add anything to support the CUDA profiler; it is already supported in gpu_worker.py. If we also need it for diffusion, we can refer to vLLM's wrapper implementation. Before that, maybe we should consider unifying the profiler between omni models and diffusion models. cc @lishunyang12
```python
class Worker(WorkerBase):
    # Torch/CUDA profiler. Enabled and configured through profiler_config.
    self.profiler: Any | None = None
    profiler_config = vllm_config.profiler_config
    if profiler_config.profiler == "torch":
        worker_name = f"{vllm_config.instance_id}-rank-{self.rank}"
        self.profiler = TorchProfilerWrapper(
            profiler_config,
            worker_name=worker_name,
            local_rank=self.local_rank,
            activities=["CPU", "CUDA"],
        )
    elif profiler_config.profiler == "cuda":
        self.profiler = CudaProfilerWrapper(profiler_config)
    else:
        self.profiler = None
```
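Unifying on this selection logic could look like a small shared factory. This is a sketch with stand-in wrapper classes; the real ones are vLLM's TorchProfilerWrapper and CudaProfilerWrapper, and create_profiler is a hypothetical name:

```python
class TorchProfilerWrapper:  # stand-in for vLLM's wrapper
    def __init__(self, worker_name, activities):
        self.worker_name = worker_name
        self.activities = activities

class CudaProfilerWrapper:  # stand-in for vLLM's wrapper
    pass

def create_profiler(profiler, rank):
    """Hypothetical shared factory so omni (LLM) and diffusion workers
    resolve profiler_config.profiler through one code path."""
    if profiler == "torch":
        return TorchProfilerWrapper(
            worker_name=f"rank-{rank}",
            activities=["CPU", "CUDA"],  # needed to capture CUDA kernels
        )
    if profiler == "cuda":
        return CudaProfilerWrapper()
    return None  # profiling disabled
```

Both worker types could then call the same factory at init time instead of duplicating the if/elif chain.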
BTW, #1136 is implementing online profiling :)

I see, I'll remove the online profiling part.

Can you provide test results? A trace graph attached to the description would be good.
Add CudaProfiler class and HTTP /start_profile, /stop_profile endpoints so that nsys can capture GPU-level traces during online serving via the cudaProfilerApi capture range. Both sync and async stage workers now call torch.cuda.profiler.start()/stop() alongside the existing torch profiler.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

Move /start_profile and /stop_profile from the module-level router to direct app registration via _register_profiling_routes(), called after build_app() returns. This ensures the routes exist on the app regardless of how vllm's build_app() handles router inclusion.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

vllm's build_app() only registers /start_profile and /stop_profile when profiler_config is explicitly set via CLI. For the omni server we always want these endpoints available so nsys profiling can be triggered via HTTP. Replace custom route handlers with a simple unconditional include of vllm's existing profile router.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

- Guard torch.cuda.profiler calls with torch.cuda.is_available() so non-CUDA platforms (ROCm, NPU, XPU) get no-ops instead of crashes
- Add torch.cuda.profiler.start()/stop() inside DiffusionWorker.start_profile/stop_profile so nsys captures GPU activity in the actual diffusion worker subprocesses
- Restructure profiling docs: move nsys online serving section to the top as the primary workflow, remove duplicate section
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

vLLM's Worker already has built-in CudaProfilerWrapper support that handles torch.cuda.profiler.start()/stop() in GPU worker processes when profiler_config.profiler == "cuda". The manual calls in omni_stage.py were redundant for LLM stages and ran in the wrong process (orchestrator instead of GPU worker). CUDA profiler calls remain in DiffusionWorker.start_profile/stop_profile since diffusion workers don't use vLLM's Worker infrastructure.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

Remove HTTP /start_profile and /stop_profile endpoint registration from api_server.py as someone else is handling online profiling. This PR now focuses purely on nsys integration for diffusion workers:
- CudaProfiler class with platform guards
- torch.cuda.profiler calls in DiffusionWorker.start_profile/stop_profile
- Updated docs for nsys usage with offline diffusion scripts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

Per review feedback: vLLM already has CudaProfilerWrapper that should be reused for unified profiler infrastructure. Remove the separate CudaProfiler class and keep only the direct torch.cuda.profiler calls in DiffusionWorker.start_profile/stop_profile as a minimal integration.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

Use vLLM's CudaProfilerWrapper/TorchProfilerWrapper in DiffusionWorker instead of custom implementation. This unifies the profiler approach between omni models and diffusion models.
- Import and use vLLM's profiler wrappers based on profiler_config
- VLLM_TORCH_CUDA_PROFILE=1 enables CudaProfilerWrapper for nsys
- VLLM_TORCH_PROFILER_DIR enables TorchProfilerWrapper for traces
- Remove dependency on CurrentProfiler from diffusion profiler module
- Update docs with vLLM-style nsys usage
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng <leeebucks@gmail.com>
I'd love to, but all screenshot uploads are blocked by our company's network policy...
```python
        logger.info("Diffusion worker %s: profiler stopped", self.rank)
        return None

    def execute_model(self, req: OmniDiffusionRequest, od_config: OmniDiffusionConfig) -> DiffusionOutput:
```
stop_profile() always returns None, which means DiffusionEngine.stop_profile() never gets any trace paths from workers. The elaborate aggregation logic in the engine becomes dead code.
TorchProfilerWrapper.stop() returns a dict with trace file paths — please return that result instead of discarding it:
```python
def stop_profile(self) -> dict | None:
    if self.profiler is not None:
        return self.profiler.stop()
    return None
```

```python
    """
    if self.profiler is not None:
        self.profiler.start()
        logger.info("Diffusion worker %s: profiler started", self.rank)
```
The trace_path_template parameter is accepted but never used — vLLM's wrappers get their paths from profiler_config at init time. This is confusing for callers. Consider removing it entirely or at minimum documenting that it's ignored.
```python
worker_name = f"diffusion-rank-{self.rank}"
self.profiler = TorchProfilerWrapper(
    profiler_config,
    worker_name=worker_name,
```
Missing activities parameter. vLLM's gpu_worker.py explicitly passes activities=["CPU", "CUDA"]:

```python
self.profiler = TorchProfilerWrapper(
    profiler_config,
    worker_name=worker_name,
    local_rank=self.local_rank,
    activities=["CPU", "CUDA"],  # <-- add this
)
```

Without it, the torch profiler may not capture CUDA kernels, which defeats the purpose of nsys integration.
```python
profiler_context = (
    self.profiler.annotate_context_manager("diffusion_forward") if self.profiler is not None else nullcontext()
)
with profiler_context:
```
Good use of annotate_context_manager and step() — this follows vLLM's pattern and gives clean trace segmentation per forward pass.
```python
        output_files["traces"].append(trace_path)
    elif isinstance(trace_path, list):
        output_files["traces"].extend(trace_path)
successful_traces = len(output_files["traces"])
```
Since workers always return None right now, the entire for rank, res in enumerate(results) loop body is effectively dead code (the if res is None: continue skips everything). This will become useful after fixing the worker's stop_profile() to return the wrapper's result.
```python
trace_filename = f"stage_{stage_id}_diffusion_{int(time.time())}"
stage_engine.start_profile(trace_filename=trace_filename)
logger.info("[Stage-%s] Diffusion Torch profiler started", stage_id)
profile_dir = os.environ.get("VLLM_TORCH_PROFILER_DIR")
```
nit: The comment # Sync call is safe here was left behind, but now this function is named handle_profiler_task_async. The comment is stale/misleading.
@ahengljh Hey, aligning the diffusion profiler with vLLM's CudaProfilerWrapper and TorchProfilerWrapper is the right approach — makes nsys and torch profiling work consistently across LLM and diffusion workers. Any blockers on getting this merged?

resolve conflicts
Summary

Related to #677

Follow vLLM's profiler pattern for diffusion workers — use CudaProfilerWrapper and TorchProfilerWrapper from vLLM instead of a custom implementation.

How It Works

Diffusion workers now use the same profiler infrastructure as vLLM's LLM workers:

- VLLM_TORCH_CUDA_PROFILE=1 → uses CudaProfilerWrapper for nsys integration
- VLLM_TORCH_PROFILER_DIR=./profiles → uses TorchProfilerWrapper for detailed traces

Nsys usage:

```shell
export VLLM_TORCH_CUDA_PROFILE=1
nsys profile \
  --capture-range=cudaProfilerApi \
  --capture-range-end=repeat \
  --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  -o diffusion_trace \
  python image_to_video.py --model Wan-AI/Wan2.2-I2V-A14B-Diffusers ...
```

Files Changed

- vllm_omni/diffusion/worker/diffusion_worker.py: use vLLM's profiler wrappers based on profiler_config
- docs/contributing/profiling.md: document nsys usage with VLLM_TORCH_CUDA_PROFILE=1

Test Results