
[Profiler] Add Nsight Systems support for serving #1098

Open
ahengljh wants to merge 14 commits into vllm-project:main from ahengljh:enable_nsight_profiling

Conversation

@ahengljh

@ahengljh ahengljh commented Jan 30, 2026

Summary

Related to #677

Follow vLLM's profiler pattern for diffusion workers — use CudaProfilerWrapper and TorchProfilerWrapper from vLLM instead of custom implementation.

How It Works

Diffusion workers now use the same profiler infrastructure as vLLM's LLM workers:

  • VLLM_TORCH_CUDA_PROFILE=1 → uses CudaProfilerWrapper for nsys integration
  • VLLM_TORCH_PROFILER_DIR=./profiles → uses TorchProfilerWrapper for detailed traces
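
For illustration, the selection these two variables drive can be sketched in pure Python. The helper name `select_profiler` and the precedence (trace directory wins over the CUDA flag) are assumptions for this sketch, not code from the PR:

```python
def select_profiler(env):
    """Hypothetical helper: map the environment variables above to the
    profiler wrapper the diffusion worker would construct. Precedence
    (torch trace dir over the CUDA flag) is an assumption."""
    if env.get("VLLM_TORCH_PROFILER_DIR"):
        return "torch"  # TorchProfilerWrapper: detailed traces on disk
    if env.get("VLLM_TORCH_CUDA_PROFILE") == "1":
        return "cuda"   # CudaProfilerWrapper: opens the nsys capture range
    return None         # profiling disabled
```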

Nsys usage:

export VLLM_TORCH_CUDA_PROFILE=1

nsys profile \
  --capture-range=cudaProfilerApi \
  --capture-range-end=repeat \
  --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  -o diffusion_trace \
  python image_to_video.py --model Wan-AI/Wan2.2-I2V-A14B-Diffusers ...

Files Changed

File | Change
vllm_omni/diffusion/worker/diffusion_worker.py | Use vLLM's profiler wrappers based on profiler_config
docs/contributing/profiling.md | Updated nsys usage with VLLM_TORCH_CUDA_PROFILE=1

Test Results

@ahengljh ahengljh changed the title [Profiler] Add Nsight Systems support for online serving [Profiler] Add Nsight Systems support for serving Jan 30, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b23aa54006


Comment on lines 737 to 741
if task_type == OmniStageTaskType.PROFILER_START:
    # Signal nsys to begin capturing (no-op if not under nsys)
    try:
        torch.cuda.profiler.start()
        logger.info("[Stage-%s] CUDA profiler started (nsys capture region open)", stage_id)


P2 Badge Start CUDA profiler inside diffusion worker processes

This torch.cuda.profiler.start() call runs only in the stage worker process. For diffusion, actual GPU kernels execute in subprocesses spawned by the diffusion executor (e.g., MultiprocDiffusionExecutorWorkerProc), and those workers never call cudaProfilerStart. With --capture-range=cudaProfilerApi, nsys opens capture ranges per process, so the child processes doing the CUDA work stay closed and the nsys report ends up empty for diffusion workloads. Consider invoking torch.cuda.profiler.start()/stop() in DiffusionWorker.start_profile/stop_profile (or via the RPC path) so the capture range opens in the GPU worker processes.
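
A minimal sketch of the fix this comment suggests — opening the capture range inside the worker process, with the platform guard the later commits describe. The class name and boolean return values are illustrative, not the PR's actual code:

```python
class DiffusionWorkerProfiling:
    """Illustrative sketch: run cudaProfilerStart/Stop in the process
    that executes the GPU kernels, so nsys's per-process capture range
    opens where the work actually happens."""

    def start_profile(self):
        """Open the nsys capture range; returns True if it was opened."""
        try:
            import torch
        except ImportError:
            return False  # torch not installed: nothing to profile
        if not torch.cuda.is_available():
            return False  # non-CUDA platform: no-op instead of a crash
        torch.cuda.profiler.start()  # no-op unless running under nsys
        return True

    def stop_profile(self):
        """Close the capture range opened by start_profile()."""
        try:
            import torch
        except ImportError:
            return False
        if not torch.cuda.is_available():
            return False
        torch.cuda.profiler.stop()
        return True
```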


@ahengljh ahengljh force-pushed the enable_nsight_profiling branch from b23aa54 to 8c2baf3 Compare January 30, 2026 06:00
@ahengljh ahengljh force-pushed the enable_nsight_profiling branch 2 times, most recently from 84ebfcb to 6408db4 Compare January 30, 2026 07:58
@david6666666 david6666666 linked an issue Jan 30, 2026 that may be closed by this pull request
@david6666666
Collaborator

@lishunyang12 @ZJY0516 PTAL if free, thx

@hsliuustc0106
Collaborator

Please provide an e2e example.

@ZJY0516
Collaborator

ZJY0516 commented Jan 31, 2026

I would recommend splitting this PR into two: one for online serving profiling, and another for the nsys integration.

    launched with ``--capture-range=cudaProfilerApi``) records GPU
    activity from within this worker process.
    """
    if torch.cuda.is_available():
Collaborator


Having to check whether it's CUDA every single time adds a lot of noise to the code.

@gcanlin Do you have any suggestion?

Author


Any suggestions here? I'm not sure. Alternatively, I can revert these and keep them as in the previous version.

Contributor

@gcanlin gcanlin Feb 3, 2026


For omni models, since we reuse vLLM's profiler for now, I don't think we need to add anything to support the CUDA profiler — it is already supported in gpu_worker.py. If we also need it for diffusion, we can refer to vLLM's wrapper implementation. Before that, maybe we should consider unifying the profiler between omni models and diffusion models. cc @lishunyang12

class Worker(WorkerBase):
        # Torch/CUDA profiler. Enabled and configured through profiler_config.
        self.profiler: Any | None = None
        profiler_config = vllm_config.profiler_config
        if profiler_config.profiler == "torch":
            worker_name = f"{vllm_config.instance_id}-rank-{self.rank}"
            self.profiler = TorchProfilerWrapper(
                profiler_config,
                worker_name=worker_name,
                local_rank=self.local_rank,
                activities=["CPU", "CUDA"],
            )
        elif profiler_config.profiler == "cuda":
            self.profiler = CudaProfilerWrapper(profiler_config)
        else:
            self.profiler = None

@gcanlin
Contributor

gcanlin commented Feb 3, 2026

BTW, #1136 is implementing online profiling:)

@ahengljh
Author

ahengljh commented Feb 3, 2026

BTW, #1136 is implementing online profiling:)

I see, I'll remove online profiling part.

@lishunyang12
Contributor

BTW, #1136 is implementing online profiling:)

I see, I'll remove online profiling part.

Can you provide test results, maybe a trace graph attached to description would be good.

ahengljh and others added 9 commits February 3, 2026 16:10
Add CudaProfiler class and HTTP /start_profile, /stop_profile endpoints
so that nsys can capture GPU-level traces during online serving via the
cudaProfilerApi capture range. Both sync and async stage workers now call
torch.cuda.profiler.start()/stop() alongside the existing torch profiler.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Move /start_profile and /stop_profile from the module-level router
to direct app registration via _register_profiling_routes(), called
after build_app() returns. This ensures the routes exist on the app
regardless of how vllm's build_app() handles router inclusion.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
vllm's build_app() only registers /start_profile and /stop_profile
when profiler_config is explicitly set via CLI.  For the omni server
we always want these endpoints available so nsys profiling can be
triggered via HTTP.  Replace custom route handlers with a simple
unconditional include of vllm's existing profile router.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
- Guard torch.cuda.profiler calls with torch.cuda.is_available() so
  non-CUDA platforms (ROCm, NPU, XPU) get no-ops instead of crashes
- Add torch.cuda.profiler.start()/stop() inside
  DiffusionWorker.start_profile/stop_profile so nsys captures GPU
  activity in the actual diffusion worker subprocesses
- Restructure profiling docs: move nsys online serving section to
  the top as the primary workflow, remove duplicate section

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
vLLM's Worker already has built-in CudaProfilerWrapper support that
handles torch.cuda.profiler.start()/stop() in GPU worker processes
when profiler_config.profiler == "cuda". The manual calls in
omni_stage.py were redundant for LLM stages and ran in the wrong
process (orchestrator instead of GPU worker).

CUDA profiler calls remain in DiffusionWorker.start_profile/stop_profile
since diffusion workers don't use vLLM's Worker infrastructure.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Remove HTTP /start_profile and /stop_profile endpoint registration
from api_server.py as someone else is handling online profiling.

This PR now focuses purely on nsys integration for diffusion workers:
- CudaProfiler class with platform guards
- torch.cuda.profiler calls in DiffusionWorker.start_profile/stop_profile
- Updated docs for nsys usage with offline diffusion scripts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Per review feedback: vLLM already has CudaProfilerWrapper that should
be reused for unified profiler infrastructure. Remove the separate
CudaProfiler class and keep only the direct torch.cuda.profiler calls
in DiffusionWorker.start_profile/stop_profile as a minimal integration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Use vLLM's CudaProfilerWrapper/TorchProfilerWrapper in DiffusionWorker
instead of custom implementation. This unifies the profiler approach
between omni models and diffusion models.

- Import and use vLLM's profiler wrappers based on profiler_config
- VLLM_TORCH_CUDA_PROFILE=1 enables CudaProfilerWrapper for nsys
- VLLM_TORCH_PROFILER_DIR enables TorchProfilerWrapper for traces
- Remove dependency on CurrentProfiler from diffusion profiler module
- Update docs with vLLM-style nsys usage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
@ahengljh ahengljh force-pushed the enable_nsight_profiling branch from 1621416 to b0c7853 Compare February 3, 2026 08:11
ahengljh and others added 3 commits February 3, 2026 16:24
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
ahengljh and others added 2 commits February 3, 2026 16:57
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng <leeebucks@gmail.com>
@ahengljh
Author

ahengljh commented Feb 3, 2026

BTW, #1136 is implementing online profiling:)

I see, I'll remove online profiling part.

Can you provide test results, maybe a trace graph attached to description would be good.

I'd love to, but uploading screenshots is blocked by our company's network policy...

Contributor

@lishunyang12 lishunyang12 left a comment


I will review it now; sorry for the late response.

Contributor

@lishunyang12 lishunyang12 left a comment


Thanks for aligning diffusion profiling with vLLM's infrastructure — this is the right direction.

        logger.info("Diffusion worker %s: profiler stopped", self.rank)
        return None

    def execute_model(self, req: OmniDiffusionRequest, od_config: OmniDiffusionConfig) -> DiffusionOutput:
Contributor


stop_profile() always returns None, which means DiffusionEngine.stop_profile() never gets any trace paths from workers. The elaborate aggregation logic in the engine becomes dead code.

TorchProfilerWrapper.stop() returns a dict with trace file paths — please return that result instead of discarding it:

def stop_profile(self) -> dict | None:
    if self.profiler is not None:
        return self.profiler.stop()
    return None

        """
        if self.profiler is not None:
            self.profiler.start()
            logger.info("Diffusion worker %s: profiler started", self.rank)
Contributor


The trace_path_template parameter is accepted but never used — vLLM's wrappers get their paths from profiler_config at init time. This is confusing for callers. Consider removing it entirely or at minimum documenting that it's ignored.

        worker_name = f"diffusion-rank-{self.rank}"
        self.profiler = TorchProfilerWrapper(
            profiler_config,
            worker_name=worker_name,
Contributor


Missing activities parameter. vLLM's gpu_worker.py explicitly passes activities=["CPU", "CUDA"]:

self.profiler = TorchProfilerWrapper(
    profiler_config,
    worker_name=worker_name,
    local_rank=self.local_rank,
    activities=["CPU", "CUDA"],  # <-- add this
)

Without it, the torch profiler may not capture CUDA kernels, which defeats the purpose of nsys integration.

        profiler_context = (
            self.profiler.annotate_context_manager("diffusion_forward")
            if self.profiler is not None
            else nullcontext()
        )
        with profiler_context:
Contributor


Good use of annotate_context_manager and step() — this follows vLLM's pattern and gives clean trace segmentation per forward pass.

                output_files["traces"].append(trace_path)
            elif isinstance(trace_path, list):
                output_files["traces"].extend(trace_path)
        successful_traces = len(output_files["traces"])
Contributor


Since workers always return None right now, the entire for rank, res in enumerate(results) loop body is effectively dead code (the if res is None: continue skips everything). This will become useful after fixing the worker's stop_profile() to return the wrapper's result.
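
The fixed path could look like the following pure-Python sketch. Names mirror the excerpt above; the per-worker result shape (None, a single trace path, or a list of paths) is an assumption based on this thread:

```python
def aggregate_traces(results):
    """Collect per-worker trace paths once stop_profile() stops
    discarding the wrapper's return value. Each entry in `results`
    is assumed to be None, a path string, or a list of paths."""
    output_files = {"traces": []}
    for res in results:
        if res is None:
            continue  # worker had no active profiler
        if isinstance(res, str):
            output_files["traces"].append(res)
        elif isinstance(res, list):
            output_files["traces"].extend(res)
    return output_files
```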

        trace_filename = f"stage_{stage_id}_diffusion_{int(time.time())}"
        stage_engine.start_profile(trace_filename=trace_filename)
        logger.info("[Stage-%s] Diffusion Torch profiler started", stage_id)
        profile_dir = os.environ.get("VLLM_TORCH_PROFILER_DIR")
Contributor


nit: The comment # Sync call is safe here was left behind, but now this function is named handle_profiler_task_async. The comment is stale/misleading.

@lishunyang12
Contributor

@ahengljh Hey, aligning the diffusion profiler with vLLM's CudaProfilerWrapper and TorchProfilerWrapper is the right approach — makes nsys and torch profiling work consistently across LLM and diffusion workers. Any blockers on getting this merged?

@hsliuustc0106
Collaborator

Please resolve the conflicts.

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Feb 26, 2026

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Nsight Systems Profiler Support

7 participants