[Profiler] Add Nsight Systems support for serving #1098
ahengljh wants to merge 14 commits into vllm-project:main from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b23aa54006
vllm_omni/entrypoints/omni_stage.py (outdated)

```python
if task_type == OmniStageTaskType.PROFILER_START:
    # Signal nsys to begin capturing (no-op if not under nsys)
    try:
        torch.cuda.profiler.start()
        logger.info("[Stage-%s] CUDA profiler started (nsys capture region open)", stage_id)
```
Start CUDA profiler inside diffusion worker processes
This torch.cuda.profiler.start() call runs only in the stage worker process. For diffusion, the actual GPU kernels execute in subprocesses spawned by the diffusion executor (e.g., MultiprocDiffusionExecutor → WorkerProc), and those workers never call cudaProfilerStart. With --capture-range=cudaProfilerApi, nsys opens capture ranges per process, so the capture ranges in the child processes that actually do the CUDA work never open, and the nsys report ends up empty for diffusion workloads. Consider invoking torch.cuda.profiler.start()/stop() in DiffusionWorker.start_profile/stop_profile (or via the RPC path) so the capture range opens in the GPU worker processes.
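The point above can be sketched as follows. This is a minimal, hypothetical shape of DiffusionWorker.start_profile/stop_profile; the class body and the import guard are illustrative, not the PR's exact code:

```python
class DiffusionWorker:
    """Hypothetical sketch: the nsys capture range must be opened in the
    process that launches the CUDA kernels, not in the orchestrator."""

    def start_profile(self):
        # With --capture-range=cudaProfilerApi, nsys tracks capture ranges
        # per process, so this must run inside the GPU worker subprocess.
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.profiler.start()
        except ImportError:
            pass  # no torch / non-CUDA platform: no-op

    def stop_profile(self):
        # Close the capture range in the same process that opened it.
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.profiler.stop()
        except ImportError:
            pass
```

Outside an nsys session, cudaProfilerStart/Stop are harmless no-ops, so these hooks are safe to call unconditionally.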
@lishunyang12 @ZJY0516 PTAL if free, thx

provide an e2e example please

I would recommend splitting this PR into two: one for online serving profiling, and another for the nsys integration.
```python
    launched with ``--capture-range=cudaProfilerApi``) records GPU
    activity from within this worker process.
    """
    if torch.cuda.is_available():
```
Having to check whether CUDA is available every single time adds a lot of noise to the code.
@gcanlin Do you have any suggestion?
Any suggestion here? I am not sure; alternatively, I can revert these and keep them as the previous version.

For omni models, since we reuse vLLM's profiler for now, I don't think we need to add anything to support the CUDA profiler; it is already supported in gpu_worker.py. If we also need it for diffusion, we can refer to vLLM's wrapper implementation. Before that, maybe we should consider unifying the profiler between omni models and diffusion models. cc @lishunyang12
```python
class Worker(WorkerBase):
    # Torch/CUDA profiler. Enabled and configured through profiler_config.
    self.profiler: Any | None = None
    profiler_config = vllm_config.profiler_config
    if profiler_config.profiler == "torch":
        worker_name = f"{vllm_config.instance_id}-rank-{self.rank}"
        self.profiler = TorchProfilerWrapper(
            profiler_config,
            worker_name=worker_name,
            local_rank=self.local_rank,
            activities=["CPU", "CUDA"],
        )
    elif profiler_config.profiler == "cuda":
        self.profiler = CudaProfilerWrapper(profiler_config)
    else:
        self.profiler = None
```
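Unifying on this selection logic could look like a small shared factory. This is a sketch with stand-in wrapper classes; the real ones are vLLM's TorchProfilerWrapper and CudaProfilerWrapper, and create_profiler is a hypothetical name:

```python
class TorchProfilerWrapper:  # stand-in for vLLM's wrapper
    def __init__(self, worker_name, activities):
        self.worker_name = worker_name
        self.activities = activities

class CudaProfilerWrapper:  # stand-in for vLLM's wrapper
    pass

def create_profiler(profiler, rank):
    """Hypothetical shared factory so omni (LLM) and diffusion workers
    resolve profiler_config.profiler through one code path."""
    if profiler == "torch":
        return TorchProfilerWrapper(
            worker_name=f"rank-{rank}",
            activities=["CPU", "CUDA"],  # needed to capture CUDA kernels
        )
    if profiler == "cuda":
        return CudaProfilerWrapper()
    return None  # profiling disabled
```

Both worker types could then call the same factory at init time instead of duplicating the if/elif chain.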
BTW, #1136 is implementing online profiling :)

I see, I'll remove the online profiling part.

Can you provide test results? A trace graph attached to the description would be good.
Add CudaProfiler class and HTTP /start_profile, /stop_profile endpoints so that nsys can capture GPU-level traces during online serving via the cudaProfilerApi capture range. Both sync and async stage workers now call torch.cuda.profiler.start()/stop() alongside the existing torch profiler.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

Move /start_profile and /stop_profile from the module-level router to direct app registration via _register_profiling_routes(), called after build_app() returns. This ensures the routes exist on the app regardless of how vllm's build_app() handles router inclusion.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

vllm's build_app() only registers /start_profile and /stop_profile when profiler_config is explicitly set via CLI. For the omni server we always want these endpoints available so nsys profiling can be triggered via HTTP. Replace custom route handlers with a simple unconditional include of vllm's existing profile router.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

- Guard torch.cuda.profiler calls with torch.cuda.is_available() so non-CUDA platforms (ROCm, NPU, XPU) get no-ops instead of crashes
- Add torch.cuda.profiler.start()/stop() inside DiffusionWorker.start_profile/stop_profile so nsys captures GPU activity in the actual diffusion worker subprocesses
- Restructure profiling docs: move nsys online serving section to the top as the primary workflow, remove duplicate section
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

vLLM's Worker already has built-in CudaProfilerWrapper support that handles torch.cuda.profiler.start()/stop() in GPU worker processes when profiler_config.profiler == "cuda". The manual calls in omni_stage.py were redundant for LLM stages and ran in the wrong process (orchestrator instead of GPU worker). CUDA profiler calls remain in DiffusionWorker.start_profile/stop_profile since diffusion workers don't use vLLM's Worker infrastructure.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

Remove HTTP /start_profile and /stop_profile endpoint registration from api_server.py as someone else is handling online profiling. This PR now focuses purely on nsys integration for diffusion workers:
- CudaProfiler class with platform guards
- torch.cuda.profiler calls in DiffusionWorker.start_profile/stop_profile
- Updated docs for nsys usage with offline diffusion scripts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

Per review feedback: vLLM already has CudaProfilerWrapper that should be reused for unified profiler infrastructure. Remove the separate CudaProfiler class and keep only the direct torch.cuda.profiler calls in DiffusionWorker.start_profile/stop_profile as a minimal integration.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

Use vLLM's CudaProfilerWrapper/TorchProfilerWrapper in DiffusionWorker instead of custom implementation. This unifies the profiler approach between omni models and diffusion models.
- Import and use vLLM's profiler wrappers based on profiler_config
- VLLM_TORCH_CUDA_PROFILE=1 enables CudaProfilerWrapper for nsys
- VLLM_TORCH_PROFILER_DIR enables TorchProfilerWrapper for traces
- Remove dependency on CurrentProfiler from diffusion profiler module
- Update docs with vLLM-style nsys usage
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>

Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
Signed-off-by: Jinheng <leeebucks@gmail.com>
I'd love to, but all screenshot uploads are blocked by our company's network policy...
```python
        logger.info("Diffusion worker %s: profiler stopped", self.rank)
        return None

    def execute_model(self, req: OmniDiffusionRequest, od_config: OmniDiffusionConfig) -> DiffusionOutput:
```
stop_profile() always returns None, which means DiffusionEngine.stop_profile() never gets any trace paths from workers. The elaborate aggregation logic in the engine becomes dead code.
TorchProfilerWrapper.stop() returns a dict with trace file paths — please return that result instead of discarding it:
```python
def stop_profile(self) -> dict | None:
    if self.profiler is not None:
        return self.profiler.stop()
    return None
```

```python
    """
    if self.profiler is not None:
        self.profiler.start()
        logger.info("Diffusion worker %s: profiler started", self.rank)
```
The trace_path_template parameter is accepted but never used — vLLM's wrappers get their paths from profiler_config at init time. This is confusing for callers. Consider removing it entirely or at minimum documenting that it's ignored.
```python
worker_name = f"diffusion-rank-{self.rank}"
self.profiler = TorchProfilerWrapper(
    profiler_config,
    worker_name=worker_name,
```
Missing activities parameter. vLLM's gpu_worker.py explicitly passes activities=["CPU", "CUDA"]:

```python
self.profiler = TorchProfilerWrapper(
    profiler_config,
    worker_name=worker_name,
    local_rank=self.local_rank,
    activities=["CPU", "CUDA"],  # <-- add this
)
```

Without it, the torch profiler may not capture CUDA kernels, which defeats the purpose of nsys integration.
```python
profiler_context = (
    self.profiler.annotate_context_manager("diffusion_forward") if self.profiler is not None else nullcontext()
)
with profiler_context:
```
Good use of annotate_context_manager and step() — this follows vLLM's pattern and gives clean trace segmentation per forward pass.
```python
        output_files["traces"].append(trace_path)
    elif isinstance(trace_path, list):
        output_files["traces"].extend(trace_path)
successful_traces = len(output_files["traces"])
```
Since workers always return None right now, the entire for rank, res in enumerate(results) loop body is effectively dead code (the if res is None: continue skips everything). This will become useful after fixing the worker's stop_profile() to return the wrapper's result.
```python
trace_filename = f"stage_{stage_id}_diffusion_{int(time.time())}"
stage_engine.start_profile(trace_filename=trace_filename)
logger.info("[Stage-%s] Diffusion Torch profiler started", stage_id)
profile_dir = os.environ.get("VLLM_TORCH_PROFILER_DIR")
```
nit: The comment # Sync call is safe here was left behind, but now this function is named handle_profiler_task_async. The comment is stale/misleading.
@ahengljh Hey, aligning the diffusion profiler with vLLM's CudaProfilerWrapper and TorchProfilerWrapper is the right approach — makes nsys and torch profiling work consistently across LLM and diffusion workers. Any blockers on getting this merged?

resolve conflicts
Summary

Related to #677

Follow vLLM's profiler pattern for diffusion workers — use CudaProfilerWrapper and TorchProfilerWrapper from vLLM instead of a custom implementation.

How It Works

Diffusion workers now use the same profiler infrastructure as vLLM's LLM workers:

- VLLM_TORCH_CUDA_PROFILE=1 → uses CudaProfilerWrapper for nsys integration
- VLLM_TORCH_PROFILER_DIR=./profiles → uses TorchProfilerWrapper for detailed traces

Nsys usage:

```shell
export VLLM_TORCH_CUDA_PROFILE=1
nsys profile \
  --capture-range=cudaProfilerApi \
  --capture-range-end=repeat \
  --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  -o diffusion_trace \
  python image_to_video.py --model Wan-AI/Wan2.2-I2V-A14B-Diffusers ...
```

Files Changed

- vllm_omni/diffusion/worker/diffusion_worker.py: use vLLM's profiler wrappers based on profiler_config
- docs/contributing/profiling.md: document nsys usage with VLLM_TORCH_CUDA_PROFILE=1

Test Results