update GpuMemoryMonitor to DeviceMemoryMonitor for all HW #1526
xuechendi wants to merge 4 commits into vllm-project:main
Conversation
@gcanlin @Yikun, for UT, I used some `torch.accelerator.XXX` apis according to our vLLM side discussion.

I'm not sure whether we should introduce As a workaround, we use
hsliuustc0106
left a comment
Summary
This PR refactors GPUMemoryMonitor to DeviceMemoryMonitor to support multiple hardware platforms (GPU, NPU, XPU) instead of being CUDA-specific. The changes align with issue #1488's goal of enabling CPU offloading across all devices.
Key Issues & Recommendations
1. Inconsistent use of torch.accelerator API
The PR mixes two approaches:
- Uses torch.accelerator APIs in test files (e.g., torch.accelerator.current_device_index())
- Uses current_omni_platform APIs in tests/utils.py
Problem: As noted in the PR comments, torch.accelerator support may not be available for all platforms (particularly NPU) until PyTorch 2.9. The upstream vLLM discussion (issue #30679) is still ongoing.
Recommendation: Be consistent and use current_omni_platform throughout for now. Replace:
- torch.accelerator.current_device_index() → current_omni_platform.current_device_index() (or similar)
- torch.accelerator.reset_peak_memory_stats() → platform-specific calls
2. Incomplete migration in test_zimage_parallelism.py
Lines 96-97 still use CUDA-specific APIs:
```python
torch.cuda.empty_cache()
device_index = torch.cuda.current_device()
```

Recommendation: Update to use current_omni_platform like the other test files.
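A hedged sketch of what that replacement could look like. current_omni_platform is stubbed with a mock here (in the repo it would come from vllm_omni.platforms), and the method names are assumptions that mirror the torch.cuda calls they replace:

```python
from unittest import mock

# Stand-in for vllm_omni.platforms.current_omni_platform; the method
# names below are assumptions mirroring the torch.cuda calls they replace.
current_omni_platform = mock.Mock()
current_omni_platform.current_device.return_value = 0

# was: torch.cuda.empty_cache()
current_omni_platform.empty_cache()
# was: device_index = torch.cuda.current_device()
device_index = current_omni_platform.current_device()
print(device_index)  # → 0
```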
3. Missing platform method verification
The code assumes current_omni_platform has mem_get_info() and device() methods, but these aren't verified to exist across all platforms.
Recommendation: Check the platform interface definition to ensure these methods are implemented for NPU/XPU, or add fallback logic.
4. Factory pattern implementation
The DeviceMemoryMonitor.instantiate() factory method is good, but the base class still tries to use current_omni_platform APIs that may not work uniformly.
Recommendation: Consider making the base class abstract or ensuring it only handles CUDA/GPU cases, with NPU/XPU subclasses handling their specific implementations.
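One possible shape for the second option — an abstract base with per-platform subclasses — sketched under the assumption that each backend exposes a torch-style mem_get_info; class and method names beyond those mentioned in the PR are illustrative, not the actual implementation:

```python
from abc import ABC, abstractmethod

class DeviceMemoryMonitor(ABC):
    @classmethod
    def instantiate(cls, platform_name: str, **kwargs) -> "DeviceMemoryMonitor":
        # Factory: dispatch to the platform-specific subclass.
        registry = {"cuda": CUDAMemoryMonitor,
                    "npu": NPUMemoryMonitor,
                    "xpu": XPUMemoryMonitor}
        try:
            return registry[platform_name](**kwargs)
        except KeyError:
            raise ValueError(f"unsupported platform: {platform_name}") from None

    @abstractmethod
    def mem_get_info(self) -> tuple:
        """Return (free_bytes, total_bytes) for the current device."""

class CUDAMemoryMonitor(DeviceMemoryMonitor):
    def mem_get_info(self):
        import torch  # imported lazily; only needed when actually polling
        return torch.cuda.mem_get_info()

class NPUMemoryMonitor(DeviceMemoryMonitor):
    def mem_get_info(self):
        import torch  # assumes torch_npu registers torch.npu
        return torch.npu.mem_get_info()

class XPUMemoryMonitor(DeviceMemoryMonitor):
    def mem_get_info(self):
        import torch
        return torch.xpu.mem_get_info()
```

With this shape, the base class never touches a device API itself, so no platform can hit a missing-method failure at runtime.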
5. Test coverage
The PR only tests on XPU. NPU and other platforms should also be tested to ensure the refactoring works correctly.
Minor Issues
- The torch.Generator device type change is correct, but ensure current_omni_platform.device_type returns the right string format
- Consider adding error handling in the monitor loop for platform-specific API failures
The core refactoring approach is sound, but the inconsistent API usage needs to be resolved before merging.
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
hsliuustc0106
left a comment
Summary
This PR refactors GPUMemoryMonitor to DeviceMemoryMonitor to support multiple hardware platforms (GPU, NPU, XPU). The core approach is sound, but there are several issues that need to be addressed.
Pros:
- Good use of factory pattern with instantiate() method
- Proper subclassing for NPU and XPU specific implementations
- Aligns with the goal of platform-agnostic CPU offloading

Cons:
- Inconsistent API usage between torch.accelerator and current_omni_platform
- Incomplete migration in test_zimage_parallelism.py
- Missing verification that platform methods exist across all platforms
Recommendation: Request changes to address the API consistency issues before merging.
tests/utils.py (Outdated)

```python
if current_omni_platform.is_npu():
    return NPUMemoryMonitor(**kwargs)
elif current_omni_platform.is_xpu():
    return XPUMemoryMonitor(**kwargs)
```
Question: Platform method availability
Does current_omni_platform have mem_get_info() and device() methods implemented for all platforms (GPU, NPU, XPU)? If not, this will fail at runtime for unsupported platforms.
Consider adding error handling or verifying the platform interface includes these methods.
right, seems there is no existing mem_get_info() in current_omni_platform, I have reverted the suggested change from @gcanlin , now it has separate class for NPU and XPU
@gcanlin , quick question, do you suggest to add all these torch apis to current_omni_platform?
Actually, __getattr__ in Platform is helping this. The only downside is that Python LSP can't easily get the api definition. But it can work and help us avoid adding all torch apis.
For torch.accelerator, I think we should wait until vLLM officially completes the integration before we consider doing it, for stability concerns.
```python
def __getattr__(self, key: str):
    device = getattr(torch, self.device_type, None)
    if device is not None and hasattr(device, key):
        attr = getattr(device, key)
        # NOTE: `hasattr(device, key)=True` can only avoid AttributeError,
        # but the value of this attr could be `None`.
        if attr is not None:
            return attr
    logger.warning(
        "Current platform %s does not have '%s' attribute.",
        self.device_type,
        key,
    )
    return None
```
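A torch-free toy of the forwarding above, with a stand-in namespace playing the role of torch.cuda / torch.npu (all names here are illustrative):

```python
import logging
import types

logger = logging.getLogger(__name__)

# Stand-in for the torch.<device_type> module.
fake_device = types.SimpleNamespace(
    mem_get_info=lambda: (8 << 30, 16 << 30),  # pretend 8 GiB free of 16 GiB
    graph_pool_handle=None,  # present but None: the case the NOTE warns about
)

class ToyPlatform:
    device_type = "fake"

    def __getattr__(self, key):
        # Called only for attributes the class itself does not define.
        if hasattr(fake_device, key):
            attr = getattr(fake_device, key)
            if attr is not None:
                return attr
        logger.warning("Platform %s has no '%s' attribute.", self.device_type, key)
        return None

p = ToyPlatform()
print(p.mem_get_info())     # forwarded call: (8589934592, 17179869184)
print(p.graph_pool_handle)  # None, with a warning logged
```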
Oh, I see, Let me check which one works with current_platform. But NPU is now OOT in vLLM side, it also works right?
Yes, it works. Because NPUPlatform in vllm-ascend inherits Platform in vllm. And NPUOmniPlatform inherits NPUPlatform and OmniPlatform. We consider OmniPlatform as an extension of vllm's Platform, which adds some specific api for diffusion models.
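The inheritance chain described above can be rendered as a toy sketch (the real classes live in vllm, vllm-ascend, and vllm-omni; the bodies here are placeholders, not the actual code):

```python
class Platform:                       # vllm
    device_type = "cuda"

class NPUPlatform(Platform):          # vllm-ascend (out-of-tree)
    device_type = "npu"

class OmniPlatform(Platform):         # vllm-omni: diffusion-specific extras
    def is_npu(self) -> bool:
        return self.device_type == "npu"

class NPUOmniPlatform(NPUPlatform, OmniPlatform):
    pass

# Method resolution picks device_type="npu" from NPUPlatform,
# while is_npu() comes from OmniPlatform.
print(NPUOmniPlatform().is_npu())  # → True
```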
Oh, I see, great to know
@gcanlin, I did double check with vLLM side code, the func only forwards attributes from the device module, not torch.
mem_get_info() is a torch api, so the forwarding won't help.
In that case, I would think the current approach of adding limited apis to current_omni_platform and using derived classes for NPUMemoryMonitor and XPUMemoryMonitor might be cleaner (not duplicating torch apis to current_omni_platform)?
Could you try the code below? torch.cuda.xxx / torch.npu.xxx should be equivalent with current_omni_platform.xxx.

```python
from vllm_omni.platforms import current_omni_platform
import torch
print("current_omni_platform.mem_get_info(): {}", current_omni_platform.mem_get_info())
print("torch.cuda.mem_get_info(): {}", torch.cuda.mem_get_info())
```

Output:

```
current_omni_platform.mem_get_info(): {} (84530692096, 84974239744)
torch.cuda.mem_get_info(): {} (84530692096, 84974239744)
```
```diff
-class GPUMemoryMonitor:
+class DeviceMemoryMonitor:
     """Poll global device memory usage via CUDA APIs."""
```
Suggestion: Consider making base class GPU-specific
Since the base DeviceMemoryMonitor class uses current_omni_platform APIs that may not work uniformly across all platforms, consider either:
- Making it handle only GPU/CUDA cases explicitly
- Making it abstract and requiring all platforms to subclass
- Adding platform capability checks before using platform-specific APIs
This would make the design more robust.
@gcanlin @hsliuustc0106 @tjtanaa Now the change is very simple and clean. Big thanks to @gcanlin.

Here are my local examples: Test on XPU, Test on CUDA, and @gcanlin can confirm from NPU side. Also did a trace back to vLLM.
lishunyang12
left a comment
Left a minor note on test_zimage_parallelism.py. The core changes look good.
Nit: test_zimage_tensor_parallel_tp2 and test_zimage_vae_patch_parallel_tp2 still have torch.cuda.is_available() / torch.cuda.device_count() in their skip guards. Not a blocker since those tests are explicitly CUDA-only, but worth a follow-up if XPU/NPU should run them too.
Got it, I have updated cuda to current_omni_platform.
@hsliuustc0106, may I get a second review on this PR? Thanks.
Purpose
#1488 enabled cpu_offloading for all devices, and this PR follows up to generalize GpuMemoryMonitor as DeviceMemoryMonitor for all platforms.

GpuMemoryMonitor uses CUDA pytorch apis only. This PR updates GpuMemoryMonitor to DeviceMemoryMonitor, so different platforms can use their non-cuda pytorch apis.
Test Plan
tested on XPU with
to confirm the DeviceMemoryMonitor's effectiveness.
Test Result