
update GpuMemoryMonitor to DeviceMemoryMonitor for all HW#1526

Open
xuechendi wants to merge 4 commits into vllm-project:main from xuechendi:device_memory_monitor

Conversation

@xuechendi
Contributor

@xuechendi xuechendi commented Feb 26, 2026


Purpose

#1488 enabled cpu_offloading for all devices.
This follow-up PR generalizes GpuMemoryMonitor into DeviceMemoryMonitor for all platforms.

GpuMemoryMonitor uses CUDA-only PyTorch APIs.
This PR renames GpuMemoryMonitor to DeviceMemoryMonitor so that each platform can use its own non-CUDA PyTorch APIs.
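As a rough sketch of the generalization (the class shape and the threading details here are illustrative, not the PR's exact implementation): the key idea is that the monitor takes a platform-agnostic probe such as `current_omni_platform.mem_get_info`, which dispatches to `torch.cuda` / `torch.xpu` / `torch.npu`, instead of hard-coding CUDA calls.

```python
import threading
import time


class DeviceMemoryMonitor:
    """Illustrative sketch of a platform-agnostic memory monitor.

    The device probe is injected (e.g. current_omni_platform.mem_get_info)
    rather than calling torch.cuda APIs directly.
    """

    def __init__(self, mem_get_info, poll_interval_s=0.5):
        # mem_get_info() must return (free_bytes, total_bytes)
        self._mem_get_info = mem_get_info
        self._poll_interval_s = poll_interval_s
        self.peak_used_bytes = 0
        self._stop = threading.Event()
        self._thread = None

    def _poll_once(self):
        free, total = self._mem_get_info()
        used = total - free
        self.peak_used_bytes = max(self.peak_used_bytes, used)
        return used

    def start(self):
        def _run():
            while not self._stop.is_set():
                self._poll_once()
                time.sleep(self._poll_interval_s)

        self._thread = threading.Thread(target=_run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```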

Test Plan

Tested on XPU with

pytest ../tests/e2e/offline_inference/test_diffusion_cpu_offload.py
pytest ../tests/e2e/offline_inference/test_diffusion_layerwise_offload.py -s -v

to confirm the effectiveness of DeviceMemoryMonitor.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste a results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 396bf17a2d


@xuechendi xuechendi force-pushed the device_memory_monitor branch from 396bf17 to dcc1bca Compare February 26, 2026 19:24
@xuechendi xuechendi force-pushed the device_memory_monitor branch 2 times, most recently from 6c98042 to e860a12 Compare February 26, 2026 19:51
@xuechendi
Contributor Author

@gcanlin @Yikun, for UT, I used some `torch.accelerator.XXX` APIs according to our vLLM-side discussion:
vllm-project/vllm#30679 (comment)

@gcanlin
Contributor

gcanlin commented Feb 27, 2026

I'm not sure whether we should introduce torch.accelerator before vLLM does. Maybe we should wait until upstream #30679 is done. cc @jikunshang

As a workaround, we use current_omni_platform to dispatch to torch.cuda, torch.npu, and torch.xpu. If necessary, we could create an interface in platforms/interface.py.

@xuechendi
Contributor Author

xuechendi commented Feb 27, 2026

I'm not sure whether we should introduce torch.accelerator before vLLM does. Maybe we should wait until upstream #30679 is done. cc @jikunshang

The torch.accelerator APIs supported in PyTorch 2.9 are listed here: https://docs.pytorch.org/docs/2.9/accelerator.html
But yes, if it is not supported for NPU, I can change to current_omni_platform.

@xuechendi xuechendi force-pushed the device_memory_monitor branch from a9ff53f to 6e8e32a Compare February 27, 2026 01:16
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Summary

This PR refactors GPUMemoryMonitor to DeviceMemoryMonitor to support multiple hardware platforms (GPU, NPU, XPU) instead of being CUDA-specific. The changes align with issue #1488's goal of enabling CPU offloading across all devices.

Key Issues & Recommendations

1. Inconsistent use of torch.accelerator API

The PR mixes two approaches:

  • Uses torch.accelerator APIs in test files (e.g., torch.accelerator.current_device_index())
  • Uses current_omni_platform APIs in tests/utils.py

Problem: As noted in the PR comments, torch.accelerator support may not be available for all platforms (particularly NPU) until PyTorch 2.9. The upstream vLLM discussion (issue #30679) is still ongoing.

Recommendation: Be consistent and use current_omni_platform throughout for now. Replace:

  • torch.accelerator.current_device_index() → current_omni_platform.current_device_index() (or similar)
  • torch.accelerator.reset_peak_memory_stats() → platform-specific calls

2. Incomplete migration in test_zimage_parallelism.py

Lines 96-97 still use CUDA-specific APIs:

torch.cuda.empty_cache()
device_index = torch.cuda.current_device()

Recommendation: Update to use current_omni_platform like the other test files.

3. Missing platform method verification

The code assumes current_omni_platform has mem_get_info() and device() methods, but these aren't verified to exist across all platforms.

Recommendation: Check the platform interface definition to ensure these methods are implemented for NPU/XPU, or add fallback logic.

4. Factory pattern implementation

The DeviceMemoryMonitor.instantiate() factory method is good, but the base class still tries to use current_omni_platform APIs that may not work uniformly.

Recommendation: Consider making the base class abstract or ensuring it only handles CUDA/GPU cases, with NPU/XPU subclasses handling their specific implementations.

5. Test coverage

The PR only tests on XPU. NPU and other platforms should also be tested to ensure the refactoring works correctly.

Minor Issues

  • The torch.Generator device type change is correct but ensure current_omni_platform.device_type returns the right string format
  • Consider adding error handling in the monitor loop for platform-specific API failures

The core refactoring approach is sound, but the inconsistent API usage needs to be resolved before merging.
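The abstract-base suggestion in point 4 above could be sketched like this. All class and method names here are hypothetical, not the PR's actual code; the real probes would call torch.cuda.mem_get_info() / torch.xpu.mem_get_info(), which the stubs below only stand in for.

```python
from abc import ABC, abstractmethod


class BaseMemoryMonitor(ABC):
    """Hypothetical abstract base: each platform must supply its own probe."""

    @abstractmethod
    def mem_get_info(self):
        """Return (free_bytes, total_bytes) for the current device."""

    def used_bytes(self):
        free, total = self.mem_get_info()
        return total - free


class CUDAMemoryMonitor(BaseMemoryMonitor):
    def mem_get_info(self):
        # real code would call torch.cuda.mem_get_info(); stub values here
        return (2, 10)


class XPUMemoryMonitor(BaseMemoryMonitor):
    def mem_get_info(self):
        # real code would call torch.xpu.mem_get_info(); stub values here
        return (3, 10)


def instantiate(platform: str) -> BaseMemoryMonitor:
    """Hypothetical factory keyed on a platform string, failing loudly
    for platforms without an implementation."""
    monitors = {"cuda": CUDAMemoryMonitor, "xpu": XPUMemoryMonitor}
    try:
        return monitors[platform]()
    except KeyError:
        raise NotImplementedError(f"no memory monitor for {platform!r}")
```

Failing with NotImplementedError at instantiation time, rather than at first use of a missing platform API, addresses the runtime-failure concern in point 3.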

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the device_memory_monitor branch from 6e8e32a to fdd28c4 Compare February 27, 2026 01:32
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Summary

This PR refactors GPUMemoryMonitor to DeviceMemoryMonitor to support multiple hardware platforms (GPU, NPU, XPU). The core approach is sound, but there are several issues that need to be addressed.

Pros:

  • Good use of factory pattern with instantiate() method
  • Proper subclassing for NPU and XPU specific implementations
  • Aligns with the goal of platform-agnostic CPU offloading

Cons:

  • Inconsistent API usage between torch.accelerator and current_omni_platform
  • Incomplete migration in test_zimage_parallelism.py
  • Missing verification that platform methods exist across all platforms

Recommendation: Request changes to address the API consistency issues before merging.

tests/utils.py Outdated
if current_omni_platform.is_npu():
    return NPUMemoryMonitor(**kwargs)
elif current_omni_platform.is_xpu():
    return XPUMemoryMonitor(**kwargs)
Collaborator

Question: Platform method availability

Does current_omni_platform have mem_get_info() and device() methods implemented for all platforms (GPU, NPU, XPU)? If not, this will fail at runtime for unsupported platforms.

Consider adding error handling or verifying the platform interface includes these methods.

Contributor Author

@xuechendi xuechendi Feb 27, 2026

Right, it seems there is no existing mem_get_info() in current_omni_platform. I have reverted the change suggested by @gcanlin; now there are separate classes for NPU and XPU.

Contributor Author

@gcanlin, quick question: do you suggest adding all these torch APIs to current_omni_platform?

Contributor

Actually, __getattr__ in Platform handles this. The only downside is that a Python LSP can't easily resolve the API definition. But it works and saves us from adding all the torch APIs.

For torch.accelerator, I think we should wait until vLLM officially completes the integration before we consider it, for stability reasons.

https://github.com/vllm-project/vllm/blob/d43048ce0585e6c9178f212aa0b7aeed95eb48df/vllm/platforms/interface.py#L590C1-L604C20

def __getattr__(self, key: str):
    device = getattr(torch, self.device_type, None)
    if device is not None and hasattr(device, key):
        attr = getattr(device, key)
        # NOTE: `hasattr(device, key)=True` can only avoid AttributeError,
        # but the value of this attr could be `None`.
        if attr is not None:
            return attr

    logger.warning(
        "Current platform %s does not have '%s' attribute.",
        self.device_type,
        key,
    )
    return None
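To see why this forwarding makes calls like current_omni_platform.mem_get_info() resolve, here is a self-contained toy version of the same pattern. fake_torch stands in for the real torch module (the real lookup is getattr(torch, self.device_type)); everything else is illustrative.

```python
from types import SimpleNamespace

# Stand-in for torch: one namespace per device backend.
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(mem_get_info=lambda: (100, 128)),
)


class Platform:
    """Toy version of the __getattr__ forwarding shown above: any
    attribute missing on the platform object is looked up on the
    device module (torch.cuda / torch.xpu / torch.npu)."""

    device_type = "cuda"  # would be "cuda", "xpu", or "npu"

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails.
        device = getattr(fake_torch, self.device_type, None)
        if device is not None and hasattr(device, key):
            attr = getattr(device, key)
            if attr is not None:
                return attr
        return None  # real code also logs a warning here


platform = Platform()
print(platform.mem_get_info())  # forwarded to fake_torch.cuda.mem_get_info
```

Since __getattr__ is invoked only after normal lookup fails, attributes defined on the platform itself (like device_type) are unaffected; everything else falls through to the device module.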

Contributor Author

@xuechendi xuechendi Feb 27, 2026

Oh, I see. Let me check which one works with current_platform. But NPU is now out-of-tree (OOT) on the vLLM side; does it also work?

Contributor

Oh, I see. Let me check which one works with current_platform. But NPU is now out-of-tree (OOT) on the vLLM side; does it also work?

Yes, it works, because NPUPlatform in vllm-ascend inherits Platform in vllm, and NPUOmniPlatform inherits NPUPlatform and OmniPlatform. We consider OmniPlatform an extension of vllm's Platform that adds some diffusion-model-specific APIs.

Contributor Author

Oh, I see, great to know

Contributor Author

@xuechendi xuechendi Feb 27, 2026

@gcanlin, I double-checked the vLLM-side code; the function only forwards attributes from the device module, not from torch itself.
mem_get_info() is a torch API, so the forwarding won't help.

In that case, I would think the current approach of adding a limited set of APIs to current_omni_platform and using derived classes NPUMemoryMonitor and XPUMemoryMonitor might be cleaner (it avoids duplicating torch APIs into current_omni_platform)?

Contributor

Could you try the code below? torch.cuda.xxx / torch.npu.xxx should be equivalent to current_omni_platform.xxx.

from vllm_omni.platforms import current_omni_platform
import torch
print(f"current_omni_platform.mem_get_info(): {current_omni_platform.mem_get_info()}")
print(f"torch.cuda.mem_get_info(): {torch.cuda.mem_get_info()}")

current_omni_platform.mem_get_info(): (84530692096, 84974239744)
torch.cuda.mem_get_info(): (84530692096, 84974239744)

Contributor Author

Oh, it works! That's interesting. OK, let me check the rest of the APIs.



-class GPUMemoryMonitor:
-    """Poll global device memory usage via CUDA APIs."""
+class DeviceMemoryMonitor:
Collaborator

Suggestion: Consider making base class GPU-specific

Since the base DeviceMemoryMonitor class uses current_omni_platform APIs that may not work uniformly across all platforms, consider either:

  1. Making it handle only GPU/CUDA cases explicitly
  2. Making it abstract and requiring all platforms to subclass
  3. Adding platform capability checks before using platform-specific APIs

This would make the design more robust.

@xuechendi xuechendi force-pushed the device_memory_monitor branch from 2b3f7ed to e2c4a03 Compare February 27, 2026 04:19
@xuechendi
Contributor Author

xuechendi commented Feb 27, 2026

@gcanlin @hsliuustc0106 @tjtanaa
I updated this PR to use current_omni_platform according to @gcanlin's suggestion.
It was a surprise, and interesting to learn, that current_omni_platform forwards all the torch.* APIs.

Now the change is very simple and clean. Big thanks to @gcanlin

Here are my local examples:

Test on XPU

Python 3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from vllm_omni.platforms import current_omni_platform
>>> current_omni_platform.mem_get_info()
(24385683456, 24385683456)
>>> current_omni_platform.device(0)
<torch.xpu.device object at 0x7e5549d1f740>
>>> current_omni_platform.current_device()
0
>>> current_omni_platform.reset_peak_memory_stats()
>>> 

Test on CUDA

Python 3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from vllm_omni.platforms import current_omni_platform
>>> current_omni_platform.device(0)
<torch.cuda.device object at 0x7b3eac70a180>
>>> current_omni_platform.mem_get_info()
(186868629504, 191503138816)
>>> current_omni_platform.current_device()
0
>>> current_omni_platform.reset_peak_memory_stats()
>>> 

and @gcanlin can confirm from NPU side

I also traced back to vLLM's interface.py (class Platform, __getattr__ func) and confirmed that

device = getattr(torch, self.device_type, None)
if device is not None and hasattr(device, key):
    attr = getattr(device, key)
    # NOTE: `hasattr(device, key)=True` can only avoid AttributeError,
    # but the value of this attr could be `None`.
    if attr is not None:
        return attr
forwards all torch.XXX APIs.

Contributor

@gcanlin gcanlin left a comment

LGTM. Thanks!

Contributor

@lishunyang12 lishunyang12 left a comment

Left a minor note on test_zimage_parallelism.py. The core changes look good.

Contributor

Nit: test_zimage_tensor_parallel_tp2 and test_zimage_vae_patch_parallel_tp2 still have torch.cuda.is_available() / torch.cuda.device_count() in their skip guards. Not a blocker since those tests are explicitly CUDA-only, but worth a follow-up if XPU/NPU should run them too.
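A platform-agnostic skip guard for the follow-up could be factored like this. should_skip and its arguments are hypothetical helpers, not existing code; in the tests the values would come from current_omni_platform (e.g. its device_count()) rather than torch.cuda.is_available() / torch.cuda.device_count().

```python
from typing import Optional


def should_skip(device_available: bool, device_count: int,
                needed: int = 2) -> Optional[str]:
    """Hypothetical helper: return a pytest skip reason, or None when
    the parallel test can run on the current platform."""
    if not device_available:
        return "no accelerator available"
    if device_count < needed:
        return f"needs {needed} devices, found {device_count}"
    return None
```

A test would then use `pytest.skip(reason)` (or `pytest.mark.skipif(... is not None, ...)`) with the reason returned here, so the same guard works for CUDA, XPU, and NPU.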

Contributor Author

Got it, I have updated cuda to current_omni_platform.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the device_memory_monitor branch from df95f24 to 5bb122e Compare February 28, 2026 04:16
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi
Contributor Author

@hsliuustc0106, may I get a second review on this PR? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants