update GpuMemoryMonitor to DeviceMemoryMonitor for all HW #1526
xuechendi wants to merge 4 commits into vllm-project:main
Conversation
@gcanlin @Yikun, for UT, I used some `torch.accelerator.XXX` apis according to our vLLM side discussion.

I'm not sure whether we should introduce As a workaround, we use
hsliuustc0106
left a comment
Summary
This PR refactors GPUMemoryMonitor to DeviceMemoryMonitor to support multiple hardware platforms (GPU, NPU, XPU) instead of being CUDA-specific. The changes align with issue #1488's goal of enabling CPU offloading across all devices.
Key Issues & Recommendations
1. Inconsistent use of torch.accelerator API
The PR mixes two approaches:
- Uses torch.accelerator APIs in test files (e.g., torch.accelerator.current_device_index())
- Uses current_omni_platform APIs in tests/utils.py
Problem: As noted in the PR comments, torch.accelerator support may not be available for all platforms (particularly NPU) until PyTorch 2.9. The upstream vLLM discussion (issue #30679) is still ongoing.
Recommendation: Be consistent and use current_omni_platform throughout for now. Replace:
- torch.accelerator.current_device_index() → current_omni_platform.current_device_index() (or similar)
- torch.accelerator.reset_peak_memory_stats() → platform-specific calls
2. Incomplete migration in test_zimage_parallelism.py
Lines 96-97 still use CUDA-specific APIs:
```python
torch.cuda.empty_cache()
device_index = torch.cuda.current_device()
```

Recommendation: Update to use current_omni_platform like the other test files.
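A hedged sketch of what that replacement could look like. current_omni_platform is stubbed with a mock here (in the repo it would come from vllm_omni.platforms), and the method names are assumptions that mirror the torch.cuda calls they replace:

```python
from unittest import mock

# Stand-in for vllm_omni.platforms.current_omni_platform; the method
# names below are assumptions mirroring the torch.cuda calls they replace.
current_omni_platform = mock.Mock()
current_omni_platform.current_device.return_value = 0

# was: torch.cuda.empty_cache()
current_omni_platform.empty_cache()
# was: device_index = torch.cuda.current_device()
device_index = current_omni_platform.current_device()
print(device_index)  # → 0
```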
3. Missing platform method verification
The code assumes current_omni_platform has mem_get_info() and device() methods, but these aren't verified to exist across all platforms.
Recommendation: Check the platform interface definition to ensure these methods are implemented for NPU/XPU, or add fallback logic.
4. Factory pattern implementation
The DeviceMemoryMonitor.instantiate() factory method is good, but the base class still tries to use current_omni_platform APIs that may not work uniformly.
Recommendation: Consider making the base class abstract or ensuring it only handles CUDA/GPU cases, with NPU/XPU subclasses handling their specific implementations.
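One possible shape for the second option — an abstract base with per-platform subclasses — sketched under the assumption that each backend exposes a torch-style mem_get_info; class and method names beyond those mentioned in the PR are illustrative, not the actual implementation:

```python
from abc import ABC, abstractmethod

class DeviceMemoryMonitor(ABC):
    @classmethod
    def instantiate(cls, platform_name: str, **kwargs) -> "DeviceMemoryMonitor":
        # Factory: dispatch to the platform-specific subclass.
        registry = {"cuda": CUDAMemoryMonitor,
                    "npu": NPUMemoryMonitor,
                    "xpu": XPUMemoryMonitor}
        try:
            return registry[platform_name](**kwargs)
        except KeyError:
            raise ValueError(f"unsupported platform: {platform_name}") from None

    @abstractmethod
    def mem_get_info(self) -> tuple:
        """Return (free_bytes, total_bytes) for the current device."""

class CUDAMemoryMonitor(DeviceMemoryMonitor):
    def mem_get_info(self):
        import torch  # imported lazily; only needed when actually polling
        return torch.cuda.mem_get_info()

class NPUMemoryMonitor(DeviceMemoryMonitor):
    def mem_get_info(self):
        import torch  # assumes torch_npu registers torch.npu
        return torch.npu.mem_get_info()

class XPUMemoryMonitor(DeviceMemoryMonitor):
    def mem_get_info(self):
        import torch
        return torch.xpu.mem_get_info()
```

With this shape, the base class never touches a device API itself, so no platform can hit a missing-method failure at runtime.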
5. Test coverage
The PR only tests on XPU. NPU and other platforms should also be tested to ensure the refactoring works correctly.
Minor Issues
- The torch.Generator device type change is correct, but ensure current_omni_platform.device_type returns the right string format
- Consider adding error handling in the monitor loop for platform-specific API failures
The core refactoring approach is sound, but the inconsistent API usage needs to be resolved before merging.
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
hsliuustc0106
left a comment
Summary
This PR refactors GPUMemoryMonitor to DeviceMemoryMonitor to support multiple hardware platforms (GPU, NPU, XPU). The core approach is sound, but there are several issues that need to be addressed.
Pros:
- Good use of factory pattern with instantiate() method
- Proper subclassing for NPU and XPU specific implementations
- Aligns with the goal of platform-agnostic CPU offloading

Cons:
- Inconsistent API usage between torch.accelerator and current_omni_platform
- Incomplete migration in test_zimage_parallelism.py
- Missing verification that platform methods exist across all platforms
Recommendation: Request changes to address the API consistency issues before merging.
tests/utils.py (Outdated)

```python
if current_omni_platform.is_npu():
    return NPUMemoryMonitor(**kwargs)
elif current_omni_platform.is_xpu():
    return XPUMemoryMonitor(**kwargs)
```
Question: Platform method availability
Does current_omni_platform have mem_get_info() and device() methods implemented for all platforms (GPU, NPU, XPU)? If not, this will fail at runtime for unsupported platforms.
Consider adding error handling or verifying the platform interface includes these methods.
right, seems there is no existing mem_get_info() in current_omni_platform, I have reverted the suggested change from @gcanlin , now it has separate class for NPU and XPU
@gcanlin , quick question, do you suggest to add all these torch apis to current_omni_platform?
Actually, __getattr__ in Platform is helping this. The only downside is that Python LSP can't easily get the api definition. But it can work and help us avoid adding all torch apis.
For torch.accelerator, I think we should wait until vLLM officially completes the integration before we consider doing it, for stability concerns.
```python
def __getattr__(self, key: str):
    device = getattr(torch, self.device_type, None)
    if device is not None and hasattr(device, key):
        attr = getattr(device, key)
        # NOTE: `hasattr(device, key)=True` can only avoid AttributeError,
        # but the value of this attr could be `None`.
        if attr is not None:
            return attr
    logger.warning(
        "Current platform %s does not have '%s' attribute.",
        self.device_type,
        key,
    )
    return None
```
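A torch-free toy of the forwarding above, with a stand-in namespace playing the role of torch.cuda / torch.npu (all names here are illustrative):

```python
import logging
import types

logger = logging.getLogger(__name__)

# Stand-in for the torch.<device_type> module.
fake_device = types.SimpleNamespace(
    mem_get_info=lambda: (8 << 30, 16 << 30),  # pretend 8 GiB free of 16 GiB
    graph_pool_handle=None,  # present but None: the case the NOTE warns about
)

class ToyPlatform:
    device_type = "fake"

    def __getattr__(self, key):
        # Called only for attributes the class itself does not define.
        if hasattr(fake_device, key):
            attr = getattr(fake_device, key)
            if attr is not None:
                return attr
        logger.warning("Platform %s has no '%s' attribute.", self.device_type, key)
        return None

p = ToyPlatform()
print(p.mem_get_info())     # forwarded call: (8589934592, 17179869184)
print(p.graph_pool_handle)  # None, with a warning logged
```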
Oh, I see, Let me check which one works with current_platform. But NPU is now OOT in vLLM side, it also works right?
Yes, it works. Because NPUPlatform in vllm-ascend inherits Platform in vllm. And NPUOmniPlatform inherits NPUPlatform and OmniPlatform. We consider OmniPlatform as an extension of vllm's Platform, which adds some specific api for diffusion models.
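The inheritance chain described above can be rendered as a toy sketch (the real classes live in vllm, vllm-ascend, and vllm-omni; the bodies here are placeholders, not the actual code):

```python
class Platform:                       # vllm
    device_type = "cuda"

class NPUPlatform(Platform):          # vllm-ascend (out-of-tree)
    device_type = "npu"

class OmniPlatform(Platform):         # vllm-omni: diffusion-specific extras
    def is_npu(self) -> bool:
        return self.device_type == "npu"

class NPUOmniPlatform(NPUPlatform, OmniPlatform):
    pass

# Method resolution picks device_type="npu" from NPUPlatform,
# while is_npu() comes from OmniPlatform.
print(NPUOmniPlatform().is_npu())  # → True
```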
Oh, I see, great to know
@gcanlin, I did double check with vLLM side code, the func only forwards attributes from the device module, not torch.
mem_get_info() is a torch api, so the forwarding won't help.
In that case, I would think the current approach of adding limited apis to current_omni_platform and using derived classes for NPUMemoryMonitor and XPUMemoryMonitor might be cleaner (not duplicating torch apis to current_omni_platform)?
Could you try the code below? torch.cuda.xxx / torch.npu.xxx should be equivalent with current_omni_platform.xxx.

```python
from vllm_omni.platforms import current_omni_platform
import torch
print("current_omni_platform.mem_get_info(): {}", current_omni_platform.mem_get_info())
print("torch.cuda.mem_get_info(): {}", torch.cuda.mem_get_info())
```

Output:

```
current_omni_platform.mem_get_info(): {} (84530692096, 84974239744)
torch.cuda.mem_get_info(): {} (84530692096, 84974239744)
```
```diff
-class GPUMemoryMonitor:
+class DeviceMemoryMonitor:
     """Poll global device memory usage via CUDA APIs."""
```
Suggestion: Consider making base class GPU-specific
Since the base DeviceMemoryMonitor class uses current_omni_platform APIs that may not work uniformly across all platforms, consider either:
- Making it handle only GPU/CUDA cases explicitly
- Making it abstract and requiring all platforms to subclass
- Adding platform capability checks before using platform-specific APIs
This would make the design more robust.
@gcanlin @hsliuustc0106 @tjtanaa Now the change is very simple and clean. Big thanks to @gcanlin.

Here are my local examples: Test on XPU, Test on CUDA, and @gcanlin can confirm from NPU side. Also did a trace back to vLLM.
lishunyang12
left a comment
Left a minor note on test_zimage_parallelism.py. The core changes look good.
Nit: test_zimage_tensor_parallel_tp2 and test_zimage_vae_patch_parallel_tp2 still have torch.cuda.is_available() / torch.cuda.device_count() in their skip guards. Not a blocker since those tests are explicitly CUDA-only, but worth a follow-up if XPU/NPU should run them too.
Got it, I have updated cuda to current_omni_platform.
@hsliuustc0106, may I get a second review on this PR? Thanks.
Purpose
#1488 enabled cpu_offloading for all devices, and this PR follows up to generalize GpuMemoryMonitor as DeviceMemoryMonitor for all platforms.

GpuMemoryMonitor uses CUDA pytorch apis only. This PR updates GpuMemoryMonitor to DeviceMemoryMonitor, so different platforms can use their non-cuda pytorch apis.
Test Plan
tested on XPU with
to confirm the DeviceMemoryMonitor's effectiveness.
Test Result