[XPU][NPU][ROCM] enable cpu_offloading flag for non_cuda#1488

Merged
hsliuustc0106 merged 12 commits into vllm-project:main from xuechendi:xpu_cpu_offloading
Feb 27, 2026

Conversation

@xuechendi
Contributor


Purpose

Currently, cpu_offloading is only available on CUDA. However, CPU offloading is also a very useful feature on memory-constrained devices such as the Intel Arc B60.

This PR provides a new way to decide whether a given device should expose CPU-offloading capability.

Verified on XPU
After this PR, I can confirm cpu_offloading is effective on XPU:

[Stage-0] INFO 02-25 23:21:58 [diffusers_loader.py:266] Loading weights took 0.99 seconds
[Stage-0] INFO 02-25 23:21:59 [diffusion_model_runner.py:117] Model loading took 0.1617 GiB and 2.941193 seconds
[Stage-0] INFO 02-25 23:21:59 [diffusion_model_runner.py:122] Model runner: Model loaded successfully.
[Stage-0] INFO 02-25 23:21:59 [diffusion_model_runner.py:127]  Enabling offloader backend: ModelLevelOffloadBackend
[Stage-0] DEBUG 02-25 23:22:01 [sequential_backend.py:122] Registered offload hook for Flux2Transformer2DModel
[Stage-0] DEBUG 02-25 23:22:01 [sequential_backend.py:133] Registered offload hook for Qwen3ForCausalLM

[Stage-0] DEBUG 02-25 23:22:01 [sequential_backend.py:78] Swapped: ['Flux2Transformer2DModel'] -> CPU, Qwen3ForCausalLM -> GPU
[Stage-0] DEBUG 02-25 23:22:18 [sequential_backend.py:78] Swapped: ['Qwen3ForCausalLM'] -> CPU, Flux2Transformer2DModel -> GPU

[Stage-0] DEBUG 02-25 23:22:33 [sequential_backend.py:78] Swapped: ['Flux2Transformer2DModel'] -> CPU, Qwen3ForCausalLM -> GPU
[Stage-0] DEBUG 02-25 23:22:36 [sequential_backend.py:78] Swapped: ['Qwen3ForCausalLM'] -> CPU, Flux2Transformer2DModel -> GPU
[Stage-0] DEBUG 02-25 23:22:37 [sequential_backend.py:78] Swapped: ['Qwen3ForCausalLM'] -> CPU, Flux2Transformer2DModel -> GPU
...

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8ed71998f


Contributor

@gcanlin gcanlin left a comment


Thanks, I think it will also work on NPU. Let me test it quickly :)

Comment on lines +58 to 63
if not current_omni_platform.supports_cpu_offload() or current_omni_platform.get_device_count() < 1:
    logger.warning(
        "Current device: %s does not support CPU offloading. Skipping offloading.",
        current_omni_platform.get_device_name(),
    )
    return None
Contributor


I tested CPU offload on NPU and it works now. Could you please remove this check directly? This feature should always be enabled on every platform :)

Suggested change
if not current_omni_platform.supports_cpu_offload() or current_omni_platform.get_device_count() < 1:
    logger.warning(
        "Current device: %s does not support CPU offloading. Skipping offloading.",
        current_omni_platform.get_device_name(),
    )
    return None

Contributor


@gcanlin @hsliuustc0106 @xuechendi is this regular cpu offloading or layerwise cpu offloading?

Contributor

@gcanlin gcanlin Feb 26, 2026


@gcanlin @hsliuustc0106 @xuechendi is this regular cpu offloading or layerwise cpu offloading?

This PR enables the entry point for CPU offloading, which includes layerwise offloading.

Layerwise cpu offloading is enabled by #1492.

Collaborator


So, why do we need to have 2 PRs?

Contributor


So, why do we need to have 2 PRs?

Not really needed; I just opened #1492 to enable CPU offload more broadly. @xuechendi, would you mind if I directly remove this if condition in my PR, which also enables layerwise offload for other platforms? Feel free to merge my PR into this one.

Contributor Author


Got it, I have rebased #1492 onto this PR.
Meanwhile, I prefer to keep the option for future HW support, so I moved the check to the base class with a default of True.
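A minimal sketch of that base-class default (class and method names are assumed from this discussion, not copied from the actual vllm_omni/platforms/interface.py):

```python
class OmniPlatform:
    """Sketch of the platform base class (assumed shape, not the real one)."""

    @classmethod
    def supports_cpu_offload(cls) -> bool:
        # Default to True so future hardware backends get CPU offloading
        # without having to opt in explicitly.
        return True


class HypotheticalNoOffloadPlatform(OmniPlatform):
    """A hypothetical platform that cannot offload overrides the default."""

    @classmethod
    def supports_cpu_offload(cls) -> bool:
        return False
```

With this shape, XPU, CUDA, ROCm, and NPU inherit True for free, and only a platform with a genuine technical limitation needs to override the method.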

@hsliuustc0106
Collaborator

PR Review: [XPU][NPU][RMC] enable cpu_offloading flag for non_cuda (#1488)

Summary

This PR enables CPU offloading capability for non-CUDA platforms (XPU, NPU, ROCm) by introducing a new platform capability method supports_cpu_offload(). Previously, CPU offloading was hardcoded to only work with CUDA devices.

Code Changes Analysis

Positive aspects:

  • Clean abstraction: Replaces hardcoded is_cuda() check with a capability-based approach
  • Minimal changes: Only 21 additions, 2 deletions across 5 files
  • Consistent implementation: All platform classes get the new method
  • Better error messaging: Now shows which device doesn't support offloading

Issues identified:

1. Missing base interface declaration (Critical)

The supports_cpu_offload() method is added to all platform implementations but is NOT declared in the OmniPlatform base class at vllm_omni/platforms/interface.py:20-100. This breaks the interface contract.

Recommendation: Add to interface.py:

@classmethod
def supports_cpu_offload(cls) -> bool:
    """Check if the platform supports CPU offloading for models."""
    raise NotImplementedError

2. NPU platform returns False without justification

In vllm_omni/platforms/npu/platform.py:91, NPU returns False while XPU, CUDA, and ROCm return True. However:

  • The PR description states the author verified it works on XPU
  • A reviewer (gcanlin) commented "I think it will also work on NPU. Let me test it quickly :)"
  • No technical reason is provided for why NPU wouldn't support this

Recommendation: Either:

  • Change NPU to return True if testing confirms it works
  • Add a comment explaining the technical limitation if NPU truly can't support it

3. Test coverage doesn't include new platforms

The existing test at tests/e2e/offline_inference/test_diffusion_cpu_offload.py:51 explicitly skips NPU and ROCm:

@pytest.mark.skipif(current_omni_platform.is_npu() or current_omni_platform.is_rocm(), 
                    reason="Hardware not supported")

This contradicts the PR's goal of enabling these platforms.

Recommendation: Update the test to:

  • Remove the skipif for ROCm (since it now returns True)
  • Either remove NPU from skipif or keep it if NPU truly doesn't support offloading
  • Add XPU to the hardware_test decorator resources
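Assuming the decision is to keep NPU skipped until verified, the updated decorator might look like the sketch below (`current_omni_platform` is stubbed here so the snippet is self-contained; the real test imports it from vllm_omni):

```python
import pytest


class _PlatformStub:
    """Stand-in for vllm_omni's current_omni_platform singleton."""

    def is_npu(self) -> bool:
        return False


current_omni_platform = _PlatformStub()


# ROCm is dropped from the condition since it now reports
# supports_cpu_offload() == True; NPU stays skipped only until verified.
@pytest.mark.skipif(
    current_omni_platform.is_npu(),
    reason="CPU offloading not yet verified on NPU",
)
def test_diffusion_cpu_offload():
    ...
```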

4. Missing documentation

The PR checklist mentions updating documentation, but no docs were changed. Users need to know:

  • Which platforms now support CPU offloading
  • Any platform-specific limitations or configuration

Recommendation: Update relevant documentation to reflect the new platform support.

5. Incomplete test plan

The PR description shows manual testing output for XPU but:

  • No test plan section is filled out
  • No automated test results provided
  • No comparison of before/after behavior

Minor observations:

  • The warning message improvement is good (shows device name)
  • The logic change from not is_cuda() to not supports_cpu_offload() is semantically correct
  • Code style is consistent with the existing codebase

Verdict

The core idea is sound and the implementation approach is correct, but the PR needs work before merging:

Must fix:

  1. Add supports_cpu_offload() to the base interface
  2. Clarify NPU support status (test it or document why it's disabled)
  3. Update test coverage to match the new platform support

Should fix:
4. Update documentation
5. Provide complete test results

The PR is on the right track but incomplete in its current state.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Adding inline comments for specific issues

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Inline code review comments

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Additional inline comments

def get_free_memory(cls, device: torch.device | None = None) -> int:
    free, _ = torch.cuda.mem_get_info(device)
    return free

Collaborator


ROCm now supports CPU offloading. The test at tests/e2e/offline_inference/test_diffusion_cpu_offload.py:51 should be updated to remove ROCm from the skipif condition.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review Summary

This PR introduces a good abstraction for CPU offloading support across platforms, but needs several fixes before merging:

Must fix:

  1. Add supports_cpu_offload() to the base OmniPlatform interface
  2. Clarify NPU support status (enable it or document why it's disabled)
  3. Update test coverage to match new platform support

Should fix:
4. Update documentation for new platform support
5. Provide complete test results

See inline comments and the detailed review comment for specifics.

@tjtanaa
Contributor

tjtanaa commented Feb 26, 2026

@xuechendi can you remove the @pytest.mark.skipif(current_omni_platform.is_npu() or current_omni_platform.is_rocm(), reason="Hardware not supported") from the xpu_cpu_offloading/tests/e2e/offline_inference/test_diffusion_cpu_offload.py to trigger the tests on non-cuda? Thanks

I have run the test locally and it is passing.

gcanlin and others added 5 commits February 26, 2026 08:31
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi
Contributor Author

@tjtanaa, done, the test skip is removed

@xuechendi
Contributor Author

I realized that if we want to enable UT for non-CUDA devices, this PR is not enough, because we also need to generalize GpuMemoryMonitor. I initiated a new PR for this purpose: #1526

@gcanlin @hsliuustc0106 @tjtanaa, I didn't put them in the same PR because they are related but not the same topic. Please help review.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi changed the title [XPU][NPU][RMC] enable cpu_offloading flag for non_cuda [XPU][NPU][ROCM] enable cpu_offloading flag for non_cuda Feb 26, 2026
@xuechendi
Contributor Author

@hsliuustc0106 , please review again, thanks, PR is rebased

@xuechendi xuechendi requested a review from tjtanaa February 27, 2026 03:49
@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Feb 27, 2026
@david6666666
Collaborator

LGTM

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


lgtm

@hsliuustc0106 hsliuustc0106 merged commit 482f9b8 into vllm-project:main Feb 27, 2026
7 checks passed