[XPU][NPU][ROCM] enable cpu_offloading flag for non_cuda#1488

Merged
hsliuustc0106 merged 12 commits into vllm-project:main from xuechendi:xpu_cpu_offloading
Feb 27, 2026

Conversation

@xuechendi
Contributor


Purpose

Currently, cpu_offloading is only available on CUDA. However, CPU offloading is also a very useful feature on memory-constrained devices such as the Intel Arc B60.

This PR provides a new way to decide whether a given device should expose CPU-offloading capability.

Verified on XPU
After this PR, I can confirm cpu_offloading is effective on XPU:

[Stage-0] INFO 02-25 23:21:58 [diffusers_loader.py:266] Loading weights took 0.99 seconds
[Stage-0] INFO 02-25 23:21:59 [diffusion_model_runner.py:117] Model loading took 0.1617 GiB and 2.941193 seconds
[Stage-0] INFO 02-25 23:21:59 [diffusion_model_runner.py:122] Model runner: Model loaded successfully.
[Stage-0] INFO 02-25 23:21:59 [diffusion_model_runner.py:127]  Enabling offloader backend: ModelLevelOffloadBackend
[Stage-0] DEBUG 02-25 23:22:01 [sequential_backend.py:122] Registered offload hook for Flux2Transformer2DModel
[Stage-0] DEBUG 02-25 23:22:01 [sequential_backend.py:133] Registered offload hook for Qwen3ForCausalLM

[Stage-0] DEBUG 02-25 23:22:01 [sequential_backend.py:78] Swapped: ['Flux2Transformer2DModel'] -> CPU, Qwen3ForCausalLM -> GPU
[Stage-0] DEBUG 02-25 23:22:18 [sequential_backend.py:78] Swapped: ['Qwen3ForCausalLM'] -> CPU, Flux2Transformer2DModel -> GPU

[Stage-0] DEBUG 02-25 23:22:33 [sequential_backend.py:78] Swapped: ['Flux2Transformer2DModel'] -> CPU, Qwen3ForCausalLM -> GPU
[Stage-0] DEBUG 02-25 23:22:36 [sequential_backend.py:78] Swapped: ['Qwen3ForCausalLM'] -> CPU, Flux2Transformer2DModel -> GPU
[Stage-0] DEBUG 02-25 23:22:37 [sequential_backend.py:78] Swapped: ['Qwen3ForCausalLM'] -> CPU, Flux2Transformer2DModel -> GPU
...

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8ed71998f


Contributor

@gcanlin gcanlin left a comment


Thanks, I think it will also work on NPU. Let me test it quickly :)

Comment on lines +58 to 63
if not current_omni_platform.supports_cpu_offload() or current_omni_platform.get_device_count() < 1:
    logger.warning(
        "Current device: %s does not support CPU offloading. Skipping offloading.",
        current_omni_platform.get_device_name(),
    )
    return None
Contributor


I tested CPU offload on NPU and it works now. Could you please remove this check directly? This feature should always be enabled on every platform :)

Suggested change
if not current_omni_platform.supports_cpu_offload() or current_omni_platform.get_device_count() < 1:
    logger.warning(
        "Current device: %s does not support CPU offloading. Skipping offloading.",
        current_omni_platform.get_device_name(),
    )
    return None

Contributor


@gcanlin @hsliuustc0106 @xuechendi is this regular cpu offloading or layerwise cpu offloading?

Contributor

@gcanlin gcanlin Feb 26, 2026


@gcanlin @hsliuustc0106 @xuechendi is this regular cpu offloading or layerwise cpu offloading?

This PR enables the entry point for CPU offloading, which includes layerwise offloading.

Layerwise cpu offloading is enabled by #1492.

Collaborator


So, why do we need to have 2 PRs?

Contributor


So, why do we need to have 2 PRs?

Not really needed; I just opened #1492 to enable CPU offload more broadly. @xuechendi, would you mind if I directly remove this if condition in my PR, which also enables layerwise offload for other platforms? Feel free to merge my PR into this one.

Contributor Author


Got it, I have rebased #1492 onto this PR.
Meanwhile, I prefer to keep the option for future HW support, so I moved the check to the base class with a default of True.
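A minimal sketch of that base-class default (class and method names are assumed from this discussion, not copied from the actual vllm_omni/platforms/interface.py):

```python
class OmniPlatform:
    """Sketch of the platform base class (assumed shape, not the real one)."""

    @classmethod
    def supports_cpu_offload(cls) -> bool:
        # Default to True so future hardware backends get CPU offloading
        # without having to opt in explicitly.
        return True


class HypotheticalNoOffloadPlatform(OmniPlatform):
    """A hypothetical platform that cannot offload overrides the default."""

    @classmethod
    def supports_cpu_offload(cls) -> bool:
        return False
```

With this shape, XPU, CUDA, ROCm, and NPU inherit True for free, and only a platform with a genuine technical limitation needs to override the method.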

@hsliuustc0106
Collaborator

PR Review: [XPU][NPU][RMC] enable cpu_offloading flag for non_cuda (#1488)

Summary

This PR enables CPU offloading capability for non-CUDA platforms (XPU, NPU, ROCm) by introducing a new platform capability method supports_cpu_offload(). Previously, CPU offloading was hardcoded to only work with CUDA devices.

Code Changes Analysis

Positive aspects:

  • Clean abstraction: Replaces hardcoded is_cuda() check with a capability-based approach
  • Minimal changes: Only 21 additions, 2 deletions across 5 files
  • Consistent implementation: All platform classes get the new method
  • Better error messaging: Now shows which device doesn't support offloading

Issues identified:

1. Missing base interface declaration (Critical)

The supports_cpu_offload() method is added to all platform implementations but is NOT declared in the OmniPlatform base class at vllm_omni/platforms/interface.py:20-100. This breaks the interface contract.

Recommendation: Add to interface.py:

@classmethod
def supports_cpu_offload(cls) -> bool:
    """Check if the platform supports CPU offloading for models."""
    raise NotImplementedError

2. NPU platform returns False without justification

In vllm_omni/platforms/npu/platform.py:91, NPU returns False while XPU, CUDA, and ROCm return True. However:

  • The PR description states the author verified it works on XPU
  • A reviewer (gcanlin) commented "I think it will also work on NPU. Let me test it quickly :)"
  • No technical reason is provided for why NPU wouldn't support this

Recommendation: Either:

  • Change NPU to return True if testing confirms it works
  • Add a comment explaining the technical limitation if NPU truly can't support it

3. Test coverage doesn't include new platforms

The existing test at tests/e2e/offline_inference/test_diffusion_cpu_offload.py:51 explicitly skips NPU and ROCm:

@pytest.mark.skipif(current_omni_platform.is_npu() or current_omni_platform.is_rocm(), 
                    reason="Hardware not supported")

This contradicts the PR's goal of enabling these platforms.

Recommendation: Update the test to:

  • Remove the skipif for ROCm (since it now returns True)
  • Either remove NPU from skipif or keep it if NPU truly doesn't support offloading
  • Add XPU to the hardware_test decorator resources
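Assuming the decision is to keep NPU skipped until verified, the updated decorator might look like the sketch below (`current_omni_platform` is stubbed here so the snippet is self-contained; the real test imports it from vllm_omni):

```python
import pytest


class _PlatformStub:
    """Stand-in for vllm_omni's current_omni_platform singleton."""

    def is_npu(self) -> bool:
        return False


current_omni_platform = _PlatformStub()


# ROCm is dropped from the condition since it now reports
# supports_cpu_offload() == True; NPU stays skipped only until verified.
@pytest.mark.skipif(
    current_omni_platform.is_npu(),
    reason="CPU offloading not yet verified on NPU",
)
def test_diffusion_cpu_offload():
    ...
```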

4. Missing documentation

The PR checklist mentions updating documentation, but no docs were changed. Users need to know:

  • Which platforms now support CPU offloading
  • Any platform-specific limitations or configuration

Recommendation: Update relevant documentation to reflect the new platform support.

5. Incomplete test plan

The PR description shows manual testing output for XPU but:

  • No test plan section is filled out
  • No automated test results provided
  • No comparison of before/after behavior

Minor observations:

  • The warning message improvement is good (shows device name)
  • The logic change from not is_cuda() to not supports_cpu_offload() is semantically correct
  • Code style is consistent with the existing codebase

Verdict

The core idea is sound and the implementation approach is correct, but the PR needs work before merging:

Must fix:

  1. Add supports_cpu_offload() to the base interface
  2. Clarify NPU support status (test it or document why it's disabled)
  3. Update test coverage to match the new platform support

Should fix:
4. Update documentation
5. Provide complete test results

The PR is on the right track but incomplete in its current state.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Adding inline comments for specific issues

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Inline code review comments

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Additional inline comments

def get_free_memory(cls, device: torch.device | None = None) -> int:
    free, _ = torch.cuda.mem_get_info(device)
    return free

Collaborator


ROCm now supports CPU offloading. The test at tests/e2e/offline_inference/test_diffusion_cpu_offload.py:51 should be updated to remove ROCm from the skipif condition.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review Summary

This PR introduces a good abstraction for CPU offloading support across platforms, but needs several fixes before merging:

Must fix:

  1. Add supports_cpu_offload() to the base OmniPlatform interface
  2. Clarify NPU support status (enable it or document why it's disabled)
  3. Update test coverage to match new platform support

Should fix:
4. Update documentation for new platform support
5. Provide complete test results

See inline comments and the detailed review comment for specifics.

@tjtanaa
Contributor

tjtanaa commented Feb 26, 2026

@xuechendi can you remove the @pytest.mark.skipif(current_omni_platform.is_npu() or current_omni_platform.is_rocm(), reason="Hardware not supported") from the xpu_cpu_offloading/tests/e2e/offline_inference/test_diffusion_cpu_offload.py to trigger the tests on non-cuda? Thanks

I have run the test locally and it is passing.

gcanlin and others added 5 commits February 26, 2026 08:31
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi
Contributor Author

@tjtanaa, done, the test skip is removed

@xuechendi
Contributor Author

I realized that if we want to enable UT for non-CUDA devices, this PR is not enough, because we also need to generalize GpuMemoryMonitor. I initiated a new PR for this purpose: #1526

@gcanlin @hsliuustc0106 @tjtanaa, I didn't put them in the same PR because they are related but not the same topic. Please help review.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi changed the title [XPU][NPU][RMC] enable cpu_offloading flag for non_cuda [XPU][NPU][ROCM] enable cpu_offloading flag for non_cuda Feb 26, 2026
@xuechendi
Contributor Author

@hsliuustc0106 , please review again, thanks, PR is rebased

@xuechendi xuechendi requested a review from tjtanaa February 27, 2026 03:49
@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Feb 27, 2026
@david6666666
Collaborator

LGTM

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


lgtm

@hsliuustc0106 hsliuustc0106 merged commit 482f9b8 into vllm-project:main Feb 27, 2026
7 checks passed