[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT #3209
base: main
Conversation
Signed-off-by: jesse <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a performance optimization for NPU devices by replacing a blocking tolist() call with a non-blocking copy and event synchronization. This avoids device-wide stalls and improves performance for disaggregated setups. The changes are sound, but I've identified a critical bug in the buffer allocation that could lead to runtime errors under certain configurations. I've also suggested an improvement to the new unit test to make it more robust and maintainable.
self.sampled_token_ids_pinned_cpu = torch.empty(
    (self.max_model_len, 1),
    dtype=torch.int64,
    device="cpu",
    pin_memory=True)
The sampled_token_ids_pinned_cpu buffer is sized using self.max_model_len, but it is used to hold sampled_token_ids, which can contain up to self.max_num_reqs requests. If max_num_reqs is configured to be larger than max_model_len, this will cause a runtime error due to a size mismatch during the copy_ operation in _to_list. The buffer should instead be sized using self.max_num_reqs, which correctly represents the maximum number of requests in a batch. (A runnable sketch of the mismatch follows the suggested change below.)
Suggested change:

-        self.sampled_token_ids_pinned_cpu = torch.empty(
-            (self.max_model_len, 1),
-            dtype=torch.int64,
-            device="cpu",
-            pin_memory=True)
+        self.sampled_token_ids_pinned_cpu = torch.empty(
+            (self.max_num_reqs, 1),
+            dtype=torch.int64,
+            device="cpu",
+            pin_memory=True)
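For concreteness, here is a minimal, runnable sketch of the failure mode described above. The sizes are hypothetical and pinning is omitted so it runs on any host; in the real code both values come from the runner's configuration:

import torch

max_model_len = 4   # buffer (wrongly) sized by this dimension
max_num_reqs = 8    # a batch can be larger than the buffer

pinned = torch.empty((max_model_len, 1), dtype=torch.int64)
sampled = torch.zeros((max_num_reqs, 1), dtype=torch.int64)

try:
    # Slicing a 4-row buffer to 8 rows still yields only 4 rows, so
    # copy_ sees shapes (4, 1) vs (8, 1) and raises a RuntimeError.
    pinned[:max_num_reqs].copy_(sampled, non_blocking=True)
except RuntimeError as err:
    print(f"size mismatch: {err}")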
def test_init_creates_transfer_event_and_pinned_memory(mock_torch,
                                                       mock_torch_npu):
    """Test that initialization creates transfer event and pinned CPU memory."""
    # This is a simplified test focusing only on the new attributes.
    # We mock the entire __init__ process and only test the specific lines we added.

    # Mock torch.empty to return a mock tensor
    mock_pinned_tensor = MagicMock()
    mock_torch.empty.return_value = mock_pinned_tensor

    # Mock torch_npu.npu.Event - the nested mock structure must be set up
    mock_event = MagicMock()
    mock_torch_npu.npu.Event.return_value = mock_event

    # Create a runner instance using __new__ to bypass __init__
    runner = NPUModelRunner.__new__(NPUModelRunner)

    # Manually set the attributes we need for our test
    runner.max_model_len = 2048

    # Test the specific lines from the commit
    runner.transfer_event = mock_torch_npu.npu.Event()
    runner.sampled_token_ids_pinned_cpu = mock_torch.empty(
        (runner.max_model_len, 1),
        dtype=torch.int64,
        device="cpu",
        pin_memory=True)

    # Verify max_model_len is set
    assert runner.max_model_len == 2048

    # Verify transfer_event is created
    assert runner.transfer_event == mock_event
    mock_torch_npu.npu.Event.assert_called_once()

    # Verify pinned CPU memory is created with correct parameters
    assert runner.sampled_token_ids_pinned_cpu == mock_pinned_tensor
    mock_torch.empty.assert_called_with((2048, 1),
                                        dtype=torch.int64,
                                        device="cpu",
                                        pin_memory=True)
This test is fragile because it bypasses __init__ and duplicates the implementation logic for creating transfer_event and sampled_token_ids_pinned_cpu within the test body. This makes the test hard to maintain: changes in __init__ might not be reflected here, leading to the test passing while the actual code is broken, or vice versa.

A better approach is to test the behavior of __init__ by calling it and asserting the results, while mocking its complex dependencies. Alternatively, the logic for initializing these new attributes could be extracted into a separate helper method within NPUModelRunner, which can then be called and tested directly. This avoids code duplication and makes the test more robust.
For example, you could refactor NPUModelRunner like this:
class NPUModelRunner:
    def __init__(self, ...):
        # ... existing init code ...
        self._init_transfer_resources()

    def _init_transfer_resources(self):
        self.transfer_event = torch_npu.npu.Event()
        self.sampled_token_ids_pinned_cpu = torch.empty(
            (self.max_num_reqs, 1),
            dtype=torch.int64,
            device="cpu",
            pin_memory=True)
And the test would become:
@patch('vllm_ascend.worker.model_runner_v1.torch_npu')
@patch('vllm_ascend.worker.model_runner_v1.torch')
def test_init_transfer_resources(mock_torch, mock_torch_npu):
    # ... mock setup ...
    runner = NPUModelRunner.__new__(NPUModelRunner)
    runner.max_num_reqs = 64
    runner._init_transfer_resources()

    mock_torch_npu.npu.Event.assert_called_once()
    mock_torch.empty.assert_called_with((64, 1), ...)
    # ... other assertions ...
This approach tests the logic without duplicating it.
This change doesn't work with CANN 8.3; we're working on it.
This PR is based on top of vllm-project/vllm#22760
What this PR does / why we need it?
When we copy the sampled valid token ids from device to host, we avoid using tolist, which would trigger a device-wide stream synchronization if the source tensor is on the device. We change it to a non-blocking copy followed by an explicit event synchronization (an NPU event on Ascend).
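A hedged sketch of the resulting pattern (the method and attribute names follow the diff discussed in the review thread above; surrounding runner code is elided and the exact implementation may differ):

import torch
import torch_npu  # Ascend PyTorch plugin

class NPUModelRunner:
    # In __init__ (see the review thread above):
    #   self.transfer_event = torch_npu.npu.Event()
    #   self.sampled_token_ids_pinned_cpu = torch.empty(
    #       (self.max_num_reqs, 1), dtype=torch.int64,
    #       device="cpu", pin_memory=True)

    def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
        # Slice the preallocated pinned host buffer to the batch size and
        # start an asynchronous device-to-host copy on the current stream.
        pinned = self.sampled_token_ids_pinned_cpu[:sampled_token_ids.shape[0]]
        pinned.copy_(sampled_token_ids, non_blocking=True)
        # Record an event after the copy and wait only on that event,
        # rather than the device-wide sync a blocking .tolist() on a
        # device tensor would trigger.
        self.transfer_event.record()
        self.transfer_event.synchronize()
        # The pinned host tensor is now valid and cheap to read.
        return pinned.tolist()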
Does this PR introduce any user-facing change?
How was this patch tested?
Bring up a vLLM server and compare TTFT before and after the change.
Before: (screenshot omitted)
After: (screenshot omitted)
As shown in the figures, TTFT decreased.