
Commit 69509bc

[bugfix] fix oom in aclgraph (#3158)
### What this PR does / why we need it?
Fix OOM in aclgraph.

1. In the current token-dispatch implementation, tensors are mounted on class instances to ease parameter passing between methods. This prevents those tensors from being recycled automatically and can, in some cases, lead to an out-of-memory error. To address this, we manually set the tensors to None once they are no longer used, releasing the corresponding memory.
2. The `profile_run` method is meant to accurately estimate the maximum NPU memory usage during vLLM inference. In certain scenarios, however, MoE models run inference via MC2, which involves communication and consumes additional NPU memory, so the profile run underestimates the peak. We address this by actively triggering the MC2 path during the profile run so that its initialization is included in the measurement.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@52d0cb8

Signed-off-by: WithHades <[email protected]>
1 parent 621aa7d commit 69509bc
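The first fix hinges on Python reference semantics: device memory held by a tensor is only returned to the caching allocator once the last reference to that tensor dies, and an attribute on a long-lived dispatcher instance is a reference that never dies on its own. Below is a minimal, device-agnostic sketch of the pattern, using a hypothetical `Dispatcher` class rather than the actual `TokenDispatcher` API:

```python
import torch


class Dispatcher:
    """Hypothetical sketch of the leak pattern this commit fixes."""

    def dispatch(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Mounting an intermediate on `self` makes it easy to pass
        # between methods, but the instance outlives the forward pass,
        # so this reference pins the tensor's memory indefinitely.
        self.topk_weights = torch.rand(hidden_states.shape[0], 8)
        return hidden_states

    def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:
        out = hidden_states * self.topk_weights.mean()
        # The fix: drop the reference as soon as the value is consumed,
        # so the allocator can reuse the memory for later layers.
        self.topk_weights = None
        return out
```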

2 files changed, +32 −0 lines changed

vllm_ascend/ops/moe/token_dispatcher.py

Lines changed: 24 additions & 0 deletions
```diff
@@ -272,6 +272,16 @@ def token_combine(self,
                 **kwargs_mc2
             ) if self.enable_dispatch_v2 else torch_npu.npu_moe_distribute_combine(
                 **kwargs_mc2)
+
+        # these values are no longer used, so they need to be set to None for memory release.
+        self.output = None
+        self.assist_info_for_combine = None
+        self.ep_recv_counts = None
+        self.topk_ids = None
+        self.topk_weights = None
+        self.mc2_mask = None
+        self.expert_map = None
+
         if self.shared_experts is None:
             return hidden_states
         else:
@@ -281,6 +291,9 @@ def token_combine(self,
             else:
                 shared_hidden_states, _ = self.shared_experts.down_proj(
                     self.shared_act)
+            self.shared_act = None
+            self.shared_experts = None
+            self.swiglu_out_scale = None
             return hidden_states, shared_hidden_states
 
 
@@ -374,6 +387,12 @@ def token_combine(self,
             probs=self.topk_weights)
         if len(self.original_shape) == 3:
             final_hidden_states = final_hidden_states.view(self.original_shape)
+
+        # these values are no longer used, so they need to be set to None for memory release.
+        self.expert_map = None
+        self.topk_weights = None
+        self.topk_ids = None
+        self.expanded_row_idx = None
         return final_hidden_states
 
 
@@ -564,9 +583,14 @@ def token_combine(self,
 
         output = self._combine_postprocess(permutated_local_input_tokens)
 
+        # these values are no longer used, so they need to be set to None for memory release.
         self.input_splits = None
         self.output_splits = None
         self.num_global_tokens_per_local_expert = None
+        self.topk_weights = None
+        self.reversed_local_input_permutation_mapping = None
+        self.reversed_global_input_permutation_mapping = None
+        self.global_input_tokens_local_experts_indices = None
 
         return output
```
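The hunks above clear each cached field at its own call site. A possible alternative, shown only as a hypothetical refactor (the field list is collected from the first hunk; `_release_cached_tensors` does not exist in the repository), would centralize the release so that newly added fields cannot be forgotten:

```python
class _ReleaseMixin:
    # Field names taken from the MC2 token_combine hunk above.
    _CACHED_FIELDS = (
        "output", "assist_info_for_combine", "ep_recv_counts",
        "topk_ids", "topk_weights", "mc2_mask", "expert_map",
    )

    def _release_cached_tensors(self) -> None:
        # Setting each attribute to None drops the dispatcher's last
        # reference, letting the caching allocator reclaim the memory.
        for name in self._CACHED_FIELDS:
            if getattr(self, name, None) is not None:
                setattr(self, name, None)
```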
vllm_ascend/worker/model_runner_v1.py

Lines changed: 8 additions & 0 deletions
```diff
@@ -2507,6 +2507,14 @@ def profile_run(self) -> None:
         with self.set_in_profile_run():
             hidden_states = self._dummy_run(self.max_num_tokens,
                                             with_prefill=True)
+            # MC2 will consume additional NPU memory.
+            # Therefore, we need to run the MC2 path once here to complete its initialization,
+            # allowing vLLM to correctly estimate the maximum memory required.
+            if self._select_moe_comm_method(
+                    self.mc2_tokens_capacity,
+                    with_prefill=True) == MoECommType.MC2:
+                self._dummy_run(self.mc2_tokens_capacity)
+
         output = None
         if get_pp_group().is_last_rank:
             if self.is_pooling_model:
```
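For context on why an under-measured profile run causes OOM later: the engine sizes the KV cache from the memory left over after the profile run's observed peak, so anything the profile misses (such as MC2's one-time communication buffers) is later allocated on top of an already-full budget. A schematic sketch of that sizing logic, not vLLM's actual code:

```python
def kv_cache_budget_bytes(total_bytes: int, profiled_peak_bytes: int,
                          gpu_memory_utilization: float = 0.9) -> int:
    """Schematic of why an under-measured profile run leads to OOM.

    The engine reserves roughly `total * utilization - peak` for the
    KV cache. If MC2's buffers are missing from `profiled_peak_bytes`,
    this budget is too large, and the real MC2 allocation later
    collides with KV-cache memory.
    """
    return int(total_bytes * gpu_memory_utilization) - profiled_peak_bytes
```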
