Commit 73cadec

[P/D] [Bugfix] fix mooncake layerconnector dead when update_decoder_info fail (#7514)
### What this PR does / why we need it?

Fixes the Mooncake layerwise connector dying when `update_decoder_info` fails. When the decode node (D) is dead, a failed `update_decoder_info` call on the prefill node (P) should not cause node P to die as well.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

By CI.

- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@8b63257

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
1 parent 67aad1f commit 73cadec

1 file changed, +9 -1 lines changed


vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py

Lines changed: 9 additions & 1 deletion
@@ -1626,7 +1626,14 @@ def save_kv_layer(
 for req_id, req_meta in connector_metadata.requests.items():
     if len(req_meta.local_block_ids[layer_group_idx]) == 0:
         continue
-    req_meta_update = self.update_decoder_info(req_id, req_meta)
+    try:
+        req_meta_update = self.update_decoder_info(req_id, req_meta)
+    except Exception as e:
+        logger.warning(
+            f"MooncakeLayerwiseConnector transfer fail for req_id {req_id} in layer_idx "
+            f"{self.current_layer}, update_decoder_info with error: {e}"
+        )
+        continue
     logger.debug(f"Add request {req_id} to kv send layer thread. {req_meta_update=}")
     layer_send_task.send_request[req_id] = req_meta_update
layer_send_task.send_request[req_id] = req_meta_update
16321639

@@ -1681,6 +1688,7 @@ def update_decoder_info(self, req_id, req_meta: ReqMeta):
         f"from {req_meta.remote_host}:{req_meta.remote_port}"
         f"fail with error: {e}"
     )
+    raise e
 assert req_meta.remote_engine_id != self.engine_id, (
     f"Conflict engine id {req_meta.remote_engine_id} with local engine id {self.local_engine_id}."
 )
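
The added `raise e` is what makes the caller's `try`/`except` effective: a handler that only logs lets execution fall through, so the caller never learns that the lookup failed. The sketch below contrasts the two behaviours under assumed names (`query_remote`, `lookup_swallowing`, `lookup_reraising`, and `caller_skipped` are all hypothetical). It uses a bare `raise`, the more common Python idiom for re-raising the active exception; the patch itself uses `raise e`.

```python
import logging

logger = logging.getLogger("reraise_sketch")


def query_remote(fail):
    # Hypothetical remote query that fails when the peer is down.
    if fail:
        raise ConnectionError("remote node is down")
    return {"port": 1234}


def lookup_swallowing(fail):
    # Logs the error but hides it: the caller receives None.
    info = None
    try:
        info = query_remote(fail)
    except Exception as e:
        logger.error("query failed: %s", e)
    return info


def lookup_reraising(fail):
    # Logs the error and propagates it so the caller can react.
    try:
        return query_remote(fail)
    except Exception as e:
        logger.error("query failed: %s", e)
        raise


def caller_skipped(lookup):
    # Returns True only when the caller actually observed the failure.
    try:
        lookup(True)
    except Exception:
        return True
    return False
```

With the swallowing variant the caller cannot tell a failed lookup from a successful one, so it would continue with incomplete state; with the re-raising variant it can skip the request.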
