Skip to content

Why the service was interrupted sometimes #270

@yangqinghao-cmss

Description

@yangqinghao-cmss

xllm service had interrupted:

[root@pod-1427689232471265280 scripts]# ENABLE_DECODE_RESPONSE_TO_SERVICE=true xllm_master_serving   --etcd_addr="127.0.0.1:2379"   --http_server_port 28888   --rpc_server_port 28889   --tokenizer_path=/workspace/cache/grtest-zz/cmss-hongfugui/deepseek-r1-0528-mtp
I20251022 15:08:56.082630 140899 master.cpp:157] Starting xllm master service.
I20251022 15:08:56.158654 140899 tokenizer_factory.cpp:18] Create fast tokenizer.
I20251022 15:08:56.516822 140899 jinja_chat_template.cpp:30] Jinja chat template init succeed.
I20251022 15:08:56.521587 140899 etcd_client.cpp:26] EtcdClient init put start!
I20251022 15:08:56.528378 140899 etcd_client.cpp:28] EtcdClient init put end!
E20251022 15:08:56.529011 140899 etcd_client.cpp:104] etcd get XLLM:SERVICE:MASTER failed: etcd-cpp-apiv3: key not found
I20251022 15:08:56.531591 140899 scheduler.cpp:37] Set current service as master!
I20251022 15:08:56.533705 140899 instance_mgr.cpp:74] Load instance info from etcd:0
I20251022 15:08:56.544333 141110 server.cpp:1200] Server[xllm_service::XllmRpcService] is serving on port=28889.
I20251022 15:08:56.544335 141109 server.cpp:1200] Server[xllm_service::XllmHttpServiceImpl] is serving on port=28888.
I20251022 15:08:56.544373 141110 server.cpp:1203] Check out http://pod-1427689232471265280:28889 in web browser.
I20251022 15:08:56.544387 141109 server.cpp:1203] Check out http://pod-1427689232471265280:28888 in web browser.
I20251022 15:08:56.544584 141110 master.cpp:134] Xllm rpc server started on: 0.0.0.0:28889
I20251022 15:08:56.544582 141109 master.cpp:97] Xllm http server started on: 0.0.0.0:28888
W20251022 15:08:56.544605 141110 controller.cpp:1606] SIGINT was installed with 1
W20251022 18:35:33.321612 141130 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:39:14.988564 141136 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:39:38.066045 141136 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:45:57.154168 141124 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:48:30.089658 141114 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:54:01.736268 141129 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:54:11.796141 141141 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:58:15.388222 141116 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:00:34.459403 141125 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:03:54.569886 141116 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:11:16.909958 141143 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:16:37.472590 141129 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:18:54.291129 141128 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:23:19.276441 141138 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:38:23.268654 141144 service.cpp:98] Fail to send stream generation, [E101]Fail to connect Socket{id=2234 addr=172.20.185.82:14830} (0x0xfffd5002c0f0): Network is unreachable
E20251022 19:38:23.268796 141106 service.cpp:184] Redirect to instance error: [E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
E20251022 19:38:23.269690 141104 service.cpp:184] Redirect to instance error: [E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
E20251022 19:38:23.269807 141104 service.cpp:184] Redirect to instance error: [E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
W20251022 19:38:23.273672 141114 service.cpp:98] Fail to send stream generation, [E101]Fail to connect Socket{id=3171 addr=172.20.185.82:14830} (0x0xfffd3c030f00): Network is unreachable
W20251022 19:38:23.273731 141114 service.cpp:98] Fail to send stream generation, [E110]Fail to connect Socket{id=2400 addr=172.20.185.82:14830} (0x0xfffd2c02b4f0): Connection timed out [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
F20251022 19:38:23.273810 141114 controller.cpp:904] Check failed: false A previous non-backup request responded, cid=40969693036549 current_cid=40969693036546 initial_cid=40969693036545 stream_user_data=0 sending_sock=0
*** Check failure stack trace: ***
E20251022 19:38:23.273854 141107 service.cpp:184] Redirect to instance error: [E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
    @           0xf625dc  google::LogMessage::SendToLog()
F20251022 19:38:23.275838 141120 controller.cpp:904] Check failed: false A previous non-backup request responded, cid=44233868181509 current_cid=44233868181506 initial_cid=44233868181505 stream_user_data=0 sending_sock=0
*** Check failure stack trace: ***
E20251022 19:38:23.276041 141077 service.cpp:184] Redirect to instance error: [E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
F20251022 19:38:23.275838 141120 controller.cpp:904] Check failed: false A previous non-backup request responded, cid=44233868181509 current_cid=44233868181506 initial_cid=44233868181505 stream_user_data=0 sending_sock=0F20251022 19:38:23.277345 141127 controller.cpp:904] Check failed: false A previous non-backup request responded, cid=40969693036549 current_cid=40969693036546 initial_cid=40969693036545 stream_user_data=0 sending_sock=0
*** Check failure stack trace: ***

At the same time, the prefill instance node0, xllm also had interrupted:

...
I20251022 16:25:46.491890 141329 llm_engine.cpp:247] kv cache capacity: bytes: 10113915393, blocks: 1124, slot_size: 1152
I20251022 16:25:46.492002 141329 llm_engine.cpp:277] Initializing k cache with shape: [1124 128 1 512]
I20251022 16:25:46.492102 141329 llm_engine.cpp:278] Initializing v cache with shape: [1124 128 1 64]
I20251022 16:25:46.493528 143701 llm_data_dist_transfer.cpp:44] Create LlmDataDistTransfer for prefill instance.
I20251022 16:25:58.824947 143701 llm_data_dist_transfer.cpp:76] Initialize LlmDataList success.
E20251022 16:26:00.130987 141329 xllm_server_registry.cpp:37] Server DisaggPDServer not existed.
I20251022 16:26:00.133548 146036 server.cpp:1200] Server[xllm::DisaggPDService] is serving on port=7777.
I20251022 16:26:00.133733 146036 server.cpp:1203] Check out http://pod-1427689232471265280:7777 in web browser.
I20251022 16:26:00.133850 146036 xllm_server.cpp:172] Disagg PD server started on address 172.20.185.82:7777, idle_timeout_sec: -1, num_threads: 32, max_concurrency: 0
I20251022 16:26:00.231866 141329 disagg_pd_scheduler.cpp:69] Instance info: instance name = 172.20.185.82:14830, instance rpc_address = 172.20.185.82:7777, instance type = PREFILL
I20251022 16:26:21.218425 141329 xservice_client.cpp:183] Success register instance to etcd.
I20251022 16:26:21.239382 141329 jinja_chat_template.cpp:30] Jinja chat template init succeed.
I20251022 16:26:21.586486 141329 server.cpp:1200] Server[xllm::APIService] is serving on port=14830.
I20251022 16:26:21.586807 141329 server.cpp:1203] Check out http://pod-1427689232471265280:14830 in web browser.
I20251022 16:26:21.586951 141329 xllm_server.cpp:59] Brpc Server started on port 14830, idle_timeout_sec: -1, num_threads: 32, max_concurrency: 0
W20251022 18:08:59.582523 146035 disagg_pd_scheduler.cpp:144] Failed to find channel to instance: 172.20.56.213:14830, try to create channel now.
W20251022 19:38:33.252043 141482 input_messenger.cpp:387] Fail to read from Socket{id=50 fd=114 addr=172.20.185.82:28889:39850} (0x496bff50): Connection reset by peer [104]
E20251022 19:38:33.252705 145898 xservice_client.cpp:362] Failed to send heartbeat to xservice 172.20.185.82:28889, error msg is: [E104]Fail to read from Socket{id=50fd=114 addr=172.20.185.82:28889:39850} (0x0x496bff50): Connection reset by peer [R1][E111]Fail to connect Socket{id=17179872765 addr=172.20.185.82:28889} (0x0xffff1020a160): Connection refused [R2][E112]Not connected to 172.20.185.82:28889 yet, server_id=49 [R3][E112]Not connected to 172.20.185.82:28889 yet, server_id=49
I20251022 19:38:33.352806 141483 socket.cpp:2591] Checking Socket{id=49 addr=172.20.185.82:28889} (0x496bfc50)
E20251022 19:38:33.753262 145898 xservice_client.cpp:362] Failed to send heartbeat to xservice 172.20.185.82:28889, error msg is: [E112]Not connected to 172.20.185.82:28889 yet, server_id=49 [R1][E112]Not connected to 172.20.185.82:28889 yet, server_id=49 [R2][E112]Not connected to 172.20.185.82:28889 yet, server_id=49 [R3][E112]Not connected to 172.20.185.82:28889 yet, server_id=49
...

I service 2 nodes with 16 cards as a Prefill instance and 2 nodes with 16 cards as a Decode instance. Here's the script:

export PYTHON_INCLUDE_PATH="$(python3 -c 'from sysconfig import get_paths; print(get_paths()["include"])')"
export PYTHON_LIB_PATH="$(python3 -c 'from sysconfig import get_paths; print(get_paths()["include"])')"
export PYTORCH_NPU_INSTALL_PATH=/usr/local/libtorch_npu/
export PYTORCH_INSTALL_PATH="$(python3 -c 'import torch, os; print(os.path.dirname(os.path.abspath(torch.__file__)))')"
export LIBTORCH_ROOT="$(python3 -c 'import torch, os; print(os.path.dirname(os.path.abspath(torch.__file__)))')"
export LD_LIBRARY_PATH=/usr/local/libtorch_npu/lib:$LD_LIBRARY_PATH

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASDOPS_LOG_TO_STDOUT=1
export ASDOPS_LOG_LEVEL=ERROR
export ATB_LOG_TO_STDOUT=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export NPU_MEMORY_FRACTION=0.98
export ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=3
export ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1

export OMP_NUM_THREADS=12

export HCCL_CONNECT_TIMEOUT=7200
export HCCL_OP_EXPANSION_MODE="AIV"

\rm -rf /root/atb/log/
\rm -rf /root/ascend/log/
\rm -rf core.*


MODEL_PATH="/workspace/cache/grtest-zz/cmss-hongfugui/deepseek-r1-0528-mtp"
MASTER_NODE_ADDR="172.20.185.82:9590"
START_PORT=14830
START_LISTEN_PORT=26000
START_DEVICE=0
LOG_DIR="log/deepseek_w8a8/p1/node1"
NNODES=8
WORLD_SIZE=16

export HCCL_IF_BASE_PORT=43439

\rm -rf "$LOG_DIR"
mkdir -p "$LOG_DIR"

for (( i=0; i<$NNODES; i++ ))
do
  PORT=$((START_PORT + i))
  DEVICE=$((START_DEVICE + i))
  LOG_FILE="$LOG_DIR/node_$i.log"
  DEVICE_IP=$(hccn_tool -i $i -ip -g | grep ipaddr | awk -F ':' '{print $2}')
  LISTEN_PORT=$((START_LISTEN_PORT + i))
  xllm \
    --model $MODEL_PATH \
    --model_id deepseek \
    --port $PORT \
    --devices="npu:$DEVICE" \
    --master_node_addr=$MASTER_NODE_ADDR \
    --nnodes=$WORLD_SIZE \
    --node_rank=$i \
    --device_ip $DEVICE_IP \
    --max_memory_utilization=0.9 \
    --max_tokens_per_batch=20000 \
    --max_seqs_per_batch=3000 \
    --block_size=128 \
    --enable_prefix_cache=false \
    --enable_chunked_prefill=false \
    --communication_backend="hccl" \
    --enable_schedule_overlap=true \
    --enable_mla=true \
    --dp_size=4 \
    --ep_size=4 \
    --rank_tablefile /workspace/cache/grtest-zz/yqh/scripts/gen_ranktable/rank_table.json \
    --enable_disagg_pd=true \
    --instance_role=PREFILL \
    --etcd_addr=172.20.185.82:2379 \
    --transfer_listen_port=$LISTEN_PORT \
    > $LOG_FILE 2>&1 &
done

The requests are:

evalscope perf   --parallel 1 8 64 128 256 512   --number 5 40 320 640 1280 2560  --model deepseek   --url http://127.0.0.1:28888/v1/chat/completions    --api openai   --dataset random   --max-tokens 2048   --min-tokens 2048   --prefix-length 0   --min-prompt-length  7168   --max-prompt-length  7168   --tokenizer-path /workspace/cache/grtest-zz/cmss-hongfugui/deepseek-r1-0528-mtp   --extra-args '{"ignore_eos": true}'

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions