-
Notifications
You must be signed in to change notification settings - Fork 77
Open
Labels
bugSomething isn't workingSomething isn't working
Description
xllm service had interrupted:
[root@pod-1427689232471265280 scripts]# ENABLE_DECODE_RESPONSE_TO_SERVICE=true xllm_master_serving --etcd_addr="127.0.0.1:2379" --http_server_port 28888 --rpc_server_port 28889 --tokenizer_path=/workspace/cache/grtest-zz/cmss-hongfugui/deepseek-r1-0528-mtp
I20251022 15:08:56.082630 140899 master.cpp:157] Starting xllm master service.
I20251022 15:08:56.158654 140899 tokenizer_factory.cpp:18] Create fast tokenizer.
I20251022 15:08:56.516822 140899 jinja_chat_template.cpp:30] Jinja chat template init succeed.
I20251022 15:08:56.521587 140899 etcd_client.cpp:26] EtcdClient init put start!
I20251022 15:08:56.528378 140899 etcd_client.cpp:28] EtcdClient init put end!
E20251022 15:08:56.529011 140899 etcd_client.cpp:104] etcd get XLLM:SERVICE:MASTER failed: etcd-cpp-apiv3: key not found
I20251022 15:08:56.531591 140899 scheduler.cpp:37] Set current service as master!
I20251022 15:08:56.533705 140899 instance_mgr.cpp:74] Load instance info from etcd:0
I20251022 15:08:56.544333 141110 server.cpp:1200] Server[xllm_service::XllmRpcService] is serving on port=28889.
I20251022 15:08:56.544335 141109 server.cpp:1200] Server[xllm_service::XllmHttpServiceImpl] is serving on port=28888.
I20251022 15:08:56.544373 141110 server.cpp:1203] Check out http://pod-1427689232471265280:28889 in web browser.
I20251022 15:08:56.544387 141109 server.cpp:1203] Check out http://pod-1427689232471265280:28888 in web browser.
I20251022 15:08:56.544584 141110 master.cpp:134] Xllm rpc server started on: 0.0.0.0:28889
I20251022 15:08:56.544582 141109 master.cpp:97] Xllm http server started on: 0.0.0.0:28888
W20251022 15:08:56.544605 141110 controller.cpp:1606] SIGINT was installed with 1
W20251022 18:35:33.321612 141130 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:39:14.988564 141136 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:39:38.066045 141136 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:45:57.154168 141124 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:48:30.089658 141114 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:54:01.736268 141129 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:54:11.796141 141141 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 18:58:15.388222 141116 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:00:34.459403 141125 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:03:54.569886 141116 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:11:16.909958 141143 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:16:37.472590 141129 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:18:54.291129 141128 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:23:19.276441 141138 progressive_attachment.cpp:120] Write an empty chunk. To suppress this warning, check emptiness of the chunk before calling ProgressiveAttachment.Write()
W20251022 19:38:23.268654 141144 service.cpp:98] Fail to send stream generation, [E101]Fail to connect Socket{id=2234 addr=172.20.185.82:14830} (0x0xfffd5002c0f0): Network is unreachable
E20251022 19:38:23.268796 141106 service.cpp:184] Redirect to instance error: [E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
E20251022 19:38:23.269690 141104 service.cpp:184] Redirect to instance error: [E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
E20251022 19:38:23.269807 141104 service.cpp:184] Redirect to instance error: [E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
W20251022 19:38:23.273672 141114 service.cpp:98] Fail to send stream generation, [E101]Fail to connect Socket{id=3171 addr=172.20.185.82:14830} (0x0xfffd3c030f00): Network is unreachable
W20251022 19:38:23.273731 141114 service.cpp:98] Fail to send stream generation, [E110]Fail to connect Socket{id=2400 addr=172.20.185.82:14830} (0x0xfffd2c02b4f0): Connection timed out [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
F20251022 19:38:23.273810 141114 controller.cpp:904] Check failed: false A previous non-backup request responded, cid=40969693036549 current_cid=40969693036546 initial_cid=40969693036545 stream_user_data=0 sending_sock=0
*** Check failure stack trace: ***
E20251022 19:38:23.273854 141107 service.cpp:184] Redirect to instance error: [E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
@ 0xf625dc google::LogMessage::SendToLog()
F20251022 19:38:23.275838 141120 controller.cpp:904] Check failed: false A previous non-backup request responded, cid=44233868181509 current_cid=44233868181506 initial_cid=44233868181505 stream_user_data=0 sending_sock=0
*** Check failure stack trace: ***
E20251022 19:38:23.276041 141077 service.cpp:184] Redirect to instance error: [E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R1][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R2][E112]Not connected to 172.20.185.82:14830 yet, server_id=255 [R3][E112]Not connected to 172.20.185.82:14830 yet, server_id=255
F20251022 19:38:23.275838 141120 controller.cpp:904] Check failed: false A previous non-backup request responded, cid=44233868181509 current_cid=44233868181506 initial_cid=44233868181505 stream_user_data=0 sending_sock=0F20251022 19:38:23.277345 141127 controller.cpp:904] Check failed: false A previous non-backup request responded, cid=40969693036549 current_cid=40969693036546 initial_cid=40969693036545 stream_user_data=0 sending_sock=0
*** Check failure stack trace: ***At the same time, the prefill instance node0, xllm also had interrupted:
...
I20251022 16:25:46.491890 141329 llm_engine.cpp:247] kv cache capacity: bytes: 10113915393, blocks: 1124, slot_size: 1152
I20251022 16:25:46.492002 141329 llm_engine.cpp:277] Initializing k cache with shape: [1124 128 1 512]
I20251022 16:25:46.492102 141329 llm_engine.cpp:278] Initializing v cache with shape: [1124 128 1 64]
I20251022 16:25:46.493528 143701 llm_data_dist_transfer.cpp:44] Create LlmDataDistTransfer for prefill instance.
I20251022 16:25:58.824947 143701 llm_data_dist_transfer.cpp:76] Initialize LlmDataList success.
E20251022 16:26:00.130987 141329 xllm_server_registry.cpp:37] Server DisaggPDServer not existed.
I20251022 16:26:00.133548 146036 server.cpp:1200] Server[xllm::DisaggPDService] is serving on port=7777.
I20251022 16:26:00.133733 146036 server.cpp:1203] Check out http://pod-1427689232471265280:7777 in web browser.
I20251022 16:26:00.133850 146036 xllm_server.cpp:172] Disagg PD server started on address 172.20.185.82:7777, idle_timeout_sec: -1, num_threads: 32, max_concurrency: 0
I20251022 16:26:00.231866 141329 disagg_pd_scheduler.cpp:69] Instance info: instance name = 172.20.185.82:14830, instance rpc_address = 172.20.185.82:7777, instance type = PREFILL
I20251022 16:26:21.218425 141329 xservice_client.cpp:183] Success register instance to etcd.
I20251022 16:26:21.239382 141329 jinja_chat_template.cpp:30] Jinja chat template init succeed.
I20251022 16:26:21.586486 141329 server.cpp:1200] Server[xllm::APIService] is serving on port=14830.
I20251022 16:26:21.586807 141329 server.cpp:1203] Check out http://pod-1427689232471265280:14830 in web browser.
I20251022 16:26:21.586951 141329 xllm_server.cpp:59] Brpc Server started on port 14830, idle_timeout_sec: -1, num_threads: 32, max_concurrency: 0
W20251022 18:08:59.582523 146035 disagg_pd_scheduler.cpp:144] Failed to find channel to instance: 172.20.56.213:14830, try to create channel now.
W20251022 19:38:33.252043 141482 input_messenger.cpp:387] Fail to read from Socket{id=50 fd=114 addr=172.20.185.82:28889:39850} (0x496bff50): Connection reset by peer [104]
E20251022 19:38:33.252705 145898 xservice_client.cpp:362] Failed to send heartbeat to xservice 172.20.185.82:28889, error msg is: [E104]Fail to read from Socket{id=50fd=114 addr=172.20.185.82:28889:39850} (0x0x496bff50): Connection reset by peer [R1][E111]Fail to connect Socket{id=17179872765 addr=172.20.185.82:28889} (0x0xffff1020a160): Connection refused [R2][E112]Not connected to 172.20.185.82:28889 yet, server_id=49 [R3][E112]Not connected to 172.20.185.82:28889 yet, server_id=49
I20251022 19:38:33.352806 141483 socket.cpp:2591] Checking Socket{id=49 addr=172.20.185.82:28889} (0x496bfc50)
E20251022 19:38:33.753262 145898 xservice_client.cpp:362] Failed to send heartbeat to xservice 172.20.185.82:28889, error msg is: [E112]Not connected to 172.20.185.82:28889 yet, server_id=49 [R1][E112]Not connected to 172.20.185.82:28889 yet, server_id=49 [R2][E112]Not connected to 172.20.185.82:28889 yet, server_id=49 [R3][E112]Not connected to 172.20.185.82:28889 yet, server_id=49
...
I service 2 nodes with 16 cards as a Prefill instance and 2 nodes with 16 cards as a Decode instance. Here's the script:
export PYTHON_INCLUDE_PATH="$(python3 -c 'from sysconfig import get_paths; print(get_paths()["include"])')"
export PYTHON_LIB_PATH="$(python3 -c 'from sysconfig import get_paths; print(get_paths()["include"])')"
export PYTORCH_NPU_INSTALL_PATH=/usr/local/libtorch_npu/
export PYTORCH_INSTALL_PATH="$(python3 -c 'import torch, os; print(os.path.dirname(os.path.abspath(torch.__file__)))')"
export LIBTORCH_ROOT="$(python3 -c 'import torch, os; print(os.path.dirname(os.path.abspath(torch.__file__)))')"
export LD_LIBRARY_PATH=/usr/local/libtorch_npu/lib:$LD_LIBRARY_PATH
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASDOPS_LOG_TO_STDOUT=1
export ASDOPS_LOG_LEVEL=ERROR
export ATB_LOG_TO_STDOUT=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export NPU_MEMORY_FRACTION=0.98
export ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=3
export ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1
export OMP_NUM_THREADS=12
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_OP_EXPANSION_MODE="AIV"
\rm -rf /root/atb/log/
\rm -rf /root/ascend/log/
\rm -rf core.*
MODEL_PATH="/workspace/cache/grtest-zz/cmss-hongfugui/deepseek-r1-0528-mtp"
MASTER_NODE_ADDR="172.20.185.82:9590"
START_PORT=14830
START_LISTEN_PORT=26000
START_DEVICE=0
LOG_DIR="log/deepseek_w8a8/p1/node1"
NNODES=8
WORLD_SIZE=16
export HCCL_IF_BASE_PORT=43439
\rm -rf "$LOG_DIR"
mkdir -p "$LOG_DIR"
for (( i=0; i<$NNODES; i++ ))
do
PORT=$((START_PORT + i))
DEVICE=$((START_DEVICE + i))
LOG_FILE="$LOG_DIR/node_$i.log"
DEVICE_IP=$(hccn_tool -i $i -ip -g | grep ipaddr | awk -F ':' '{print $2}')
LISTEN_PORT=$((START_LISTEN_PORT + i))
xllm \
--model $MODEL_PATH \
--model_id deepseek \
--port $PORT \
--devices="npu:$DEVICE" \
--master_node_addr=$MASTER_NODE_ADDR \
--nnodes=$WORLD_SIZE \
--node_rank=$i \
--device_ip $DEVICE_IP \
--max_memory_utilization=0.9 \
--max_tokens_per_batch=20000 \
--max_seqs_per_batch=3000 \
--block_size=128 \
--enable_prefix_cache=false \
--enable_chunked_prefill=false \
--communication_backend="hccl" \
--enable_schedule_overlap=true \
--enable_mla=true \
--dp_size=4 \
--ep_size=4 \
--rank_tablefile /workspace/cache/grtest-zz/yqh/scripts/gen_ranktable/rank_table.json \
--enable_disagg_pd=true \
--instance_role=PREFILL \
--etcd_addr=172.20.185.82:2379 \
--transfer_listen_port=$LISTEN_PORT \
> $LOG_FILE 2>&1 &
doneThe requests are:
evalscope perf --parallel 1 8 64 128 256 512 --number 5 40 320 640 1280 2560 --model deepseek --url http://127.0.0.1:28888/v1/chat/completions --api openai --dataset random --max-tokens 2048 --min-tokens 2048 --prefix-length 0 --min-prompt-length 7168 --max-prompt-length 7168 --tokenizer-path /workspace/cache/grtest-zz/cmss-hongfugui/deepseek-r1-0528-mtp --extra-args '{"ignore_eos": true}'Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working