Description
Describe the bug
After applying #3145, I still get the error. The details are as follows:
-------------------------------------------------log start----------------------------------------------------------------
[ps-57] [ UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.372264][169925][rdma_endpoint.cpp:895] Fail to ibv_post_send: Cannot allocate memory, window=1, sq_current=16
[ps-57] [ UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.372387][169925][socket.cpp:1841] Fail to keep-write into Socket{id=861 fd=787 addr=10.39.61.118:55530:19336} (0x7f152db0ec40): Cannot allocate memory [12]
[ps-57] [ UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.494266][170049][rdma_endpoint.cpp:578] Fail to read Hello Message from client:brpc::Socket{id=1055 fd=787 addr=10.39.61.118:38306:19336} (0x7f15035dd7c0) 10.39.61.118:38306: Unknown error 1014 [1014]
[ps-57] [ UNKNOWN][UNKNOWN][F][2026-01-26 09:05:16.457825][169792][ps_server.cc:265] Fail to pull with request_id=198536955979 WK=12 cache_id=0 global_cache_id=18860592829102089 retry_count=1
[ps-57] *** Check failure stack trace: ***
[ps-57] [ UNKNOWN][UNKNOWN][I][2026-01-26 09:05:18.885514][170075][block_pool.cpp:199] Start extend rdma memory 1024MB
[ps-57] MiniDump path: /tmp/fbb13d5d-6993-4de5-80903f9b-281ee912.dmp
------------------------------------------------log end-------------------------------------------------------------------
To Reproduce
The task I run is model training. It uses 11 CPU machines (500C2000G) and 7 GPU machines (380C2200G8GPU).
- Every CPU machine runs 2 emb servers, one per NUMA node.
- Every GPU machine has 8 GPUs, with one dense server, one sparse server, and one worker per GPU.
- Only GPU machines can use RDMA. RDMA is used for communication between workers and emb PS on GPU machines; GDR is used for communication between workers and dense PS on GPU machines.
- TCP is used for communication between workers and emb PS on CPU machines.
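
To make the layout above easier to check, here is a minimal sketch of the transport rule and the process counts it implies. All names (`Machine`, `Server`, `Transport`, `PickTransport`, the `k...` constants) are hypothetical and exist only for illustration; they are not brpc APIs and not code from the training job. Returning TCP for any server on a CPU machine is an assumption drawn from "only GPU machines can use RDMA".

```cpp
// Illustrative only: hypothetical names, not part of brpc or the training job.
enum class Machine { kCpu, kGpu };      // where the parameter server runs
enum class Server { kEmb, kDense };     // kind of parameter server
enum class Transport { kTcp, kRdma, kGdr };

// Transport a worker uses to reach a parameter server, per the list above.
// Assumption: anything hosted on a CPU machine falls back to TCP.
constexpr Transport PickTransport(Machine server_machine, Server server_kind) {
    return server_machine == Machine::kCpu ? Transport::kTcp
         : server_kind == Server::kDense   ? Transport::kGdr
                                           : Transport::kRdma;
}

// Cluster size as described in the report.
constexpr int kCpuMachines = 11;
constexpr int kGpuMachines = 7;
constexpr int kEmbServersPerCpuMachine = 2;  // one per NUMA node
constexpr int kGpusPerGpuMachine = 8;        // one worker per GPU

constexpr int kCpuEmbServers = kCpuMachines * kEmbServersPerCpuMachine;
constexpr int kWorkers = kGpuMachines * kGpusPerGpuMachine;
```

With these numbers there are 22 emb servers on CPU machines reached over TCP, and 56 workers on GPU machines, each of which also talks to PS processes on GPU machines over RDMA or GDR.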