Fail to ibv_post_send: Cannot allocate memory

**Describe the bug**

After applying https://github.com/apache/brpc/pull/3145, I still get the error. The error details are as follows:

-------------------------------------------------log start----------------------------------------------------------------
[ps-57] [        UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.372264][169925][rdma_endpoint.cpp:895] Fail to ibv_post_send: Cannot allocate memory, window=1, sq_current=16
[ps-57] [        UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.372387][169925][socket.cpp:1841] Fail to keep-write into Socket{id=861 fd=787 addr=10.39.61.118:55530:19336} (0x7f152db0ec40): Cannot allocate memory [12]
[ps-57] [        UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.494266][170049][rdma_endpoint.cpp:578] Fail to read Hello Message from client:brpc::Socket{id=1055 fd=787 addr=10.39.61.118:38306:19336} (0x7f15035dd7c0) 10.39.61.118:38306: Unknown error 1014 [1014]
[ps-57] [        UNKNOWN][UNKNOWN][F][2026-01-26 09:05:16.457825][169792][ps_server.cc:265] Fail to pull with request_id=198536955979 WK=12 cache_id=0 global_cache_id=18860592829102089 retry_count=1
[ps-57] *** Check failure stack trace: ***
[ps-57] [        UNKNOWN][UNKNOWN][I][2026-01-26 09:05:18.885514][170075][block_pool.cpp:199] Start extend rdma memory 1024MB
[ps-57] MiniDump path: /tmp/fbb13d5d-6993-4de5-80903f9b-281ee912.dmp
------------------------------------------------log end-------------------------------------------------------------------



**To Reproduce**
The task I run is model training. The cluster has 11 CPU machines (500C2000G) and 7 GPU machines (380C2200G8GPU).
1) Every CPU machine runs 2 emb servers, one per NUMA node.
2) Every GPU machine has 8 GPUs, with one dense server, one sparse server, and one worker per GPU.
3) Only the GPU machines can use RDMA. RDMA is used for communication between workers and the emb PS on GPU machines; GDR is used for communication between workers and the dense PS on GPU machines.
4) TCP is used for communication between workers and the emb PS on CPU machines.
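
To make the setup easier to check, here is a rough Python sketch of the topology and transport rules listed above. It is only an illustration, not code from the training job: the names `process_counts` and `transport` are hypothetical, and it assumes that the "emb PS on GPU machines" in item 3 are the sparse servers from item 2.

```python
# Counts taken from the description above.
CPU_MACHINES = 11
GPU_MACHINES = 7
EMB_SERVERS_PER_CPU_MACHINE = 2  # one emb server per NUMA node
GPUS_PER_GPU_MACHINE = 8         # one dense server, one sparse server, one worker per GPU


def process_counts() -> dict:
    """Total number of processes of each kind in the cluster."""
    gpus = GPU_MACHINES * GPUS_PER_GPU_MACHINE
    return {
        "cpu_emb_servers": CPU_MACHINES * EMB_SERVERS_PER_CPU_MACHINE,
        "workers": gpus,
        "dense_servers": gpus,
        # Assumption: these are the "emb PS on GPU machines" from item 3.
        "sparse_servers": gpus,
    }


def transport(server_kind: str, server_on_gpu_machine: bool) -> str:
    """Transport a worker uses to reach a server, following items 3 and 4."""
    if server_kind == "dense" and server_on_gpu_machine:
        return "GDR"
    if server_kind == "emb" and server_on_gpu_machine:
        return "RDMA"
    if server_kind == "emb" and not server_on_gpu_machine:
        return "TCP"
    raise ValueError(f"combination not described in the issue: {server_kind!r}")


if __name__ == "__main__":
    print(process_counts())
    print(transport("emb", True), transport("dense", True), transport("emb", False))
```

The failing `ibv_post_send` in the log is on an RDMA connection, so under this reading it is on the worker to emb PS path inside the GPU machines.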

