Fail to ibv_post_send: Cannot allocate memory #3202

@randomkang

Describe the bug

After applying #3145, I still get the error. The details are as follows:

-------------------------------------------------log start----------------------------------------------------------------
[ps-57] [ UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.372264][169925][rdma_endpoint.cpp:895] Fail to ibv_post_send: Cannot allocate memory, window=1, sq_current=16
[ps-57] [ UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.372387][169925][socket.cpp:1841] Fail to keep-write into Socket{id=861 fd=787 addr=10.39.61.118:55530:19336} (0x7f152db0ec40): Cannot allocate memory [12]
[ps-57] [ UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.494266][170049][rdma_endpoint.cpp:578] Fail to read Hello Message from client:brpc::Socket{id=1055 fd=787 addr=10.39.61.118:38306:19336} (0x7f15035dd7c0) 10.39.61.118:38306: Unknown error 1014 [1014]
[ps-57] [ UNKNOWN][UNKNOWN][F][2026-01-26 09:05:16.457825][169792][ps_server.cc:265] Fail to pull with request_id=198536955979 WK=12 cache_id=0 global_cache_id=18860592829102089 retry_count=1
[ps-57] *** Check failure stack trace: ***
[ps-57] [ UNKNOWN][UNKNOWN][I][2026-01-26 09:05:18.885514][170075][block_pool.cpp:199] Start extend rdma memory 1024MB
[ps-57] MiniDump path: /tmp/fbb13d5d-6993-4de5-80903f9b-281ee912.dmp
------------------------------------------------log end-------------------------------------------------------------------

To Reproduce
The task I run is model training. It uses 11 CPU machines (500C2000G) and 7 GPU machines (380C2200G8GPU).

  1. Every CPU machine runs 2 emb servers, one per NUMA node;
  2. Every GPU machine has 8 GPUs, with one dense server, one sparse server, and one worker per GPU;
  3. Only GPU machines can use RDMA. RDMA is used for communication between workers and emb PS on GPU machines; GDR is used for communication between workers and dense PS on GPU machines;
  4. TCP is used for communication between workers and emb PS on CPU machines.