Description
Describe your performance question
Hello Team,
When transmitting over RDMA, Mooncake's default configuration is MC_SLICE_SIZE=65536, MC_MAX_WR=256, and NUM_QP_PER_EP=2. With larger KV cache blocks, these settings lead to suboptimal performance. For requests averaging several megabytes, using such a small SLICE_SIZE introduces significant chunking overhead on the CPU.
In our tests, although the bonded NIC has a peak bandwidth of 50 GB/s, Mooncake only achieved 37 GB/s. We therefore tried increasing MC_SLICE_SIZE to 1MB. However, in a bonded network interface environment, although NUM_QP_PER_EP distributes the QPs evenly across the two sub-devices of the bonded port, a single submitTransferTask fails to trigger its internal pipeline: it performs only one submitPostSend using the first context (the one associated with the bonded NIC on the same PCIe bridge). That submitPostSend therefore relies on a single QP for transmission, leaving the bandwidth of the second sub-device entirely unutilized and capping throughput at 23 GB/s.
Expected behavior
When an endpoint has multiple QPs, each submitPostSend call should use all QPs for that batch’s slices, so that:
- Even when a batch is sent in one call (large MC_SLICE_SIZE + default MC_MAX_WR), load is spread across QPs.
- No single QP is overloaded; the aggregate capacity (the sum of per-QP max_wr_depth_) is used.
Suggested fix
- In one submitPostSend call, distribute slices across QPs in round-robin order. We have implemented a prototype based on this approach; with this round-robin distribution, MC_SLICE_SIZE=1048576, and MC_MAX_WR=256, transmission performance improved significantly, reaching 45 GB/s.
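A minimal sketch of the round-robin assignment we are suggesting (the Slice struct and function name here are simplified stand-ins, not Mooncake's actual types):

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for a transfer slice; the real structure carries
// addresses, lkeys/rkeys, and completion bookkeeping.
struct Slice {
    std::size_t offset;
    std::size_t length;
};

// Assign each slice of a batch to a QP index in round-robin order, so that
// one submitPostSend call spreads its work requests across all QPs of the
// endpoint instead of posting the whole batch to the first QP.
std::vector<std::size_t> assignSlicesRoundRobin(const std::vector<Slice> &slices,
                                                std::size_t num_qp) {
    std::vector<std::size_t> qp_index(slices.size());
    for (std::size_t i = 0; i < slices.size(); ++i)
        qp_index[i] = i % num_qp;
    return qp_index;
}
```

With NUM_QP_PER_EP=2, slices 0, 2, 4, ... land on the first QP and 1, 3, 5, ... on the second, so both sub-devices of the bonded port stay busy even when the whole batch is posted in a single call.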