Description
Describe your performance question
Hello Team,
When transmitting over RDMA, Mooncake's default configuration is MC_SLICE_SIZE=65536, MC_MAX_WR=256, and NUM_QP_PER_EP=2. With larger KV cache blocks, these settings lead to suboptimal performance. For requests averaging several megabytes, using such a small SLICE_SIZE introduces significant chunking overhead on the CPU.
In our tests, although the bonded NIC has a peak bandwidth of 50 GB/s, Mooncake only achieved 37 GB/s. We therefore tried increasing MC_SLICE_SIZE to 1MB. However, in a bonded network interface environment, although NUM_QP_PER_EP distributes the QPs evenly across the two sub-devices of the bonded port, a single submitTransferTask fails to trigger its internal pipeline: it performs only one submitPostSend using the first context (the one associated with the bonded NIC on the same PCIe bridge). That submitPostSend therefore relies on a single QP for transmission, leaving the bandwidth of the second sub-device entirely unutilized and capping throughput at 23 GB/s.
Expected behavior
When an endpoint has multiple QPs, each submitPostSend call should use all QPs for that batch’s slices, so that:
- Even when a batch is sent in one call (large MC_SLICE_SIZE + default MC_MAX_WR), load is spread across QPs.
- No single QP is overloaded; the aggregate capacity (the sum of per-QP max_wr_depth_) is used.
Suggested fix
- In one submitPostSend call, distribute slices across QPs in round-robin order. We have implemented a prototype based on this approach; with this round-robin distribution, MC_SLICE_SIZE=1048576, and MC_MAX_WR=256, transmission performance improved significantly, reaching 45 GB/s.
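A minimal sketch of the round-robin assignment we are suggesting (the Slice struct and function name here are simplified stand-ins, not Mooncake's actual types):

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for a transfer slice; the real structure carries
// addresses, lkeys/rkeys, and completion bookkeeping.
struct Slice {
    std::size_t offset;
    std::size_t length;
};

// Assign each slice of a batch to a QP index in round-robin order, so that
// one submitPostSend call spreads its work requests across all QPs of the
// endpoint instead of posting the whole batch to the first QP.
std::vector<std::size_t> assignSlicesRoundRobin(const std::vector<Slice> &slices,
                                                std::size_t num_qp) {
    std::vector<std::size_t> qp_index(slices.size());
    for (std::size_t i = 0; i < slices.size(); ++i)
        qp_index[i] = i % num_qp;
    return qp_index;
}
```

With NUM_QP_PER_EP=2, slices 0, 2, 4, ... land on the first QP and 1, 3, 5, ... on the second, so both sub-devices of the bonded port stay busy even when the whole batch is posted in a single call.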