Conversation

@iyastreb (Contributor) commented Nov 10, 2025

What?

This is a continuation of the request-handling optimization effort started in #982
This PR optimizes two things:

  • Release all pending requests (except one) right after posting
  • Remove the nixlUcxIntReq class and move connection management to nixlUcxBackendH

Performance results

In nixlbench, post time decreases roughly 10x for RDMA with a batch of 64k messages of 512 B each.
PR1: #982
PR2: this PR

NIXLBENCH
nixlbench --initiator_seg_type=VRAM --target_seg_type=VRAM --start_block_size=512 --max_block_size=512 --start_batch_size=64000 --max_batch_size=64000 --warmup_iter=10 --num_iter=100 --progress_threads=8 &
 
# Num_threads=8 512:64k cuda_ipc
Branch  Block Size (B)      Batch Size     B/W (GB/Sec)   Avg Lat. (us)  Avg Prep (us)  P99 Prep (us)  Avg Post (us)
main    512                 65000          0.122929       4.2            6022.0         6022.0         121784.0
PR1     512                 65000          0.135798       3.8            6433.0         6433.0         103406.5
PR2     512                 64000          0.124752       4.1            8171.0         8171.0         114922.7
 
# Num_threads=8 512:64k rdma
Branch  Block Size (B)      Batch Size     B/W (GB/Sec)   Avg Lat. (us)  Avg Prep (us)  P99 Prep (us)  Avg Post (us)
main    512                 65000          1.932672       0.3            5540.0         5540.0         13065.2
PR1     512                 65000          2.787826       0.2            5583.0         5583.0         8764.7
PR2     512                 64000          6.710469       0.1            5871.0         5871.0         895.1

SGLANG TTFT
Size  MC     main   PR1    PR2 
1     28     25     22     20
2     47     54     45     45
4     125    129    115    111 
8     243    378    346    332
16    442    699    647    522
32    831    1433   1001   904 
64    1364   2032   1853   1804
128   2583   3869   3611   3060 
256   5232   6618   6083   5501
512   10469  12805  12440  10170
1024  22521  24990  22800  20392

@github-actions

👋 Hi iyastreb! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@iyastreb (Author)

/build

@rakhmets (Contributor)

/build

if (__builtin_expect(result.req != nullptr, 1)) {
    ucp_request_free(result.req);  // release the previously stored pending request
}
result.req = req;                  // keep only the most recent pending request
Contributor

How do we use this request? It can be returned to the memory pool by UCX at any moment after the free.

Contributor Author

As you can see, we don't use the freed request at all.
Instead, the idea is to keep the LAST pending (or incomplete) request. When we detect that the current request is still pending, we free the previously stored pending request (since we now have a more recent one) and remember the recent one.

Later we use this last pending request in the "waiting for completion" stage (checkXfer/status) in order to:

  • detect whether the request has completed
  • handle errors

In both cases the request is returned back to UCX, either in status() -> worker->reqRelease() or in release() -> worker->reqRelease()
