
Conversation

@amastbaum (Contributor)

What?

Added a worker-level throttling mechanism for rndv fragment requests (both pipeline and standalone mtype requests).

Why?

Without throttling, rendezvous protocols can spawn an unbounded number of concurrent fragment requests, each allocating staging memory from GPU memory pools. Under heavy workloads, this leads to GPU memory exhaustion (OOM), as tracked in https://nvbugspro.nvidia.com/bug/5359018. The throttling mechanism bounds the number of in-flight fragments per worker, preventing resource exhaustion while allowing work to progress as fragments complete.

How?

The throttling is implemented using a worker-level flow control state (worker->rndv_ppln_fc) consisting of:

  • An active_frags counter tracking currently in-flight fragment operations
  • A pending_q queue holding throttled requests waiting for resources

When a new fragment request would exceed UCX_RNDV_PPLN_WORKER_MAX_FRAGS, the request is queued in pending_q. When fragments complete, active_frags is decremented and one queued request is rescheduled via ucs_callbackq_add_oneshot() to the main progress loop.
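
For illustration, here is a minimal sketch of the flow-control state and the admission check, using the field names shown in this PR; the struct typedef, the helper name ucp_rndv_fc_try_acquire(), and the exact types are assumptions, not the actual implementation:

/* Illustrative sketch only -- the struct name and field types are assumed;
 * the PR keeps this state in worker->rndv_ppln_fc. */
typedef struct {
    size_t           active_frags; /* fragments currently in flight */
    ucs_queue_head_t pending_q;    /* requests throttled until a slot frees */
} ucp_rndv_ppln_fc_t;

/* Hypothetical admission check: start a new fragment if under the per-worker
 * limit, otherwise park the request on pending_q until a fragment completes. */
static int ucp_rndv_fc_try_acquire(ucp_worker_h worker, ucp_request_t *req,
                                   size_t max_frags)
{
    if (worker->rndv_ppln_fc.active_frags >= max_frags) {
        ucs_queue_push(&worker->rndv_ppln_fc.pending_q,
                       &req->send.rndv.ppln.queue_elem);
        return 0; /* throttled: the caller must not dispatch the fragment yet */
    }

    worker->rndv_ppln_fc.active_frags++;
    return 1; /* slot acquired: the caller may dispatch the fragment */
}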

The mechanism applies to:

  • Pipeline fragment dispatch in rndv_ppln.c
  • Standalone mtype protocols in rndv_rtr.c, rndv_get.c, and rndv_put.c.

@amastbaum amastbaum requested a review from gleon99 December 15, 2025 13:26
@amastbaum amastbaum added the WIP-DNM Work in progress / Do not review label Dec 15, 2025
@gleon99 gleon99 requested a review from shasson5 December 15, 2025 13:32
@amastbaum amastbaum force-pushed the add_throttling_to_rndv_fragment_requests branch from 2be3040 to b175673 December 15, 2025 14:45
@amastbaum amastbaum added Ready for Review and removed WIP-DNM Work in progress / Do not review labels Dec 15, 2025
Comment on lines 214 to 250
/**
 * Increment active_frags counter if worker-level rndv flow control is enabled.
 */
static UCS_F_ALWAYS_INLINE void
ucp_proto_rndv_mtype_fc_increment(ucp_request_t *req)
{
    ucp_worker_h worker = req->send.ep->worker;

    if (worker->context->config.ext.rndv_ppln_worker_fc_enable) {
        worker->rndv_ppln_fc.active_frags++;
    }
}

/**
 * Decrement active_frags counter and reschedule pending request if any.
 */
static UCS_F_ALWAYS_INLINE void
ucp_proto_rndv_mtype_fc_decrement(ucp_request_t *req)
{
    ucp_worker_h worker   = req->send.ep->worker;
    ucp_context_h context = worker->context;

    if (!context->config.ext.rndv_ppln_worker_fc_enable) {
        return;
    }

    ucs_assert(worker->rndv_ppln_fc.active_frags > 0);
    worker->rndv_ppln_fc.active_frags--;

    if (!ucs_queue_is_empty(&worker->rndv_ppln_fc.pending_q)) {
        ucp_request_t *pending_req;
        ucs_queue_elem_t *elem;

        /* Hand one throttled request back to the worker progress loop */
        elem        = ucs_queue_pull(&worker->rndv_ppln_fc.pending_q);
        pending_req = ucs_container_of(elem, ucp_request_t,
                                       send.rndv.ppln.queue_elem);
        ucs_callbackq_add_oneshot(&worker->uct->progress_q, pending_req,
                                  ucp_proto_rndv_mtype_fc_reschedule_cb,
                                  pending_req);
    }
}
@amastbaum (Contributor, Author):

maybe it can be named better, or I should divide it differently
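
As context for the decrement path above: the diff hunk does not show ucp_proto_rndv_mtype_fc_reschedule_cb, so the following is only a hypothetical sketch of what the oneshot callback might do; the re-dispatch call and error handling are assumptions, not the PR's code:

/* Hypothetical sketch, not from the PR: re-dispatch a throttled request from
 * the worker progress loop once a fragment slot has been released. */
static unsigned ucp_proto_rndv_mtype_fc_reschedule_cb(void *arg)
{
    ucp_request_t *req = arg;

    /* Resume the request's send path by invoking its pending/progress
     * callback (assumed entry point; error handling omitted). */
    req->send.uct.func(&req->send.uct);
    return 1;
}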

Comment on lines 391 to 394
{"RNDV_MTYPE_WORKER_MAX_FRAGS", "1024",
"Maximum number of concurrent mtype fragments per worker\n"
"(only applies when RNDV_MTYPE_WORKER_FC_ENABLE=y)",
ucs_offsetof(ucp_context_config_t, rndv_mtype_worker_max_frags), UCS_CONFIG_TYPE_ULUNITS},
@amastbaum (Contributor, Author):

@shasson5 I am wondering if it's better to define RNDV_MTYPE_WORKER_MAX_FRAG_MEM instead, because the user shouldn't know how the amount of available memory is translated to the number of fragments.

The thing is that if we do the translation in the code, the fragment size would have to be defined for each memory type.
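
For illustration, a minimal sketch of the memory-budget alternative discussed above; the option name RNDV_MTYPE_WORKER_MAX_FRAG_MEM comes from the comment, while the helper and the per-memory-type fragment-size lookup are hypothetical:

/* Hypothetical sketch (not in the PR): translate a per-worker staging-memory
 * budget into a fragment count, given a per-memory-type fragment size. */
static size_t ucp_rndv_fc_max_frags(size_t max_frag_mem, size_t frag_size)
{
    /* Round down so the staging pool never exceeds the configured budget */
    return (frag_size == 0) ? 0 : (max_frag_mem / frag_size);
}

With the count-based knob in the current diff, the limit would instead be set directly, e.g. via the UCX_RNDV_MTYPE_WORKER_MAX_FRAGS environment variable (UCX prefixes config table entries with UCX_).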

@gleon99 gleon99 requested a review from nbellalou December 31, 2025 09:43
nbellalou previously approved these changes Jan 1, 2026