
Conversation

@amastbaum (Contributor)

What?

Added a worker-level throttling mechanism for rndv fragment requests (both pipeline and standalone mtype requests).

Why?

Without throttling, rendezvous protocols can spawn an unbounded number of concurrent fragment requests, each allocating staging memory from GPU memory pools. Under heavy workloads, this leads to GPU memory exhaustion (OOM), as tracked in https://nvbugspro.nvidia.com/bug/5359018. The throttling mechanism bounds the number of in-flight fragments per worker, preventing resource exhaustion while allowing work to progress as fragments complete.

How?

The throttling is implemented using a worker-level flow control state (worker->rndv_ppln_fc) consisting of:

  • An active_frags counter tracking currently in-flight fragment operations
  • A pending_q queue holding throttled requests waiting for resources

When a new fragment request would exceed UCX_RNDV_PPLN_WORKER_MAX_FRAGS, the request is queued in pending_q. When fragments complete, active_frags is decremented and one queued request is rescheduled via ucs_callbackq_add_oneshot() to the main progress loop.
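
For illustration, here is a minimal sketch of the flow-control state and the admission check, using the field names shown in this PR; the struct typedef, the helper name ucp_rndv_fc_try_acquire(), and the exact types are assumptions, not the actual implementation:

/* Illustrative sketch only -- the struct name and field types are assumed;
 * the PR keeps this state in worker->rndv_ppln_fc. */
typedef struct {
    size_t           active_frags; /* fragments currently in flight */
    ucs_queue_head_t pending_q;    /* requests throttled until a slot frees */
} ucp_rndv_ppln_fc_t;

/* Hypothetical admission check: start a new fragment if under the per-worker
 * limit, otherwise park the request on pending_q until a fragment completes. */
static int ucp_rndv_fc_try_acquire(ucp_worker_h worker, ucp_request_t *req,
                                   size_t max_frags)
{
    if (worker->rndv_ppln_fc.active_frags >= max_frags) {
        ucs_queue_push(&worker->rndv_ppln_fc.pending_q,
                       &req->send.rndv.ppln.queue_elem);
        return 0; /* throttled: the caller must not dispatch the fragment yet */
    }

    worker->rndv_ppln_fc.active_frags++;
    return 1; /* slot acquired: the caller may dispatch the fragment */
}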

The mechanism applies to:

  • Pipeline fragment dispatch in rndv_ppln.c
  • Standalone mtype protocols in rndv_rtr.c, rndv_get.c, and rndv_put.c.

@amastbaum amastbaum requested a review from gleon99 December 15, 2025 13:26
@amastbaum amastbaum added the WIP-DNM Work in progress / Do not review label Dec 15, 2025
@gleon99 gleon99 requested a review from shasson5 December 15, 2025 13:32
@amastbaum amastbaum force-pushed the add_throttling_to_rndv_fragment_requests branch from 2be3040 to b175673 December 15, 2025 14:45
@amastbaum amastbaum added Ready for Review and removed WIP-DNM Work in progress / Do not review labels Dec 15, 2025
Comment on lines 214 to 250
/**
 * Increment active_frags counter if worker-level rndv flow control is enabled.
 */
static UCS_F_ALWAYS_INLINE void
ucp_proto_rndv_mtype_fc_increment(ucp_request_t *req)
{
    ucp_worker_h worker = req->send.ep->worker;

    if (worker->context->config.ext.rndv_ppln_worker_fc_enable) {
        worker->rndv_ppln_fc.active_frags++;
    }
}

/**
 * Decrement active_frags counter and reschedule pending request if any.
 */
static UCS_F_ALWAYS_INLINE void
ucp_proto_rndv_mtype_fc_decrement(ucp_request_t *req)
{
    ucp_worker_h worker   = req->send.ep->worker;
    ucp_context_h context = worker->context;

    if (!context->config.ext.rndv_ppln_worker_fc_enable) {
        return;
    }

    ucs_assert(worker->rndv_ppln_fc.active_frags > 0);
    worker->rndv_ppln_fc.active_frags--;

    if (!ucs_queue_is_empty(&worker->rndv_ppln_fc.pending_q)) {
        ucp_request_t *pending_req;
        ucs_queue_elem_t *elem;

        /* Hand one throttled request back to the worker progress loop */
        elem        = ucs_queue_pull(&worker->rndv_ppln_fc.pending_q);
        pending_req = ucs_container_of(elem, ucp_request_t,
                                       send.rndv.ppln.queue_elem);
        ucs_callbackq_add_oneshot(&worker->uct->progress_q, pending_req,
                                  ucp_proto_rndv_mtype_fc_reschedule_cb,
                                  pending_req);
    }
}
@amastbaum (Contributor, Author):

maybe it can be named better, or I should divide it differently
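
As context for the decrement path above: the diff hunk does not show ucp_proto_rndv_mtype_fc_reschedule_cb, so the following is only a hypothetical sketch of what the oneshot callback might do; the re-dispatch call and error handling are assumptions, not the PR's code:

/* Hypothetical sketch, not from the PR: re-dispatch a throttled request from
 * the worker progress loop once a fragment slot has been released. */
static unsigned ucp_proto_rndv_mtype_fc_reschedule_cb(void *arg)
{
    ucp_request_t *req = arg;

    /* Resume the request's send path by invoking its pending/progress
     * callback (assumed entry point; error handling omitted). */
    req->send.uct.func(&req->send.uct);
    return 1;
}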

Comment on lines 391 to 394
{"RNDV_MTYPE_WORKER_MAX_FRAGS", "1024",
"Maximum number of concurrent mtype fragments per worker\n"
"(only applies when RNDV_MTYPE_WORKER_FC_ENABLE=y)",
ucs_offsetof(ucp_context_config_t, rndv_mtype_worker_max_frags), UCS_CONFIG_TYPE_ULUNITS},
@amastbaum (Contributor, Author):

@shasson5 I am wondering if it's better to define RNDV_MTYPE_WORKER_MAX_FRAG_MEM instead, because the user shouldn't know how the amount of available memory is translated to the number of fragments.

The thing is that if we do the translation in the code, the fragment size would have to be defined for each memory type.
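
For illustration, a minimal sketch of the memory-budget alternative discussed above; the option name RNDV_MTYPE_WORKER_MAX_FRAG_MEM comes from the comment, while the helper and the per-memory-type fragment-size lookup are hypothetical:

/* Hypothetical sketch (not in the PR): translate a per-worker staging-memory
 * budget into a fragment count, given a per-memory-type fragment size. */
static size_t ucp_rndv_fc_max_frags(size_t max_frag_mem, size_t frag_size)
{
    /* Round down so the staging pool never exceeds the configured budget */
    return (frag_size == 0) ? 0 : (max_frag_mem / frag_size);
}

With the count-based knob in the current diff, the limit would instead be set directly, e.g. via the UCX_RNDV_MTYPE_WORKER_MAX_FRAGS environment variable (UCX prefixes config table entries with UCX_).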

@gleon99 gleon99 requested a review from nbellalou December 31, 2025 09:43
nbellalou previously approved these changes Jan 1, 2026