UCP/RNDV: Throttle rndv fragment requests #11062
base: master
Conversation
```c
ucp_proto_rndv_mtype_fc_increment(ucp_request_t *req)
{
    ucp_worker_h worker = req->send.ep->worker;

    if (worker->context->config.ext.rndv_ppln_worker_fc_enable) {
        worker->rndv_ppln_fc.active_frags++;
    }
}

/**
 * Decrement active_frags counter and reschedule pending request if any.
 */
static UCS_F_ALWAYS_INLINE void
ucp_proto_rndv_mtype_fc_decrement(ucp_request_t *req)
{
    ucp_worker_h worker   = req->send.ep->worker;
    ucp_context_h context = worker->context;

    if (!context->config.ext.rndv_ppln_worker_fc_enable) {
        return;
    }

    ucs_assert(worker->rndv_ppln_fc.active_frags > 0);
    worker->rndv_ppln_fc.active_frags--;

    if (!ucs_queue_is_empty(&worker->rndv_ppln_fc.pending_q)) {
        ucp_request_t *pending_req;
        ucs_queue_elem_t *elem;

        elem        = ucs_queue_pull(&worker->rndv_ppln_fc.pending_q);
        pending_req = ucs_container_of(elem, ucp_request_t,
                                       send.rndv.ppln.queue_elem);
        ucs_callbackq_add_oneshot(&worker->uct->progress_q, pending_req,
                                  ucp_proto_rndv_mtype_fc_reschedule_cb,
                                  pending_req);
    }
}
```
Maybe it can be named better, or I should divide it differently.
src/ucp/core/ucp_context.c (outdated)
```c
{"RNDV_MTYPE_WORKER_MAX_FRAGS", "1024",
 "Maximum number of concurrent mtype fragments per worker\n"
 "(only applies when RNDV_MTYPE_WORKER_FC_ENABLE=y)",
 ucs_offsetof(ucp_context_config_t, rndv_mtype_worker_max_frags),
 UCS_CONFIG_TYPE_ULUNITS},
```
@shasson5 I am wondering if it would be better to define RNDV_MTYPE_WORKER_MAX_FRAG_MEM instead, because the user shouldn't have to know how the amount of available memory translates into a number of fragments.
The catch is that if we do the translation in the code, it has to be defined for each memory type (per-mem_type fragment size).
What?
Added a worker-level throttling mechanism for rndv fragment requests (both pipeline and standalone mtype requests).
Why?
Without throttling, rendezvous protocols can spawn an unbounded number of concurrent fragment requests, each allocating staging memory from GPU memory pools. Under heavy workloads, this leads to GPU memory exhaustion (OOM), as tracked in https://nvbugspro.nvidia.com/bug/5359018. The throttling mechanism bounds the number of in-flight fragments per worker, preventing resource exhaustion while allowing work to progress as fragments complete.
How?
The throttling is implemented using a worker-level flow control state (`worker->rndv_ppln_fc`) consisting of:

- an `active_frags` counter tracking currently in-flight fragment operations
- a `pending_q` queue holding throttled requests waiting for resources

When a new fragment request would exceed `UCX_RNDV_PPLN_WORKER_MAX_FRAGS`, the request is queued in `pending_q`. When fragments complete, `active_frags` is decremented and one queued request is rescheduled via `ucs_callbackq_add_oneshot()` to the main progress loop.

The mechanism applies to:

- the pipeline protocol in `rndv_ppln.c`
- `mtype` protocols in `rndv_rtr.c`, `rndv_get.c`, and `rndv_put.c`