spml/ucx: shuffle EPs creation #12907
Hello! The Git Commit Checker CI bot found a few problems with this PR: b734683: spml/ucx: shuffle EPs creation
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
b734683 to
4d562ac
Compare
4d562ac to
2345c1f
Compare
2345c1f to
489dbcf
Compare
489dbcf to
eb303bf
Compare
eb303bf to
25c6b86
Compare
25c6b86 to
893a73e
Compare
893a73e to
00610cc
Compare
Signed-off-by: Michal Shalev <mshalev.nvidia.com>
00610cc to
03c59cd
Compare
minor comments
oshmem/mca/spml/ucx/spml_ucx.c
Outdated
    }
    }

    indices = malloc(nprocs * sizeof(int));
minor: sizeof(*indices)
done
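A minimal sketch of the `sizeof(*indices)` suggestion above. The helper name `alloc_indices` is hypothetical and not part of the patch; it only illustrates that sizing the allocation from the pointer keeps it correct if the element type of `indices` changes later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not from the PR): allocate an index array with
 * nprocs entries. sizeof(*indices) follows the element type of the
 * pointer, so the expression does not need to change if the type does. */
static int *alloc_indices(size_t nprocs)
{
    int *indices;

    assert(nprocs > 0);
    indices = malloc(nprocs * sizeof(*indices));
    return indices; /* NULL on allocation failure; the caller must check */
}
```

The caller owns the returned buffer and releases it with `free()`.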
oshmem/mca/spml/ucx/spml_ucx.c
Outdated
  /* Get the EP connection requests for all the processes from modex */
- for (n = 0; n < nprocs; ++n) {
-     i = (my_rank + n) % nprocs;
+ for (i = nprocs - 1; i >= 0; --i) {
maybe perform randomization as a separate function/loop
The idea here is to iterate over the EPs once, and to save another iteration
ok for me, but i guess one extra loop on sequential memory on ep creation would not impact perf
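For comparison, the "separate function" alternative raised in this thread could look roughly like the sketch below. The name `shuffle_indices` is hypothetical, and `rand()` is assumed as the random source; the thread does not show which generator the patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical standalone helper (not from the PR): fill indices[0..n-1]
 * with 0..n-1, then shuffle them in place with Fisher-Yates. This costs
 * one extra pass over the array before any endpoint is created. */
static void shuffle_indices(int *indices, size_t n)
{
    size_t i, j;
    int tmp;

    assert(indices != NULL || n == 0);

    for (i = 0; i < n; ++i) {
        indices[i] = (int)i;
    }

    /* Walk from the last slot down; slot i-1 receives a random element
     * chosen from the not-yet-fixed prefix [0, i). */
    for (i = n; i > 1; --i) {
        j = (size_t)rand() % i; /* 0 <= j < i; rand() % i has a slight modulo bias */
        tmp = indices[i - 1];
        indices[i - 1] = indices[j];
        indices[j] = tmp;
    }
}
```

The creation loop would then iterate over `indices` in plain ascending order, at the cost of the additional pass mentioned above.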
oshmem/mca/spml/ucx/spml_ucx.c
Outdated
-     &mca_spml_ucx_ctx_default.ucp_peers[i].ucp_conn);
+     &mca_spml_ucx_ctx_default.ucp_peers[indices[i]].ucp_conn);
  if (UCS_OK != err) {
      SPML_UCX_ERROR("ucp_ep_create(proc=%zu/%zu) failed: %s", n, nprocs,
also update log line index (i and indices[i])
Notice I'm using proc_index instead of i.
I changed n to proc_index, why indices[proc_index]? the loop still iterates until nproc.
ah right, already up-to-date. we could add indices[proc_index] to the log along with iteration since it is randomized.
done
Signed-off-by: Michal Shalev <mshalev.nvidia.com>
Signed-off-by: Michal Shalev <mshalev.nvidia.com>
    ucp_address_t **wk_local_addr;
    unsigned int *wk_addr_len;
    ucp_ep_params_t ep_params;
    int *indices;
unsigned int?
Please notice proc_index >= 0;
Exactly..
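The exchange above turns on the loop condition `proc_index >= 0`: with an unsigned counter that comparison is always true, so a descending loop written that way never terminates. The sketch below uses hypothetical function names and contrasts the signed form with a common unsigned-safe idiom.

```c
#include <assert.h>
#include <stddef.h>

/* Signed counter: i >= 0 becomes false once i is decremented past 0,
 * so the loop visits n-1, ..., 1, 0 and then stops. */
static int count_down_signed(int n)
{
    int i;
    int visited = 0;

    assert(n >= 0);
    for (i = n - 1; i >= 0; --i) {
        ++visited;
    }
    return visited;
}

/* Unsigned counter: "i >= 0" would always hold, so test before
 * decrementing instead. The body still sees n-1, ..., 1, 0. */
static size_t count_down_unsigned(size_t n)
{
    size_t i;
    size_t visited = 0;

    for (i = n; i-- > 0; ) {
        ++visited;
    }
    return visited;
}
```

Either form visits every index including 0. The signed form matches the `>= 0` condition quoted in the review.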
Signed-off-by: Michal Shalev <mshalev.nvidia.com>
Signed-off-by: Michal Shalev <mshalev.nvidia.com>
c421b4a to
2ad0f14
Compare
The customer is not responsive (e-mail discussion),
What?
This PR randomizes the order in which endpoints (EPs) are created in the mca_spml_ucx_add_procs function. Each new EP is placed at a random position instead of in cyclic order.
Why?
Recently, a customer raised concerns about incast behavior during SHMEM quiet operations, where many EPs communicating in a fixed order could lead to network congestion and performance degradation. They believe that randomizing the order of EPs could reduce the likelihood of incast collisions.
The customer is interested in testing a patch to address this.
How?
The patch uses the Fisher-Yates shuffle algorithm to randomize the indices, and the loop is modified to handle all indices, including 0, within a single pass.
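The single-pass approach described here can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: `visit_in_random_order`, `visit_fn` and `count_visit` are hypothetical names, the callback stands in for the per-peer endpoint creation, and `rand()` is assumed as the random source.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical callback type standing in for per-peer work
 * (in the PR this would be the endpoint creation for one peer). */
typedef void (*visit_fn)(int peer, void *arg);

/* Example callback for demonstration: count how often each peer is seen. */
static void count_visit(int peer, void *arg)
{
    int *seen = arg;
    seen[peer]++;
}

/* Descending Fisher-Yates fused with the per-peer work. After the swap
 * at step i, slot i is final and is never touched again, so it can be
 * consumed immediately. The loop runs down to i == 0, so index 0 is
 * handled in the same pass. */
static void visit_in_random_order(int *indices, int nprocs,
                                  visit_fn visit, void *arg)
{
    int i, j, tmp;

    assert(indices != NULL && nprocs > 0);

    for (i = 0; i < nprocs; ++i) {
        indices[i] = i;
    }

    for (i = nprocs - 1; i >= 0; --i) {
        j = rand() % (i + 1); /* 0 <= j <= i; slight modulo bias */
        tmp = indices[i];
        indices[i] = indices[j];
        indices[j] = tmp;
        visit(indices[i], arg);
    }
}
```

Every peer is visited exactly once, and the visiting order depends on the seed passed to `srand()`. Note that the identity fill is still a separate pass over the array; only the shuffle and the per-peer work share a loop.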