Skip to content

Conversation

@michal-shalev
Copy link

What?

This PR randomize the order in which endpoints (EPs) are created in the mca_spml_ucx_add_procs function. Each new EP is placed at a random position instead of cyclic order.

Why?

Recently, a customer raised concerns about incast behavior during SHMEM quiet operations, where many EPs communicating in a fixed order could lead to network congestion and performance degradation. They believe that randomizing the order of EPs could reduce the likelihood of incast collisions.
The customer is interested in testing a patch to address this.

How?

Used the Fisher-Yates shuffle algorithm to randomize indices and modified the loop to handle all indices, including 0, within a single pass.

@github-actions
Copy link

github-actions bot commented Nov 4, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

b734683: spml/ucx: shuffle EPs creation

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions
Copy link

github-actions bot commented Nov 4, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

4d562ac: spml/ucx: shuffle EPs creation

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions
Copy link

github-actions bot commented Nov 4, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

2345c1f: spml/ucx: shuffle EPs creation

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions
Copy link

github-actions bot commented Nov 4, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

489dbcf: spml/ucx: shuffle EPs creation

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions
Copy link

github-actions bot commented Nov 4, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

eb303bf: spml/ucx: shuffle EPs creation

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions
Copy link

github-actions bot commented Nov 4, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

25c6b86: spml/ucx: shuffle EPs creation

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@github-actions
Copy link

github-actions bot commented Nov 4, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

893a73e: spml/ucx: shuffle EPs creation

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Signed-off-by: Michal Shalev <mshalev.nvidia.com>
Copy link
Contributor

@tvegas1 tvegas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments

}
}

indices = malloc(nprocs * sizeof(int));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: sizeof(*indices)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/* Get the EP connection requests for all the processes from modex */
for (n = 0; n < nprocs; ++n) {
i = (my_rank + n) % nprocs;
for (i = nprocs - 1; i >= 0; --i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe perform randomization as a separate function/loop

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is to iterate over the EPs once, and to save another iteration

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok for me, but i guess one extra loop on sequential memory on ep creation would not impact perf

&mca_spml_ucx_ctx_default.ucp_peers[i].ucp_conn);
&mca_spml_ucx_ctx_default.ucp_peers[indices[i]].ucp_conn);
if (UCS_OK != err) {
SPML_UCX_ERROR("ucp_ep_create(proc=%zu/%zu) failed: %s", n, nprocs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also update log line index (i and indices[i])

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice I'm using proc_index instead of i.
I changed n to proc_index, why indices[proc_index]? the loop still iterates until nproc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah right, already up-to-date. we could add indices[proc_index] to the log along with iteration since it is randomized.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Michal Shalev <mshalev.nvidia.com>
Signed-off-by: Michal Shalev <mshalev.nvidia.com>
ucp_address_t **wk_local_addr;
unsigned int *wk_addr_len;
ucp_ep_params_t ep_params;
int *indices;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsigned int?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please notice proc_index >= 0;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly..

Signed-off-by: Michal Shalev <mshalev.nvidia.com>
Signed-off-by: Michal Shalev <mshalev.nvidia.com>
@gleon99
Copy link
Contributor

gleon99 commented Mar 2, 2025

The customer is not responsive (e-mail discussion),

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants