-
Notifications
You must be signed in to change notification settings - Fork 930
optionally passive wait when the progress loop is idle for a while #4331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jsquyres can you please review this ? |
opal/runtime/opal_progress.c
Outdated
| opal_progress_event_flag)); | ||
| OPAL_OUTPUT((debug_output, "progress: initialized yield_when_idle to: %s", | ||
| opal_progress_yield_when_idle ? "true" : "false")); | ||
| OPAL_OUTPUT((debug_output, "progress: initialized poll_when_idle to: %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean to set both poll_when_idle and yield_when_idle to true?
I'm wondering if we should have a higher-level abstraction name -- i.e., use yield_when_idle to mean that the user intends to yield the processor when MPI is idle (regardless of whether we use yield() or some other mechanism). Otherwise, we have 2 mechanisms that do (kinda) the same thing, and one of them may not be supported on all platforms. I think it would be better to have one MCA param that intelligently tries to give up the processor when MPI is "idle".
| } else { | ||
| yield_count = 0; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If, per the comment above, we end up squashing both mechanisms into the same MCA variable, you should probably squash these two blocks together (i.e., the yield() and poll() blocks) so that we only have to have a single if statement (probably should be with an OPAL_UNLIKELY) gating entrance into that block.
opal/runtime/opal_progress.c
Outdated
| /* do we want to call sched_yield() if nothing happened */ | ||
| bool opal_progress_yield_when_idle = false; | ||
| /* do we want to poll() if nothing happened for a while */ | ||
| uint32_t opal_progress_poll_threshold = 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing you chose this number fairly arbitrarily.
Any idea how it impacts real application performance?
Put differently: any idea how long it takes to go through 1,000 cycles of the progress engine in an "idle" MPI process?
My guess is that 1,000 is going to be too low (i.e., we'll go into poll() mode too quickly, which could impact latency-sensitive applications).
|
I'm no expert on |
1c6ae86 to
27b83fd
Compare
|
@jsquyres i merged both mechanisms per your comment will simply with a positive value. @rhc54 i am not sure i fully understand your concern. |
|
Your understanding matches my own - it was your comments in this PR that caused my confusion. You seemed to imply that somehow we were using the threshold to start polling file descriptors, but that isn't what you were doing at all - you're just sleeping to cause the scheduler to kick us out. It was very confusing. |
|
ok, i will do some rewording use |
27b83fd to
0bd283f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should talk about the algorithm we want. I have a dim recollection that there are some papers on this kind of topic...?
I think that in my head, I was assuming the algorithm would be something like:
- If no events are returned after N consecutive (*) calls to
opal_progress()- If
sched_yield()is available:- For the next additional M consecutive (*) calls to
opal_progress()where 0 events are returned, callsched_yield() - Further consecutive (*) calls to
opal_progress()where 0 events are returned callpoll()to "sleep"
- For the next additional M consecutive (*) calls to
- If
sched_yield()is NOT available:- Further consecutive (*) calls to
opal_progress()where 0 events are returned callpoll()to "sleep"
- Further consecutive (*) calls to
- If
Loosely put:
- if nothing is happening, try calling
sched_yield()for a while - if nothing continues to happen, switch to calling
poll()
(*) The implementation of the definition of "consecutive" could be tricky. Specifically: we only want this behavior if opal_progress() is being repeatedly called while waiting for something specific to happen. For example, we don't want this behavior to occur if the user calls MPI_TEST N (or (N+M)) times (each of which will call opal_progress() once) and it just happens that no events occur -- but OMPI will see N calls to opal_progress() and therefore the (N+1)th call, it'll sched_yield(), and will call poll() on the (N+M+1)th time.
Perhaps this functionality should be in some upper-layer function that is looping on calling opal_progress() rather than inside opal_progress() itself?
ompi/runtime/ompi_mpi_params.c
Outdated
|
|
||
| ompi_mpi_sleep_when_idle_threshold = -1; | ||
| (void) mca_base_var_register("ompi", "mpi", NULL, "sleep_when_idle_threshold", | ||
| "Sleep after waiting for MPI communication too long", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's weird to put "MPI" in the help message for an OPAL MCA var... Can this be re-worded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is basically a copy/paste of yield_when_idle, i fixed both
ompi/runtime/ompi_mpi_params.c
Outdated
| (void) mca_base_var_register("ompi", "mpi", NULL, "sleep_when_idle_threshold", | ||
| "Sleep after waiting for MPI communication too long", | ||
| MCA_BASE_VAR_TYPE_INT, NULL, 0, 0, | ||
| OPAL_INFO_LVL_9, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My $0.02: make this level 6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done (and i updated yield_when_idle)
Add the new mpi_sleep_when_idle_threshold MCA parameter. This is only relevant when mpi_yield_when_idle is set. -1 value (default) means never pool when idle 0 value means always sleep when idle a positive n value means means opal_progress() will sleep after being invoked n times in a row and no event was available. The default is not to sleep when idle. Note the sleep funcitonality is implemented as poll(NULL, 0, 1) Thanks Paul Kapinos for bringing this to our attention Signed-off-by: Gilles Gouaillardet <[email protected]>
0bd283f to
61f34e6
Compare
|
@jsquyres i share the same vision. you have a good point with respect to |
|
@ggouaillardet Yes, perhaps the demarkation line should be exiting the MPI library. I.e., when MPI_TEST (or MPI_IPROBE or MPI_ISEND or ...) returns, the counters -- or whatever measures of "contiguous" we use -- should be reset. As for what the default should be, I'm not sure. I have dim recollections of some vendor MPI touting the power efficiencies of doing spin-then-yield by default a while ago (which is a dubious claim at best -- if your program is doing nothing for so long such that you frequently get into "MPI can yield the processor without harming performance" scenarios, then your program is not efficient to begin with, and therefore any power efficiencies gained by spin-then-yield probably mean that you're only wasting less energy than you were before). This is probably a topic best discussed by the community -- others may have direct experience with this kind of thing. @bosilca @gpapaure @jjhursey @edgargabriel @artpol84 @jladd-mlnx @rhc54 ...etc. -- anyone have an opinion here? |
| ompi_mpi_yield_when_idle = false; | ||
| (void) mca_base_var_register("ompi", "mpi", NULL, "yield_when_idle", | ||
| "Yield the processor when waiting for MPI communication (for MPI processes, will default to 1 when oversubscribing nodes)", | ||
| "Yield the processor when waiting for communication (for MPI processes, will default to 1 when oversubscribing nodes)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still mention MPI in here, and also mention oversubscription (but the code doesn't check for oversubscription (per comments, we're still discussing what the default should be -- I just want to make sure that we don't forget to update the help message when a decision is made).
|
I can't speak to the performance issue, but I have seen a vendor make that claim. We have repeatedly gotten questions raised on the mailing list when users are "surprised" to see 100% cpu utilization, and several of us have gone to significant lengths to explain why it isn't an issue over the years. I'd say just make it the default to idle so we quit having to explain it, but I'm not by any means sold on that position. |
|
|
@ggouaillardet I have seen better performance when using nanosleep to put the current thread to sleep. Might be worth looking at that vs poll. The Linux implementation seems pretty fast. |
|
@ggouaillardet what was the original issue? Can you provide the reference? |
|
@artpol84 please refer to the thread starting at https://www.mail-archive.com/[email protected]//msg20407.html long story short, if we |
|
Can one of the admins verify this patch? |
|
@bosilca do we want this for v5.0? |
|
The idea is still of interest, but this PR is stale. If I summarize, we want to have a straightforward approach: after |
|
@ggouaillardet You have several PRs that date back 3-6 years - would it make sense for you to triage them and close the ones not worth rebasing, fixing, resubmitting for review, and finally committing? |
|
can this PR be closed? this is almost 8 years stale now. |
|
This PR is old and stale, and would need to be fully refreshed. I think we should close it; someone can open a new PR if they want to advance this idea. Additionally, I'm a little unconvinced by #4331 (comment) -- there needs to be a distinction between periodically calling |
Add the new mpi_poll_when_idle and mpi_poll_threhold MCA parameters to
control if and when the progress loop should poll() when idle and
when polling should start.
The default is not to poll when idle.
Thanks Paul Kapinos for bringing this to our attention
Signed-off-by: Gilles Gouaillardet [email protected]