
Conversation

@ggouaillardet
Contributor

Add the new mpi_poll_when_idle and mpi_poll_threshold MCA parameters to
control whether the progress loop should poll() when idle and when
polling should start.

The default is not to poll when idle.

Thanks Paul Kapinos for bringing this to our attention

Signed-off-by: Gilles Gouaillardet [email protected]

@ggouaillardet
Contributor Author

@jsquyres can you please review this?

opal_progress_event_flag));
OPAL_OUTPUT((debug_output, "progress: initialized yield_when_idle to: %s",
opal_progress_yield_when_idle ? "true" : "false"));
OPAL_OUTPUT((debug_output, "progress: initialized poll_when_idle to: %s",
Member

What does it mean to set both poll_when_idle and yield_when_idle to true?

I'm wondering if we should have a higher-level abstraction name -- i.e., use yield_when_idle to mean that the user intends to yield the processor when MPI is idle (regardless of whether we use yield() or some other mechanism). Otherwise, we have 2 mechanisms that do (kinda) the same thing, and one of them may not be supported on all platforms. I think it would be better to have one MCA param that intelligently tries to give up the processor when MPI is "idle".

    } else {
        yield_count = 0;
    }
}
Member

If, per the comment above, we end up squashing both mechanisms into the same MCA variable, you should probably squash these two blocks together (i.e., the yield() and poll() blocks) so that we only need a single if statement (probably with an OPAL_UNLIKELY) gating entry into that block.

/* do we want to call sched_yield() if nothing happened */
bool opal_progress_yield_when_idle = false;
/* do we want to poll() if nothing happened for a while */
uint32_t opal_progress_poll_threshold = 1000;
Member

I'm guessing you chose this number fairly arbitrarily.

Any idea how it impacts real application performance?

Put differently: any idea how long it takes to go through 1,000 cycles of the progress engine in an "idle" MPI process?

My guess is that 1,000 is going to be too low (i.e., we'll go into poll() mode too quickly, which could impact latency-sensitive applications).

@rhc54
Contributor

rhc54 commented Oct 12, 2017

I'm no expert on poll, but I am curious to know how calling poll with a NULL argument will impact the event library.

@ggouaillardet
Contributor Author

ggouaillardet commented Oct 13, 2017

@jsquyres I merged both mechanisms per your comment.

mpirun --mca mpi_yield_when_idle true ...

will simply sched_yield(). In order to start sleeping after a while, just do

mpirun --mca mpi_yield_when_idle true --mca mpi_sleep_when_idle_threshold <value> ...

with a positive value.

@rhc54 I am not sure I fully understand your concern.
As far as I understand, poll(NULL, 0, 1) is just a way to sleep for 1 millisecond.
usleep() could be used (select() could even be used here), but I am not sure it is available on all platforms.
Or are you saying we should return from opal_progress() ASAP and use a libevent timeout instead?
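
For illustration, a minimal sketch of that 1 ms sleep; the helper name is hypothetical and not code from this PR, and usleep() is mentioned only as the alternative discussed above:

#include <poll.h>

/* Hypothetical helper: sleep for roughly 1 ms without watching any
 * file descriptors.  poll() with zero fds simply blocks the calling
 * thread until the timeout expires. */
static inline void idle_sleep_1ms(void)
{
    (void) poll(NULL, 0, 1);   /* no fds, 1 ms timeout */
    /* usleep(1000) would be roughly equivalent where it is available */
}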

@rhc54
Contributor

rhc54 commented Oct 13, 2017

Your understanding matches my own - it was your comments in this PR that caused my confusion. You seemed to imply that somehow we were using the threshold to start polling file descriptors, but that isn't what you were doing at all - you're just sleeping to cause the scheduler to kick us out. It was very confusing.

@ggouaillardet
Contributor Author

OK, I will do some rewording: use "sleep" instead of "poll", and add a note on how sleep is implemented.

Member

@jsquyres jsquyres left a comment

I think we should talk about the algorithm we want. I have a dim recollection that there are some papers on this kind of topic...?

I think that in my head, I was assuming the algorithm would be something like:

  1. If no events are returned after N consecutive (*) calls to opal_progress():
    1. If sched_yield() is available:
      1. For the next additional M consecutive (*) calls to opal_progress() where 0 events are returned, call sched_yield()
      2. For further consecutive (*) calls to opal_progress() where 0 events are returned, call poll() to "sleep"
    2. If sched_yield() is NOT available:
      1. For further consecutive (*) calls to opal_progress() where 0 events are returned, call poll() to "sleep"

Loosely put:

  • if nothing is happening, try calling sched_yield() for a while
  • if nothing continues to happen, switch to calling poll()

(*) Implementing the definition of "consecutive" could be tricky. Specifically: we only want this behavior if opal_progress() is being repeatedly called while waiting for something specific to happen. For example, we don't want this behavior to kick in if the user calls MPI_TEST N (or N+M) times (each of which calls opal_progress() once) and it just happens that no events occur -- OMPI would see N calls to opal_progress(), so on the (N+1)th call it would sched_yield(), and on the (N+M+1)th call it would poll().

Perhaps this functionality should be in some upper-layer function that is looping on calling opal_progress() rather than inside opal_progress() itself?
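
A rough sketch of the spin-then-yield-then-sleep ladder described above, assuming hypothetical threshold variables and a reset hook called when entering/leaving a blocking wait; none of these names come from the PR, and the sched_yield() availability check is omitted:

#include <poll.h>
#include <sched.h>
#include <stdint.h>

/* Illustrative thresholds -- not the MCA parameters from this PR. */
static uint32_t idle_yield_threshold = 10000;   /* N: start yielding   */
static uint32_t idle_sleep_threshold = 100000;  /* N+M: start sleeping */
static uint32_t idle_count = 0;

/* Hypothetical hook called after each progress pass that found
 * num_events events. */
static void idle_backoff(int num_events)
{
    if (num_events > 0) {
        idle_count = 0;               /* something happened: keep spinning */
        return;
    }
    if (++idle_count < idle_yield_threshold) {
        return;                       /* still in the spin phase */
    }
    if (idle_count < idle_sleep_threshold) {
        sched_yield();                /* give up the rest of the time slice */
    } else {
        (void) poll(NULL, 0, 1);      /* "sleep" for ~1 ms */
    }
}

/* Per the "consecutive" caveat above: a blocking loop such as MPI_WAIT
 * would call this on entry/exit so that repeated user-level MPI_TEST
 * calls do not accumulate toward the thresholds. */
static void idle_backoff_reset(void)
{
    idle_count = 0;
}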


ompi_mpi_sleep_when_idle_threshold = -1;
(void) mca_base_var_register("ompi", "mpi", NULL, "sleep_when_idle_threshold",
"Sleep after waiting for MPI communication too long",
Member

It's weird to put "MPI" in the help message for an OPAL MCA var... Can this be re-worded?

Contributor Author

This is basically a copy/paste of yield_when_idle; I fixed both.

(void) mca_base_var_register("ompi", "mpi", NULL, "sleep_when_idle_threshold",
"Sleep after waiting for MPI communication too long",
MCA_BASE_VAR_TYPE_INT, NULL, 0, 0,
OPAL_INFO_LVL_9,
Member

My $0.02: make this level 6.

Contributor Author

Done (and I updated yield_when_idle).

Add the new mpi_sleep_when_idle_threshold MCA parameter.
This is only relevant when mpi_yield_when_idle is set.
A -1 value (the default) means never sleep when idle,
a 0 value means always sleep when idle, and
a positive n value means opal_progress() will sleep after
being invoked n times in a row with no event available.

The default is not to sleep when idle.

Note the sleep functionality is implemented as poll(NULL, 0, 1).

Thanks Paul Kapinos for bringing this to our attention

Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet
Contributor Author

@jsquyres I share the same vision.
Just to be clear, what should the default be (e.g. spin only vs. spin, then yield, then poll)? Should it depend on whether we oversubscribe or not?

You have a good point with respect to MPI_Test, and I guess the same applies to MPI_Iprobe and MPI_Improbe (are we missing any others?)

@jsquyres
Member

@ggouaillardet Yes, perhaps the demarcation line should be exiting the MPI library. I.e., when MPI_TEST (or MPI_IPROBE or MPI_ISEND or ...) returns, the counters -- or whatever measure of "consecutive" we use -- should be reset.

As for what the default should be, I'm not sure. I have dim recollections of some vendor MPI touting the power efficiencies of doing spin-then-yield by default a while ago (which is a dubious claim at best -- if your program is doing nothing for so long that you frequently get into "MPI can yield the processor without harming performance" scenarios, then your program is not efficient to begin with, and any power efficiencies gained by spin-then-yield probably just mean you're wasting less energy than you were before).

This is probably a topic best discussed by the community -- others may have direct experience with this kind of thing. @bosilca @gpapaure @jjhursey @edgargabriel @artpol84 @jladd-mlnx @rhc54 ...etc. -- anyone have an opinion here?

ompi_mpi_yield_when_idle = false;
(void) mca_base_var_register("ompi", "mpi", NULL, "yield_when_idle",
"Yield the processor when waiting for MPI communication (for MPI processes, will default to 1 when oversubscribing nodes)",
"Yield the processor when waiting for communication (for MPI processes, will default to 1 when oversubscribing nodes)",
Member

You still mention MPI in here, and you also mention oversubscription, but the code doesn't check for oversubscription. Per the comments, we're still discussing what the default should be -- I just want to make sure that we don't forget to update the help message when a decision is made.

@rhc54
Contributor

rhc54 commented Oct 16, 2017

I can't speak to the performance issue, but I have seen a vendor make that claim. We have repeatedly gotten questions on the mailing list from users "surprised" to see 100% CPU utilization, and over the years several of us have gone to significant lengths to explain why it isn't an issue.

I'd say just make idling the default so we quit having to explain it, but I'm by no means sold on that position.

@bosilca
Member

bosilca commented Oct 16, 2017

  1. This patch delays the low-priority callbacks (which contribute to the event count) by pausing before giving them a chance to trigger.
  2. It also fails to protect itself in multithreaded scenarios, unlike most of the surrounding code.
  3. Why do we have the MCA parameters at the MPI level, and then add accessors to OPAL so that we can force uncoordinated behavior on the progress engine? Why not keep everything at a consistent level, down in OPAL?

@hjelmn
Member

hjelmn commented Dec 5, 2017

@ggouaillardet I have seen better performance when using nanosleep to put the current thread to sleep. Might be worth looking at that vs poll. The Linux implementation seems pretty fast.
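
For comparison, a minimal sketch of a nanosleep()-based pause; the helper name is hypothetical, and the 1 ms duration simply mirrors the poll(NULL, 0, 1) value discussed above:

#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical alternative to poll(NULL, 0, 1): pause the calling
 * thread for roughly 1 ms using nanosleep(). */
static inline void idle_nanosleep_1ms(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };  /* 1 ms */
    (void) nanosleep(&ts, NULL);
}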

@artpol84
Contributor

artpol84 commented Dec 6, 2017

@ggouaillardet what was the original issue? Can you provide the reference?
We have seen issues with Slurm daemons being starved of CPU when spinning in the direct modex case. In this case, some processes may have everything they need and go off computing, while the remote procs may still be asking for the EP and the local PMIx server is delayed in responding because it is crowded out by the app procs.

@ggouaillardet
Contributor Author

@artpol84 please refer to the thread starting at https://www.mail-archive.com/[email protected]//msg20407.html

Long story short, if we while (...) sched_yield();, then top reports 100% usage even though the system remains very responsive, since the MPI app spends its time yielding. The goal of this PR is to (virtually) stop CPU usage when nothing is happening.
In the case of MPI ParaView (which is an interactive program), MPI tasks spend most of their time in sched_yield(), which can be confusing when top reports 100% usage.

@lanl-ompi
Contributor

Can one of the admins verify this patch?

@gpaulsen
Member

@bosilca do we want this for v5.0?

@bosilca
Member

bosilca commented Aug 30, 2022

The idea is still of interest, but this PR is stale. To summarize, we want a straightforward approach: after X unsuccessful polls we start yielding, and then after another Y unsuccessful polls we nanosleep for a duration Z.

@rhc54
Contributor

rhc54 commented Feb 1, 2023

@ggouaillardet You have several PRs that date back 3-6 years - would it make sense for you to triage them and close the ones not worth rebasing, fixing, resubmitting for review, and finally committing?

@hppritcha
Member

Can this PR be closed? It is almost 8 years stale now.

@jsquyres
Member

jsquyres commented Feb 6, 2025

This PR is old and stale, and would need to be fully refreshed. I think we should close it; someone can open a new PR if they want to advance this idea.

Additionally, I'm a little unconvinced by #4331 (comment) -- there needs to be a distinction between periodically calling MPI_TEST N times (for example) and blocking in something like MPI_WAIT (per #4331 (review)).

@jsquyres jsquyres closed this Feb 6, 2025