
Conversation

@ggouaillardet
Contributor

Add the new mpi_poll_when_idle and mpi_poll_threshold MCA parameters to
control whether the progress loop should poll() when idle and when
polling should start.

The default is not to poll when idle.

Thanks Paul Kapinos for bringing this to our attention

Signed-off-by: Gilles Gouaillardet [email protected]

@ggouaillardet
Contributor Author

@jsquyres can you please review this?

opal_progress_event_flag));
OPAL_OUTPUT((debug_output, "progress: initialized yield_when_idle to: %s",
opal_progress_yield_when_idle ? "true" : "false"));
OPAL_OUTPUT((debug_output, "progress: initialized poll_when_idle to: %s",
Member

What does it mean to set both poll_when_idle and yield_when_idle to true?

I'm wondering if we should have a higher-level abstraction name -- i.e., use yield_when_idle to mean that the user intends to yield the processor when MPI is idle (regardless of whether we use yield() or some other mechanism). Otherwise, we have 2 mechanisms that do (kinda) the same thing, and one of them may not be supported on all platforms. I think it would be better to have one MCA param that intelligently tries to give up the processor when MPI is "idle".

    } else {
        yield_count = 0;
    }
}
Member

If, per the comment above, we end up squashing both mechanisms into the same MCA variable, you should probably squash these two blocks together (i.e., the yield() and poll() blocks) so that we only need a single if statement (probably with an OPAL_UNLIKELY) gating entry into that block.

/* do we want to call sched_yield() if nothing happened */
bool opal_progress_yield_when_idle = false;
/* do we want to poll() if nothing happened for a while */
uint32_t opal_progress_poll_threshold = 1000;
Member

I'm guessing you chose this number fairly arbitrarily.

Any idea how it impacts real application performance?

Put differently: any idea how long it takes to go through 1,000 cycles of the progress engine in an "idle" MPI process?

My guess is that 1,000 is going to be too low (i.e., we'll go into poll() mode too quickly, which could impact latency-sensitive applications).

@rhc54
Contributor

rhc54 commented Oct 12, 2017

I'm no expert on poll, but I am curious to know how calling poll with a NULL argument will impact the event library.

@ggouaillardet
Contributor Author

ggouaillardet commented Oct 13, 2017

@jsquyres I merged both mechanisms per your comment.

mpirun --mca mpi_yield_when_idle true ...

will simply sched_yield(). In order to start sleeping after a while, just do

mpirun --mca mpi_yield_when_idle true --mca mpi_sleep_when_idle_threshold <value> ...

with a positive value.

@rhc54 I am not sure I fully understand your concern.
As far as I understand, poll(NULL, 0, 1) is just a way to sleep for 1 millisecond.
usleep() could be used (select() could even be used here), but I am not sure it is available on all platforms.
Or are you saying we should return from opal_progress() ASAP and use a libevent timeout instead?
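
For illustration, a minimal sketch of that 1 ms sleep; the helper name is hypothetical and not code from this PR, and usleep() is mentioned only as the alternative discussed above:

#include <poll.h>

/* Hypothetical helper: sleep for roughly 1 ms without watching any
 * file descriptors.  poll() with zero fds simply blocks the calling
 * thread until the timeout expires. */
static inline void idle_sleep_1ms(void)
{
    (void) poll(NULL, 0, 1);   /* no fds, 1 ms timeout */
    /* usleep(1000) would be roughly equivalent where it is available */
}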

@rhc54
Contributor

rhc54 commented Oct 13, 2017

Your understanding matches my own - it was your comments in this PR that caused my confusion. You seemed to imply that somehow we were using the threshold to start polling file descriptors, but that isn't what you were doing at all - you're just sleeping to cause the scheduler to kick us out. It was very confusing.

@ggouaillardet
Contributor Author

OK, I will do some rewording: use "sleep" instead of "poll", and add a note on how sleep is implemented.

Member

@jsquyres jsquyres left a comment

I think we should talk about the algorithm we want. I have a dim recollection that there are some papers on this kind of topic...?

I think that in my head, I was assuming the algorithm would be something like:

  1. If no events are returned after N consecutive (*) calls to opal_progress():
    1. If sched_yield() is available:
      1. For the next additional M consecutive (*) calls to opal_progress() where 0 events are returned, call sched_yield()
      2. For further consecutive (*) calls to opal_progress() where 0 events are returned, call poll() to "sleep"
    2. If sched_yield() is NOT available:
      1. For further consecutive (*) calls to opal_progress() where 0 events are returned, call poll() to "sleep"

Loosely put:

  • if nothing is happening, try calling sched_yield() for a while
  • if nothing continues to happen, switch to calling poll()

(*) Implementing the definition of "consecutive" could be tricky. Specifically: we only want this behavior if opal_progress() is being repeatedly called while waiting for something specific to happen. For example, we don't want this behavior to kick in if the user calls MPI_TEST N (or N+M) times (each of which calls opal_progress() once) and it just happens that no events occur -- OMPI would see N calls to opal_progress(), so on the (N+1)th call it would sched_yield(), and on the (N+M+1)th call it would poll().

Perhaps this functionality should be in some upper-layer function that is looping on calling opal_progress() rather than inside opal_progress() itself?
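
A rough sketch of the spin-then-yield-then-sleep ladder described above, assuming hypothetical threshold variables and a reset hook called when entering/leaving a blocking wait; none of these names come from the PR, and the sched_yield() availability check is omitted:

#include <poll.h>
#include <sched.h>
#include <stdint.h>

/* Illustrative thresholds -- not the MCA parameters from this PR. */
static uint32_t idle_yield_threshold = 10000;   /* N: start yielding   */
static uint32_t idle_sleep_threshold = 100000;  /* N+M: start sleeping */
static uint32_t idle_count = 0;

/* Hypothetical hook called after each progress pass that found
 * num_events events. */
static void idle_backoff(int num_events)
{
    if (num_events > 0) {
        idle_count = 0;               /* something happened: keep spinning */
        return;
    }
    if (++idle_count < idle_yield_threshold) {
        return;                       /* still in the spin phase */
    }
    if (idle_count < idle_sleep_threshold) {
        sched_yield();                /* give up the rest of the time slice */
    } else {
        (void) poll(NULL, 0, 1);      /* "sleep" for ~1 ms */
    }
}

/* Per the "consecutive" caveat above: a blocking loop such as MPI_WAIT
 * would call this on entry/exit so that repeated user-level MPI_TEST
 * calls do not accumulate toward the thresholds. */
static void idle_backoff_reset(void)
{
    idle_count = 0;
}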


ompi_mpi_sleep_when_idle_threshold = -1;
(void) mca_base_var_register("ompi", "mpi", NULL, "sleep_when_idle_threshold",
"Sleep after waiting for MPI communication too long",
Member

It's weird to put "MPI" in the help message for an OPAL MCA var... Can this be re-worded?

Contributor Author

This is basically a copy/paste of yield_when_idle; I fixed both.

(void) mca_base_var_register("ompi", "mpi", NULL, "sleep_when_idle_threshold",
"Sleep after waiting for MPI communication too long",
MCA_BASE_VAR_TYPE_INT, NULL, 0, 0,
OPAL_INFO_LVL_9,
Member

My $0.02: make this level 6.

Contributor Author

Done (and I updated yield_when_idle).

Add the new mpi_sleep_when_idle_threshold MCA parameter.
This is only relevant when mpi_yield_when_idle is set.
A -1 value (the default) means never sleep when idle,
a 0 value means always sleep when idle, and
a positive n value means opal_progress() will sleep after
being invoked n times in a row with no event available.

The default is not to sleep when idle.

Note the sleep functionality is implemented as poll(NULL, 0, 1).

Thanks Paul Kapinos for bringing this to our attention

Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet
Contributor Author

@jsquyres I share the same vision.
Just to be clear, what should the default be (e.g. spin only vs. spin, then yield, then poll)? Should it depend on whether we oversubscribe or not?

You have a good point with respect to MPI_Test, and I guess the same applies to MPI_Iprobe and MPI_Improbe (are we missing any others?)

@jsquyres
Member

@ggouaillardet Yes, perhaps the demarcation line should be exiting the MPI library. I.e., when MPI_TEST (or MPI_IPROBE or MPI_ISEND or ...) returns, the counters -- or whatever measure of "consecutive" we use -- should be reset.

As for what the default should be, I'm not sure. I have dim recollections of some vendor MPI touting the power efficiencies of doing spin-then-yield by default a while ago (which is a dubious claim at best -- if your program is doing nothing for so long that you frequently get into "MPI can yield the processor without harming performance" scenarios, then your program is not efficient to begin with, and any power efficiencies gained by spin-then-yield probably just mean you're wasting less energy than you were before).

This is probably a topic best discussed by the community -- others may have direct experience with this kind of thing. @bosilca @gpapaure @jjhursey @edgargabriel @artpol84 @jladd-mlnx @rhc54 ...etc. -- anyone have an opinion here?

ompi_mpi_yield_when_idle = false;
(void) mca_base_var_register("ompi", "mpi", NULL, "yield_when_idle",
"Yield the processor when waiting for MPI communication (for MPI processes, will default to 1 when oversubscribing nodes)",
"Yield the processor when waiting for communication (for MPI processes, will default to 1 when oversubscribing nodes)",
Member

You still mention MPI in here, and you also mention oversubscription, but the code doesn't check for oversubscription. Per the comments, we're still discussing what the default should be -- I just want to make sure that we don't forget to update the help message when a decision is made.

@rhc54
Contributor

rhc54 commented Oct 16, 2017

I can't speak to the performance issue, but I have seen a vendor make that claim. We have repeatedly gotten questions on the mailing list from users "surprised" to see 100% CPU utilization, and over the years several of us have gone to significant lengths to explain why it isn't an issue.

I'd say just make idling the default so we quit having to explain it, but I'm by no means sold on that position.

@bosilca
Member

bosilca commented Oct 16, 2017

  1. This patch delays the low-priority callbacks (which contribute to the event count) by pausing before giving them a chance to trigger.
  2. It also fails to protect itself in multithreaded scenarios, unlike most of the surrounding code.
  3. Why do we have the MCA parameters at the MPI level, and then add accessors to OPAL so that we can force uncoordinated behavior on the progress engine? Why not keep everything at a consistent level, down in OPAL?

@hjelmn
Member

hjelmn commented Dec 5, 2017

@ggouaillardet I have seen better performance when using nanosleep to put the current thread to sleep. Might be worth looking at that vs poll. The Linux implementation seems pretty fast.
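
For comparison, a minimal sketch of a nanosleep()-based pause; the helper name is hypothetical, and the 1 ms duration simply mirrors the poll(NULL, 0, 1) value discussed above:

#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical alternative to poll(NULL, 0, 1): pause the calling
 * thread for roughly 1 ms using nanosleep(). */
static inline void idle_nanosleep_1ms(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };  /* 1 ms */
    (void) nanosleep(&ts, NULL);
}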

@artpol84
Contributor

artpol84 commented Dec 6, 2017

@ggouaillardet what was the original issue? Can you provide the reference?
We have seen issues with Slurm daemons being starved of CPU when spinning in the direct modex case. In this case, some processes may have everything they need and go off computing, while the remote procs may still be asking for the EP and the local PMIx server is delayed in responding because it is crowded out by the app procs.

@ggouaillardet
Contributor Author

@artpol84 please refer to the thread starting at https://www.mail-archive.com/[email protected]//msg20407.html

Long story short, if we while (...) sched_yield();, then top reports 100% usage even though the system remains very responsive, since the MPI app spends its time yielding. The goal of this PR is to (virtually) stop CPU usage when nothing is happening.
In the case of MPI ParaView (which is an interactive program), MPI tasks spend most of their time in sched_yield(), which can be confusing when top reports 100% usage.

@lanl-ompi
Contributor

Can one of the admins verify this patch?

@gpaulsen
Member

@bosilca do we want this for v5.0?

@bosilca
Member

bosilca commented Aug 30, 2022

The idea is still of interest, but this PR is stale. To summarize, we want a straightforward approach: after X unsuccessful polls we start yielding, and then after another Y unsuccessful polls we nanosleep for a duration Z.

@rhc54
Contributor

rhc54 commented Feb 1, 2023

@ggouaillardet You have several PRs that date back 3-6 years - would it make sense for you to triage them and close the ones not worth rebasing, fixing, resubmitting for review, and finally committing?

@hppritcha
Member

Can this PR be closed? It is almost 8 years stale now.

@jsquyres
Member

jsquyres commented Feb 6, 2025

This PR is old and stale, and would need to be fully refreshed. I think we should close it; someone can open a new PR if they want to advance this idea.

Additionally, I'm a little unconvinced by #4331 (comment) -- there needs to be a distinction between periodically calling MPI_TEST N times (for example) and blocking in something like MPI_WAIT (per #4331 (review)).

@jsquyres jsquyres closed this Feb 6, 2025