Skip to content
This repository was archived by the owner on Sep 30, 2022. It is now read-only.

Conversation

@hjelmn
Copy link
Member

@hjelmn hjelmn commented May 21, 2016

This commit changes the behavior of bml/r2 from conditionally
registering btl progress functions to always registering progress
functions. Any progress function beloning to a btl that is not yet in
use is registered as low-priority. As soon as a proc is added that
will make use of the btl is is re-registered normally.

This works around an issue with some btls. In order to progress a
first message from an unknown peer both ugni and openib need to have
their progress functions called. If either btl is not in use after the
first call to add_procs the callback was never happening. This commit
ensures the btl progress function is called at some point but the
number of progress callbacks is reduced from normal to ensure lower
overhead when a btl is not used. The current ratio is 1 low priority
progress callback for every 8 calls to opal_progress().

Fixes open-mpi/ompi#1676

Signed-off-by: Nathan Hjelm [email protected]

(cherry picked from commit open-mpi/ompi@a679cc0)

:bot🏷️bug
:bot:assign: @bosilca
:bot:milestone:v2.0.0

This commit changes the behavior of bml/r2 from conditionally
registering btl progress functions to always registering progress
functions. Any progress function beloning to a btl that is not yet in
use is registered as low-priority. As soon as a proc is added that
will make use of the btl is is re-registered normally.

This works around an issue with some btls. In order to progress a
first message from an unknown peer both ugni and openib need to have
their progress functions called. If either btl is not in use after the
first call to add_procs the callback was never happening. This commit
ensures the btl progress function is called at some point but the
number of progress callbacks is reduced from normal to ensure lower
overhead when a btl is not used. The current ratio is 1 low priority
progress callback for every 8 calls to opal_progress().

Fixes open-mpi/ompi#1676

Signed-off-by: Nathan Hjelm <[email protected]>

(cherry picked from commit open-mpi/ompi@a679cc0)
@ompiteam-bot ompiteam-bot added this to the v2.0.0 milestone May 21, 2016
@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1683/ for details.

@jsquyres
Copy link
Member

@hjelmn Let's talk about this on the call tomorrow. In general, I think your approach is correct, but I wouldn't mind seeing the frequency go a bit lower if this really is only for CM kinds of things (e.g., 1 out of every 64 or 128 runs).

@bosilca Per your comment (open-mpi/ompi#1677 (comment)), would you be happier if the frequency is a bit lower?

@jsquyres jsquyres added blocker and removed blocker labels May 23, 2016
@bosilca
Copy link
Member

bosilca commented May 23, 2016

The concept is what we need, but I wonder if there is not a cheaper way to do this. What if we let the BTLs explicitly register their progress function instead of forcing the BML to do it ? If peers exists then the progress will be registered as a recurrent event, otherwise we can have them registered as timed events with a predefined interval.

@jsquyres
Copy link
Member

Per discussion 2016-05-24 teleconf: we'll take this. We might want to improve it after v2.0.0.

@jsquyres
Copy link
Member

Per discussion 2016-05-24 teleconf: I was a little too quick with the prior comment. @hjelmn will add an MCA param to govern the frequency, and default to 1 out of 8 runs. There will be work on master to improve this: open-mpi/ompi#1695.

opal_progress_lp_call_ratio = 8;
ret = mca_base_var_register("opal", "opal", NULL, "progress_lp_call_ratio",
"Ratio of calls to high-priority to low-priority progress "
"functions. Higher numbers decrease the low-priority callback "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor tweak: "Higher numbers decrease the frequency of the callback rate."

@lanl-ompi
Copy link
Contributor

Test FAILed.

@ibm-ompi
Copy link

Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.

@mellanox-github
Copy link

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1702/ for details.

@lanl-ompi
Copy link
Contributor

Test FAILed.

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1703/ for details.

@hjelmn
Copy link
Member Author

hjelmn commented May 27, 2016

:bot:nolabel:pushed-back

@jsquyres
Copy link
Member

Per 2016-05-31 We're good with this once @hjelmn changes the "if it's not a power of 2" check to make it error out.

hjelmn added 2 commits June 2, 2016 09:35
This commit adds a MCA variable to control how often low-priority
callbacks are made relative to high-priority callbacks. This was added
per discussion on telcon on May 24, 2016.

Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes a bug in opal progress registration that can cause
crashes when a progress function is registered while another thread is
in opal_progress(). Before this commit realloc is used to allocate
more space for progress functions but it is possible for a thread in
opal_progress() to try to read from the array that is freed by realloc
before the array is re-assigned when realloc returns. To prevent this
race use malloc + memcpy to fill the new array and atomically swap out
the old and new array pointers.

Per suggestion we now allocate a default of 8 slots for callbacks and
double the current number when we run out of space.

This commit also fixes leaking the callbacks_lp array.

Signed-off-by: Nathan Hjelm <[email protected]>

(cherry picked from open-mpi/ompi@2fad3b9)

Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn
Copy link
Member Author

hjelmn commented Jun 2, 2016

@jsquyres Made it error out on a bad parameter. Also added the new master commit to make the registration truly thread safe since it is more related to this that it is to the request rework (which will trigger the bug this commit fixes).

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1733/ for details.

@jsquyres
Copy link
Member

jsquyres commented Jun 7, 2016

@hjelmn Will cherry pick one more commit to fix some warnings

This commit fixes several warning introduced by
open-mpi/ompi@fc26d9c .

Signed-off-by: Nathan Hjelm <[email protected]>

(cherry picked from open-mpi/ompi@e082ed7)

Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn
Copy link
Member Author

hjelmn commented Jun 7, 2016

Added the warnings fix. Should be good to go. @bosilca Please review.

@mellanox-github
Copy link

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1754/ for details.

@bosilca
Copy link
Member

bosilca commented Jun 7, 2016

👍

@jsquyres
Copy link
Member

jsquyres commented Jun 9, 2016

bot:retest

@ibm-ompi
Copy link

ibm-ompi commented Jun 9, 2016

Test FAILED

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

@ibm-ompi
Copy link

ibm-ompi commented Jun 9, 2016

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

@ibm-ompi
Copy link

ibm-ompi commented Jun 9, 2016

Test FAILED

Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.

@ibm-ompi
Copy link

ibm-ompi commented Jun 9, 2016

Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.

@jjhursey
Copy link
Member

jjhursey commented Jun 9, 2016

IBM failure were due to an internal script failure. It should be resolved now.
bot:retest

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1762/ for details.

@jsquyres
Copy link
Member

jsquyres commented Jun 9, 2016

@hppritcha This should now be good to go.

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1763/ for details.

@hppritcha hppritcha merged commit bd6fbe7 into open-mpi:v2.x Jun 11, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants