mtl/psm2: do not set PSM2_DEVICES env variable #1578
The PSM2 MTL was setting the PSM2_DEVICES environment variable to self,shm for single-node jobs. This causes single-node/single-process jobs to fail on some Omni-Path systems. Not setting this environment variable fixes these issues.

This fix is needed as part of the bring-up of Omni-Path clusters at several customer sites.

Fixes issue open-mpi#1559

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha The psm2 shared memory performance is already pretty bad. Does this make it worse?
Doubtful. But Intel says that if you ask for HFI on single-node jobs, you probably can't run as many ranks as cores, probably for BWL or the like. Iterating with Intel on this. Meanwhile, TOSS took the patch for now.
Hi @hppritcha,
@matcabral Thanks for the detailed answer. This helps a lot. Given this, though, can you explain why it's not good to get rid of this PSM2_DEVICES env setting in the psm2 MTL?
@hppritcha,
Closing this. Looks like a better fix is available.
Reopening in case it may be useful for @matcabral.
@hppritcha, with fixes in place for both of the problems reported in #1159 (the root-cause fix), there is no need for this PR. However, it would still be convenient to check the environment before setting PSM2_DEVICES. I will send a new PR for that. Thanks,
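For reference, a minimal sketch of what "check the environment before setting PSM2_DEVICES" could look like. This is not the actual psm2 MTL code: the function name `psm2_devices_default_sketch` and the `single_node_job` flag are hypothetical, and it uses the plain POSIX `getenv`/`setenv` calls rather than whatever helpers the MTL uses internally. The only behavior taken from this thread is the `self,shm` value for single-node jobs.

```c
#include <stdlib.h>

/*
 * Hypothetical helper (illustration only, not Open MPI code).
 *
 * For a single-node job, default PSM2_DEVICES to "self,shm", but only
 * when the user has not already set the variable. Returns 0 when the
 * variable was set or deliberately left untouched, -1 if setenv() fails.
 */
static int psm2_devices_default_sketch(int single_node_job)
{
    if (!single_node_job) {
        /* Multi-node job: leave the environment alone. */
        return 0;
    }

    if (getenv("PSM2_DEVICES") != NULL) {
        /* The user chose a value explicitly; respect it. */
        return 0;
    }

    /* The third argument (0) also tells setenv() not to overwrite. */
    return setenv("PSM2_DEVICES", "self,shm", 0);
}
```

With this shape, a user who exports `PSM2_DEVICES` before launching keeps that setting, while jobs that do not set it get the single-node default.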
Test passed.
Closing this PR, fixed elsewhere.
@matcabral
@rhc54 (copying you just for fyi)