Skip to content

Conversation

@wenduwan
Copy link
Contributor

This PR addresses #12233

Since 5.0.x, we introduced an optional FI_HMEM capability in ofi provider selection logic(both mtl and btl) in order to support accelerator memory. As described in the issue, this introduced a bug that can cause the wrong ofi provider to be selected, even if the user explicitly includes/excludes the provider name.

This change refactors the selection logic to correctly handle the include/exclude list, and therefore fixes the bug.

Signed-off-by: Wenduo Wang [email protected]
(cherry picked from commit 29efcef)

This PR addresses open-mpi#12233

Since 5.0.x, we introduced an optional FI_HMEM capability in ofi provider
selection logic(both mtl and btl) in order to support accelerator memory.
As described in the issue, this introduced a bug that can cause the wrong
ofi provider to be selected, even if the user explicitly includes/excludes
the provider name.

This change refactors the selection logic to correctly handle the
include/exclude list, and therefore fixes the bug.

Signed-off-by: Wenduo Wang <[email protected]>
(cherry picked from commit 29efcef)
@shijin-aws
Copy link
Contributor

shijin-aws commented Jan 19, 2024

why don't you backport the FI_REMOTE_COMM addition?

@wenduwan
Copy link
Contributor Author

why don't you backport the FI_REMOTE_COMM addition?

@shijin-aws Added thanks for reminder

@wenduwan
Copy link
Contributor Author

@hppritcha @shijin-aws I removed FI_REMOTE_COMM commit due to #12255

@wenduwan
Copy link
Contributor Author

@tmh97 Tom did you get a chance to verify this change on main branch? We later discovered an issue with commit 3a09492 - we are reverting it. Please exclude it from your test.

@wenduwan wenduwan merged commit 20b79bf into open-mpi:v5.0.x Jan 22, 2024
@tmh97
Copy link

tmh97 commented Jan 22, 2024

@wenduwan giving it a look now! Thanks

@tmh97
Copy link

tmh97 commented Jan 22, 2024

@wenduwan Verified this change! OPX provider is found without issue, I tested using OSU MB's, and with the following options, which are typically preferred for OPX runs --mca mtl ofi --mca btl self,vader

@wenduwan
Copy link
Contributor Author

@tmh97 Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants