Conversation

@msimberg
Collaborator

@msimberg msimberg commented Sep 5, 2025

  • Add a less drastic workaround for hangs (usually performs better, but doesn't work for all applications)
  • Add note about slow intra-node communication and workaround to use NICs for intra-node communication

I'm going through our old lists of known issues. I still have to check whether the other workarounds/issues are relevant, so I'm starting with what we know is still relevant.

@github-actions

github-actions bot commented Sep 5, 2025

preview available: https://docs.tds.cscs.ch/256

Collaborator

@jpcoles-cscs jpcoles-cscs left a comment

The cxil_map error now appears three times, and it is also marked as resolved. I think we can remove the first two occurrences? Sometimes I've had this error when using the wrong combination of libfabric and the aws ofi plugin.
The slow comm part is fine.

@msimberg
Collaborator Author

msimberg commented Sep 5, 2025

The cxil_map error now appears three times, and it is also marked as resolved. I think we can remove the first two occurrences? Sometimes I've had this error when using the wrong combination of libfabric and the aws ofi plugin. The slow comm part is fine.

Good catch! I've removed the one that I just added. That was a copy-paste mistake and wasn't meant to be duplicated at all...

The resolved and the unresolved cxil_map errors are actually not the same: the first was the system misconfiguration, and the second is one that has been spotted "occasionally" even after the system misconfiguration was fixed. But perhaps either the titles could be changed, or the resolved one can be removed altogether now? It's been fixed for a year now.

@bcumming bcumming added this pull request to the merge queue Sep 8, 2025
Merged via the queue into eth-cscs:main with commit 720fd50 Sep 8, 2025
3 checks passed
@msimberg msimberg deleted the more-mpich-issues branch September 10, 2025 13:24