[WIP] Maint: attempt to mitigate MPITimeout on CI #1242
Open
asoplata wants to merge 1 commit into jonescompneurolab:master from
This is a shallow attempt to see whether a small change mentioned in jonescompneurolab#774 (comment) is enough to improve the odds of our unit test runners passing the particularly problematic MPI test that keeps failing. The only change is increasing the timeout in `parallel_backends.py::_get_data_from_child_err` from 0.01 to 0.05 seconds. This gives an `mpi_child` process a window five times longer in which to return its data, if it has any, during an MPI simulation. As far as I understand it (which is only a little at the moment), this is the main way our MPI child processes communicate actual simulation results to the main process.

When I tested this locally on my own computer (after reducing other timeout values elsewhere in the code), it seemed to substantially help more MPI simulations complete successfully.

I don't know what negative impacts, if any, this change would have, but since it only increases a time window for inter-process communication from 10 milliseconds to 50, it probably has none.
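To illustrate why a wider window might help, here is a minimal sketch. It is not the actual `parallel_backends.py` code: it assumes the parent reads child output from a queue with a timeout, uses a thread as a stand-in for the `mpi_child` process, and all names (`read_with_timeout`, `slow_child`, `try_once`) are hypothetical.

```python
import queue
import threading
import time


def read_with_timeout(data_q, timeout):
    """Return the next item from data_q, or None if nothing arrives in time."""
    try:
        return data_q.get(timeout=timeout)
    except queue.Empty:
        return None


def slow_child(data_q, delay, payload):
    """Stand-in for a child process that emits its data after `delay` seconds."""
    time.sleep(delay)
    data_q.put(payload)


def try_once(timeout, delay):
    """Start one hypothetical child and make a single read attempt."""
    data_q = queue.Queue()
    worker = threading.Thread(target=slow_child, args=(data_q, delay, "sim-data"))
    worker.start()
    result = read_with_timeout(data_q, timeout)
    worker.join()  # let the stand-in child finish before returning
    return result


if __name__ == "__main__":
    # A child that needs about 30 ms misses a 10 ms window
    # but would normally fit inside a 50 ms one.
    print("timeout=0.01:", try_once(timeout=0.01, delay=0.03))
    print("timeout=0.05:", try_once(timeout=0.05, delay=0.03))
```

In this sketch a single read attempt either gets the data or gives up, so a child that is only slightly late loses out under the shorter timeout. Whether the real code retries, and how often, is not something this sketch captures.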
Still failed