[WIP] Maint: attempt to mitigate MPITimeout on CI #1242

Open

asoplata wants to merge 1 commit into jonescompneurolab:master from asoplata:fix/mpi-timeout-mitigation-attempt-2

Conversation

@asoplata
Collaborator

This is a shallow attempt to see whether the small change mentioned in #774 (comment) is enough to improve the odds of our Unit Test runners passing the particularly problematic MPI test that keeps failing. The only change it makes is increasing the timeout of `parallel_backends.py::_get_data_from_child_err` from 0.01 to 0.05 seconds. This greatly widens the time window within which an `mpi_child` process must return its data during an MPI simulation, if it has any. As far as I understand it (which is only a little bit at the moment), this is the main way our MPI child processes communicate actual simulation results back to the main process.

When I did some local testing on my own computer (after reducing other timeout values elsewhere in the code), this change seemed to substantially increase the number of MPI simulations that completed successfully.

I don't know what negative impacts, if any, this change might have, but since it only increases an inter-process communication window from 10 milliseconds to 50, I expect them to be negligible.
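
To make the timeout's role concrete, here is a minimal, hypothetical sketch of the general pattern (not the actual `parallel_backends.py` implementation): the parent polls the child's stderr pipe, waiting up to the timeout on each pass for new data before concluding nothing more is coming. The names `read_child_data` and `child_stderr_fd` are illustrative only.

```python
import os
import select


def read_child_data(child_stderr_fd, timeout=0.05):
    """Poll the child's stderr file descriptor and collect any data sent.

    Each pass through the loop waits up to ``timeout`` seconds for the
    child to write something; if nothing arrives within that window, the
    loop gives up. Raising the timeout from 0.01 s to 0.05 s widens that
    per-iteration window fivefold.
    """
    chunks = []
    while True:
        # Wait up to `timeout` seconds for the pipe to become readable.
        ready, _, _ = select.select([child_stderr_fd], [], [], timeout)
        if not ready:
            # Nothing arrived within the timeout window; assume the child
            # has no (more) data and stop waiting.
            break
        data = os.read(child_stderr_fd, 4096)
        if not data:
            # EOF: the child closed its end of the pipe.
            break
        chunks.append(data)
    return b"".join(chunks)
```

Under this kind of polling scheme, a too-short timeout can cause the parent to stop listening while a slow or heavily loaded child (e.g. on a CI runner) is still serializing its results, which is consistent with the intermittent MPI failures described above.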

@asoplata
Collaborator Author

Still failed
