Conversation
Hail Mary to try and get things passing.
This reverts commit 2b48a8d.
This was causing a hang during MPI_Finalize.
That's weird. Is it really only happening for serial jobs, or are those just the ones where ...
Yeah, it genuinely is only for serial jobs. The timeouts for n=2 are a legit deadlock, with a helpful traceback and everything. The worst failure mode I observed was the TSFC tests (serial, and take ~3 mins) timing out after 2 hours even though the logs said that everything had succeeded. The hang was in the teardown.
JHopeCollins
left a comment
Ok. I still think it's bizarre that this is only happening with the serial tests, but seeing as the only failing tests are the hang on nprocs=2 and the stochastic Stokes convergence failure we've been seeing, let's get this merged and see if it helps.
* Do not use `mpiexec -n 1` for serial tests. This was causing a hang during MPI_Finalize.
* Modify timeouts.
After some exhausting debugging I think I identified the source of the latest hangs.
The issue was introduced in #4391. It turns out that `mpiexec -n 1 pytest ...` can hang, whereas `mpiexec -n 2 pytest ...` doesn't! To find this I had to SSH into the runner and `gdb` into the hanging process, where it was spinning in `ompi_finalize`. The fix is therefore just to call `pytest` instead of `mpiexec -n 1 pytest`.

I think that this took so long to find because (a) the error is stochastic, and (b) we have also been getting timeouts due to thermal throttling/oversubscription. I've therefore also increased the timeouts for some steps so we shouldn't mix up slowdowns with hanging.
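For illustration, the launcher selection boils down to something like the sketch below (not the actual CI script; the test paths and function name are hypothetical). The point is simply that the serial case never goes through `mpiexec`:

```shell
#!/bin/sh
# Sketch: choose the test launcher based on process count.
# For nprocs=1 we skip mpiexec entirely, because `mpiexec -n 1 pytest ...`
# was observed to hang in ompi_finalize on the runner, while plain `pytest`
# and `mpiexec -n 2 ...` both behave.
run_tests() {
    nprocs="$1"
    shift
    if [ "$nprocs" -eq 1 ]; then
        # serial: invoke pytest directly, no MPI launcher involved
        echo "pytest $*"
    else
        # parallel: go through mpiexec as before
        echo "mpiexec -n $nprocs pytest $*"
    fi
}

# Hypothetical invocations (echoed rather than executed for demonstration):
run_tests 1 tests/tsfc
run_tests 2 tests/firedrake
```

Here the commands are `echo`ed so the dispatch logic is visible; a real script would execute them instead.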