Describe the bug
I've seen the htex MPI test hang in CI many times. I finally investigated it in a pull request and commented on it there (comments following #3800 (comment)), and now that I know what to look for, I have seen it in unrelated PRs against master (https://github.com/Parsl/parsl/actions/runs/14083883279/job/39442860054).
Reproducing my comments here:
I have looked at the logs for that failed test and it looks like the hang is inside an MPI executor test, so I would like to review that a bit more seriously.
The run directory is runinfo/008 inside https://github.com/Parsl/parsl/actions/runs/13796405061/artifacts/2733031351
As an example of suspicious behaviour: htex task 106 is placed into the backlog in manager.log but no worker ever executes that task:
runinfo/008/MPI_TEST/block-0/9a0f7d3b72d8$ grep 106 worker_*.log
returns nothing.
I haven't audited the other tasks in that log for liveness.
If it is only the last task failing, then maybe there is a race condition around final task handling ("more tasks need to keep arriving to keep backlog tasks scheduled", or something like that?).
and
I'll note that the final task completion is logged in the same millisecond as the final backlog placement:
2025-03-11 19:34:46.798 parsl.executors.high_throughput.mpi_resource_management:179 24205 Task-Puller [INFO] Not enough resources, placing task 106 into backlog
2025-03-11 19:34:46.798 worker_log:764 24220 MainThread [INFO] All processing finished for executor task 105
My gut then says there's a race condition:
on the task submit side:
T1: decide a task needs to be backlogged
T3: place the task in the backlog
on the results side:
T2: return from task 105, see that backlog is empty, do not schedule empty backlog
and now no event ever happens to schedule the backlog.
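To make that interleaving concrete, here is a minimal, self-contained Python sketch of the suspected pattern. This is not Parsl's actual code: the names backlog, submit_task, and on_task_done and the explicit sleep are hypothetical, and the sleep only exists to force the T1/T2/T3 ordering described above.

```python
# Sketch of the suspected race: the submit side decides to backlog a task (T1),
# the result side then checks an empty backlog and schedules nothing (T2),
# and only afterwards is the task actually appended (T3). Nothing drains it.
import threading
import time
from collections import deque

backlog = deque()      # tasks waiting for resources
lock = threading.Lock()
free_slots = 0         # pretend all workers are currently busy


def submit_task(task_id):
    # T1: decide the task must be backlogged because no slots are free
    if free_slots == 0:
        time.sleep(0.1)  # artificial delay to widen the race window
        # T3: place the task in the backlog (too late: the drain already ran)
        with lock:
            backlog.append(task_id)
            print(f"task {task_id} placed into backlog")


def on_task_done(finished_id):
    global free_slots
    # T2: a task finished; free its slot and try to schedule from the backlog
    with lock:
        free_slots += 1
        if backlog:
            print(f"scheduling backlogged task {backlog.popleft()}")
        else:
            print(f"task {finished_id} done; backlog empty, nothing to schedule")


submitter = threading.Thread(target=submit_task, args=(106,))
completer = threading.Thread(target=on_task_done, args=(105,))
submitter.start()
completer.start()
submitter.join()
completer.join()
# Final state: task 106 sits in the backlog and no further event drains it.
print(f"backlog left with: {list(backlog)}")
```

If the real scheduler only drains the backlog when a task completes, any backlog placement that lands after the final completion's drain check would be stranded in exactly this way.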
To Reproduce
I have not made a reproducer case. I made an attempt, but getting a suitable delay in the right place was awkward, so I stopped at writing up the T1/T2/T3 timeline above for an MPI person to look at more closely.
Expected behavior
No hang.
Environment
CI runs of a couple of different PRs, linked above.