htex mpi mode hang race condition on final task #3829

@benclifford

Description

Describe the bug
I've seen the htex MPI test hang in CI many times. I finally investigated it in a pull request and commented there (comments following #3800 (comment)), and now that I know what to look for, I have seen it in unrelated PRs against master (https://github.com/Parsl/parsl/actions/runs/14083883279/job/39442860054).

Reproducing my comments here:

I have looked at the logs for that failed test and it looks like the hang is inside an MPI executor test, so I would like to review that a bit more seriously.

The run directory is runinfo/008 inside https://github.com/Parsl/parsl/actions/runs/13796405061/artifacts/2733031351

As an example of suspicious behaviour: htex task 106 is placed into the backlog in manager.log, but no worker ever executes that task:

runinfo/008/MPI_TEST/block-0/9a0f7d3b72d8$ grep 106 worker_*.log

returns nothing.

I haven't audited the other tasks in that log for liveness.

If it is only the last task failing, then maybe there is a race condition around final-task handling ("more tasks need to keep arriving to keep backlog tasks scheduled", or something like that?).

and

I'll note that the final task completion is logged in the same millisecond as the final backlog placement:

2025-03-11 19:34:46.798 parsl.executors.high_throughput.mpi_resource_management:179 24205 Task-Puller [INFO]  Not enough resources, placing task 106 into backlog

2025-03-11 19:34:46.798 worker_log:764 24220 MainThread [INFO]  All processing finished for executor task 105

My gut then says there's a race condition:
on the task submit side:

T1: decide a task needs to be backlogged
T3: place the task in the backlog

on the results side:
T2: return from task 105, see that backlog is empty, do not schedule empty backlog

and now no event ever happens to schedule the backlog.
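The T1/T2/T3 timeline above is a classic check-then-act race: the only event that drains the backlog (a task result arriving) can fire in the gap between deciding to backlog a task and actually placing it there. The sketch below is a hypothetical, minimal illustration of that pattern with threads and deliberate sleeps to force the interleaving; it is not Parsl's actual scheduler code, and all names (`submit_side`, `results_side`, `backlog`) are invented for illustration.

```python
import threading
import time

# Hypothetical model of the suspected race, not Parsl's real code.
# The backlog is only drained when a result arrives, and the last
# result can arrive between T1 (decide to backlog) and T3 (place in backlog).

backlog = []                 # tasks waiting for resources
scheduled = []               # tasks actually handed to workers
lock = threading.Lock()      # protects the lists, but not the T1..T3 gap

def submit_side():
    # T1: decide task 106 must be backlogged (not enough resources)
    needs_backlog = True
    time.sleep(0.05)         # widen the T1..T3 window so the race fires
    if needs_backlog:
        # T3: place the task in the backlog; the only "result arrived"
        # event has already come and gone, so nothing drains it now
        with lock:
            backlog.append("task-106")

def results_side():
    # T2: task 105 returns; drain whatever is in the backlog right now
    time.sleep(0.01)
    with lock:
        while backlog:
            scheduled.append(backlog.pop(0))
    # no further results will arrive, so this never runs again

t_submit = threading.Thread(target=submit_side)
t_results = threading.Thread(target=results_side)
t_submit.start(); t_results.start()
t_submit.join(); t_results.join()

print(scheduled)  # task-106 was never scheduled
print(backlog)    # task-106 is stranded in the backlog: the hang
```

Under this model, the usual fixes would be either to hold one lock across the decide-and-place step and the check-and-drain step, or to re-check the backlog after placement; which (if either) is appropriate for the MPI resource manager is for an MPI person to judge.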

To Reproduce
I have not made a reproducer case. I made an attempt, but getting a suitable delay into the right place was awkward, so I stopped at the T1/T2/T3 timeline above, for an MPI person to look at more closely.

Expected behavior
No hang.

Environment
CI of a couple of different PRs, mentioned above.
