Skip to content

Conversation

@MamziB
Copy link
Contributor

@MamziB MamziB commented Apr 28, 2025

This PR fixes a hang in osc/ucx observed during MPI_Finalize with Graph500 by ensuring no outstanding operations remain on default endpoints before they are destroyed.

@MamziB
Copy link
Contributor Author

MamziB commented Apr 28, 2025

@janjust this should fix the Graph500 simple test bus error/ hang during MPI_Finalize.

return ret;
}

if (!opal_common_ucx_thread_enabled &&
Copy link
Member

@bosilca bosilca Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two problems with this conditional collective, and they both lead to deadlocks.

  1. opal_common_ucx_thread_enabled is inheriting the value of the thread level from MPI_Thread_init, and in MPI processes on the same communicator can have different levels of thread support. Thus, the true branch of this conditional can be taken by some processes, but not all in the communicator, and the barrier added here will deadlock.
  2. imagine a setup where some processes still have active windows while some others don't. The second part of the condition mca_osc_ucx_component.num_modules == 1 will then be true only on some processes, leading again to the barrier not being called everywhere in the communicator.
  3. Why is this going over all the processes in MPI_COMM_WORLD instead of all the processes in the communicator ?

The issue here is that a collective behavior cannot be decided by a non globally-consistent decision process (threading level of last window).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the issue is in MPI_Finalize as indicated in the log message, why are you adding a fix in win_free (which is called for every window) ?

I would have suggested to make this code unconditional and move in the the OSC UCX component_close but unfortunately the OSC components are closed using the ompi_mpi_instance_append_finalize and I don't know what exactly is the order compared with the PML and COLL components (both needed for this PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. In this case, I need to move the Barrier outside of the conditional block.

Why is this going over all the processes in MPI_COMM_WORLD instead of all the processes in the communicator ?
The objective is to ensure that during MPI_Finalize, all communication endpoints are properly flushed. To achieve this, we enforce a barrier both before and after the flush operation across all created endpoints (EPs).
It is important to note that EPs are created dynamically whenever communication occurs between two processes. Therefore, not all entries in the endpoints array are guaranteed to be valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I do not have a communicator object during component_finalize to enforce the Barrier, I am trying to flush during the last window free.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the commit, please take a look. Thank you

Copy link
Contributor Author

@MamziB MamziB Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am aware of that. That should use the same worker as the default worker, as here we are running single-threaded mpi. However, it seems like that was not enough, and we need to call EP flush individually on each target, too. Let me try to convert Worker flush and just use EP flush, and get back to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the below code to flush, but the finalize hang still occurs.

    int comm_size = ompi_comm_size (module->comm);
    for (i = 0; i < comm_size; i++) {
        assert(module->ctx != 0);
        ret = opal_common_ucx_ctx_flush(module->ctx, OPAL_COMMON_UCX_SCOPE_EP, i);
        if (OPAL_SUCCESS != ret) {
            return ret;
        }
    }
    

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have unfolded this function completely ! First because I don't understand why we need two barriers for each window_free ? And second because I don't understand how this fixes the problem. Let's assume we are using the UCX PML and a collective framework based on point-to-point. How comes that the barrier, which enqueues very short messages into most of the EP without a real synchronization, is not subject to the same issue (aka. pending acks while trying to finalize) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I opened the PR, I was under the impression that default EPs were not being flushed during window_free. However, after further discussion and a deeper dive into the code, it appears that the worker being flushed in window_free is indeed the default worker. So in theory, we should be covered in terms of flushing before MPI_Finalize.

Given that, what @devreal suggested about converting worker flush to EP flush should have worked — but it seems that it doesn’t.

I’ll continue investigating to better understand how this patch is helping to resolve the issue.

Copy link
Member

@bosilca bosilca Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it solves it the same way as an external barrier did in the past, it introduces a delay before closing the endpoints allowing the acks to flow back and thus hiding the core issue in most of the cases (but without really solving it).

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comment.

@MamziB MamziB force-pushed the mamzi/osc-ucx-finalize branch from d2d7778 to 0886a6b Compare April 28, 2025 19:48
@MamziB MamziB force-pushed the mamzi/osc-ucx-finalize branch from 0886a6b to 224168c Compare April 28, 2025 19:50
@bosilca bosilca marked this pull request as draft April 29, 2025 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants