Description
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI v4.1.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Compiled from the release source, with UCX
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
- Operating system/version: Rocky Linux 8
- Computer hardware: Gadi supercomputer, 48 cores per node. See here for more details: https://nci.org.au/our-systems/hpc-systems
- Network type: IB
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Problem I'm trying to solve: halo exchange for an ocean model running on an unstructured mesh
Solution 1: I was using RMA; each partition's process would call:
MPI_Win_fence(..)
MPI_Get(..)
MPI_Win_fence(..)
Scaling performance was terrible: as the number of partitions increased, the total number of exchanges did not decrease much, while the number of exchanges between partitions got smaller. I tried all sorts of synchronization methods; eventually the advice from the HPC folks was: do not use RMA.
Solution 2: I switched to MPI_Sendrecv() and performance improved 10-fold (measured as total time for comms)!
BUT, this was in my test program. While the comms time in my full application did improve "most of the time", every now and then the Sendrecv() would take many seconds to complete. The timings look something like this:
     fill_3d_w2w 	:  			3.64980
     fill_3d_w2w 	:  			1.35756
     fill_3d_w2w 	:  			0.01945
     fill_3d_w2w 	:  			0.01938
     fill_3d_w2w 	:  			0.01928
     fill_3d_w2w 	:  			0.01969
     fill_3d_w2w 	:  			9.61830
     fill_3d_w2w 	:  			0.01991
     fill_3d_w2w 	:  			0.01956
     fill_3d_w2w 	:  			0.01946
     fill_3d_w2w 	:  			0.01933
     fill_3d_w2w 	:  			0.01984
     fill_3d_w2w 	:  			0.01907
     fill_3d_w2w 	:  			0.01945
     fill_3d_w2w 	:  			0.01974
     fill_3d_w2w 	:  			0.01916
     fill_3d_w2w 	:  			0.38533
     fill_3d_w2w 	:  			7.96889
     fill_3d_w2w 	:  			0.01937
     fill_3d_w2w 	:  			0.01916
     fill_3d_w2w 	:  			0.01936
     fill_3d_w2w 	:  			0.01932
     fill_3d_w2w 	:  			0.01008
     fill_3d_w2w 	:  			0.01040
     fill_3d_w2w 	:  			0.01947
     fill_3d_w2w 	:  			0.02222
     fill_3d_w2w 	:  			0.01956
     fill_3d_w2w 	:  			0.01943
     fill_3d_w2w 	:  			0.01919
     fill_3d_w2w 	:  			0.01946
     fill_3d_w2w 	:  			0.00461
     fill_3d_w2w 	:  			11.92141
     fill_3d_w2w 	:  			0.02010
There is an MPI_Barrier just before the comms, and I'm wrapping gettimeofday to get the timing. I have also tried Isends with Recvs, and even created a Dist_graph and used MPI_Neighbor_alltoallw, all with similar results.
While MPI_Get was slower, it was always consistent/stable, i.e. the timing for each iteration was very similar. Of course, load imbalance can always be an issue, but here I'm just doing a tic/toc around the comms, with a Barrier just before.
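To make the measurement concrete, the tic/toc is along the lines of the sketch below. The helper names `wall_seconds` and `elapsed_seconds` are made up for illustration and are not the names used in the application; the barrier and the exchange itself appear only in the comment.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: difference b - a, in seconds. */
double elapsed_seconds(struct timeval a, struct timeval b)
{
    return (double)(b.tv_sec - a.tv_sec)
         + (double)(b.tv_usec - a.tv_usec) * 1e-6;
}

/* Hypothetical helper: current wall-clock time in seconds since the epoch. */
double wall_seconds(void)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)now.tv_sec + (double)now.tv_usec * 1e-6;
}

/*
 * Intended use around the halo exchange:
 *
 *   MPI_Barrier(comm);                 // line everyone up first
 *   double t0 = wall_seconds();        // tic
 *   ... MPI_Sendrecv() calls ...
 *   double dt = wall_seconds() - t0;   // toc: the value printed per iteration
 */
```

Since the barrier sits outside the timed region, time spent waiting for late ranks to reach the exchange should not be included in the numbers above. `MPI_Wtime()` would be the standard MPI alternative to `gettimeofday`.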
Note also that this is all on the same node.
Any ideas?