Hi, first of all, let me acknowledge what a fantastic package this is. I have successfully used MPI.jl for embarrassingly parallel groups of non-embarrassingly-parallel jobs, on up to 8,320 CPUs and for 5 hours. I wrote the non-embarrassingly-parallel code with Distributed and DistributedArrays, and once it worked to my satisfaction, I scaled it up to hundreds and then thousands of cores using MPI.jl.
This has been working fine over the past few years, but I recently noticed odd behaviour without any code changes in my package: every 1000 iterations of my code take an increasingly large amount of time.
Each individual iteration can be parallelised, after which I need to do some work before the next iteration, so my main code block looks like this:
```julia
for isample = 1:nsamples
    @sync for (ipid, pid) in enumerate(pids)
        @async r[ipid] = remotecall_fetch(largecalc, pid, results, data, isample)
    end
    # do stuff
end
```
where `results` and `data` are DArrays whose localparts live on the corresponding `pid`. With an MWE, I have narrowed it down (I think) to changes in MPI.jl after the breaking 0.20.0 release. Below is the MWE using only Distributed.jl, contrasted with Distributed.jl together with MPI.jl and MPIClusterManagers.jl (alas, my cluster only allows MPI transport); it can be run on a desktop (Mac M2 2023, 24 GB RAM, 8 cores, julia-1.12.4, Homebrew mpirun (Open MPI) 5.0.9). I am not fussed about the environment variables changing at all; I think MPIPreferences makes everything much simpler after 0.20.0.
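For context, the MPIPreferences setup I am referring to is roughly the following (a minimal sketch, not literally what I ran; `use_system_binary` picks up a system MPI such as the Homebrew Open MPI or the cluster's Intel MPI module, and `use_jll_binary` switches to the bundled artifact):

```julia
# Run once in the project environment, then restart Julia.
using MPIPreferences

# Point MPI.jl (>= 0.20) at the system MPI library:
MPIPreferences.use_system_binary()

# ...or switch to the bundled artifact ("the jll") for comparison:
# MPIPreferences.use_jll_binary("OpenMPI_jll")
```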
Here is the module. It initialises DArrays in one function, defines a function that does some work with them, and calls that from another function which parallelises each iteration and writes the timings to a file:
```julia
module TestMPISlowdown

using Distributed, DistributedArrays

# Distribute a length-1 results vector and a random data vector to each worker.
function init_darrays(pids, nvars)
    results_ = Array{Future, 1}(undef, length(pids))
    data_ = Array{Future, 1}(undef, length(pids))
    @sync for (idx, pid) in enumerate(pids)
        results_[idx] = @spawnat pid [[0.]]
        data_[idx] = @spawnat pid [rand(nvars)]
    end
    DArray(results_), DArray(data_)
end

# Synthetic work: running mean of the dot product data'data over n calls.
function largecalc(results, data, n)
    s = (data)'data
    results[1] = results[1]*(n-1) + s
    results[1] /= n
end

function largecalc(results::DArray, data::DArray, n::Integer)
    largecalc(localpart(results)[1], localpart(data)[1], n)
end

# Run nsamples serial iterations of the parallel work, logging the time
# per 1000 iterations to fname.
function doiters(nsamples, pids, results::DArray, data::DArray, fname::String, str::String="")
    r = zeros(length(pids))
    tlong = time()
    for isample = 1:nsamples
        @sync for (ipid, pid) in enumerate(pids)
            @async r[ipid] = remotecall_fetch(largecalc, pid, results, data, isample)
        end
        if mod(isample-1, 1000) == 0
            dt = time() - tlong
            tlong = time()
            if isample == 1
                fp = open(fname, "w")
                write(fp, str*"\n")
            else
                fp = open(fname, "a")
            end
            write(fp, "$isample\t$dt\n")
            flush(fp); close(fp)
        end
    end
end

end # Module
```
Here is the script `useTestMPISlowdown_MPI.jl`, which calls this module using mpiexec:
```julia
using MPIClusterManagers, Distributed
import MPI
MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
sz = MPI.Comm_size(MPI.COMM_WORLD)
if rank == 0
    @info "size is $sz"
end
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)
@info "there are $(nworkers()) workers"
@everywhere using Distributed
@everywhere push!(LOAD_PATH, pwd())
@everywhere using TestMPISlowdown
## some settings
nsamples = 20001               # serial iterations of parallel work
nvars = 2^25                   # dimension of random variable x to play with
vstring = "$(pkgversion(MPI))" # MPI.jl version
fname = "testMPI_$vstring.txt" # file to write output in
## do the work and write timings
pids = workers()
r, d = TestMPISlowdown.init_darrays(pids, nvars); # return DArrays
TestMPISlowdown.doiters(nsamples, pids, r, d, fname, "MPI.jl v"*vstring)
## remove workers
MPIClusterManagers.stop_main_loop(manager)
rmprocs(workers())
exit()
```
In zsh, if I now do
```
mpiexec -np 6 julia useTestMPISlowdown_MPI.jl
```
then with MPI.jl v0.20.23 I get the following increase in time per 1000 iterations, consistent with what I see in my code on the HPC cluster:
```
MPI.jl v0.20.23
1       0.15433502197265625
1001    22.368630170822144
2001    22.45356798171997
3001    22.8455069065094
4001    23.444272994995117
5001    23.500241994857788
6001    23.8661470413208
7001    24.7068829536438
8001    25.006561040878296
9001    26.50983691215515
10001   26.9981791973114
```
With plain Distributed.jl we can still run the `## some settings` and `## do the work and write timings` parts of the calling script above, after doing
```julia
cd(@__DIR__)
using Distributed
addprocs(5)
@everywhere push!(LOAD_PATH, pwd())
@everywhere using TestMPISlowdown
```
to get (for 20_001 samples now, just to check) no increase in time per 1000 iterations:
```
Plain Distributed.jl
1       0.07457900047302246
1001    20.80730700492859
2001    20.711559057235718
3001    20.714218139648438
...
10001   21.25517702102661
...
17001   20.81500005722046
18001   20.852991104125977
19001   20.885419845581055
20001   20.884173154830933
```
This closely matches MPI.jl after switching back to v0.19.2 (version pinning sketched after these timings) and running `mpiexec -np 6 julia useTestMPISlowdown_MPI.jl`:
```
MPI.jl v0.19.2
1       0.14057302474975586
1001    21.441424131393433
2001    21.38768696784973
3001    21.708193063735962
4001    21.66358494758606
5001    21.376691102981567
6001    21.391920804977417
7001    21.380391120910645
8001    21.381767988204956
9001    21.376633882522583
10001   21.417943000793457
```
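For completeness, switching MPI.jl releases for these comparisons is just a matter of pinning the package in the project environment, along these lines (a sketch; adjust the version string as needed):

```julia
# Sketch: pin MPI.jl to a specific release to A/B the timings.
using Pkg
Pkg.add(name="MPI", version="0.19.2")   # or "0.20.23"
Pkg.pin("MPI")
```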
I can see similar behaviour on the cluster as well, this time with intel-mpi/2021.10.0: time blows up with MPI.jl v0.20.23 but not with v0.19.2. On my Mac, switching over to the jll didn't help either.
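In case it is useful for debugging, something like the following should confirm which library MPI.jl has actually picked up in a given environment (a sketch; as far as I know `MPI.versioninfo()` is only available on the 0.20.x releases):

```julia
# Check which MPI binary/ABI MPI.jl is using (MPI.jl >= 0.20).
using MPI
MPI.versioninfo()   # prints the MPIPreferences binary choice and library version
```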
Any ideas as to what might be going on? Many thanks.