Hi, first of all, let me acknowledge what a fantastic package this is. I have successfully used MPI.jl for embarrassingly parallel groups of non-embarrassingly-parallel jobs, on up to 8,320 CPUs and for 5 hours. I wrote the non-embarrassingly-parallel code with Distributed and DistributedArrays, and once it worked to my satisfaction, I scaled it up to hundreds and then thousands of cores using MPI.jl.
This has been working fine over the past few years, but I recently noticed odd behaviour without any code changes in my package: every 1000 iterations of my code take an increasingly large amount of time.
Each individual iteration can be parallelised, after which I need to do some work before the next iteration, so my main code block looks like this:
```julia
for isample = 1:nsamples
    @sync for (ipid, pid) in enumerate(pids)
        @async r[ipid] = remotecall_fetch(largecalc, pid, results, data, isample)
    end
    # do stuff
end
```
where `results` and `data` are DArrays whose localparts live on the corresponding `pid`. With an MWE, I have narrowed it down (I think) to changes in MPI.jl after the breaking 0.20.0 release. Below is the MWE using only Distributed.jl, contrasted with Distributed.jl together with MPI.jl and MPIClusterManagers.jl (alas, my cluster only allows MPI transport); it can be run on a desktop (Mac M2 2023, 24 GB RAM, 8 cores, julia-1.12.4, Homebrew mpirun (Open MPI) 5.0.9). I am not fussed about the environment variables changing at all; I think MPIPreferences makes everything much simpler after 0.20.0.
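For context, the MPIPreferences setup I am referring to is roughly the following (a minimal sketch, not literally what I ran; `use_system_binary` picks up a system MPI such as the Homebrew Open MPI or the cluster's Intel MPI module, and `use_jll_binary` switches to the bundled artifact):

```julia
# Run once in the project environment, then restart Julia.
using MPIPreferences

# Point MPI.jl (>= 0.20) at the system MPI library:
MPIPreferences.use_system_binary()

# ...or switch to the bundled artifact ("the jll") for comparison:
# MPIPreferences.use_jll_binary("OpenMPI_jll")
```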
Here is the module. It initialises DArrays in one function, defines a function that does some work with them, and calls that from another function which parallelises each iteration and writes the timings to a file:
```julia
module TestMPISlowdown

using Distributed, DistributedArrays

# Distribute a length-1 results vector and a random data vector to each worker.
function init_darrays(pids, nvars)
    results_ = Array{Future, 1}(undef, length(pids))
    data_ = Array{Future, 1}(undef, length(pids))
    @sync for (idx, pid) in enumerate(pids)
        results_[idx] = @spawnat pid [[0.]]
        data_[idx] = @spawnat pid [rand(nvars)]
    end
    DArray(results_), DArray(data_)
end

# Synthetic work: running mean of the dot product data'data over n calls.
function largecalc(results, data, n)
    s = (data)'data
    results[1] = results[1]*(n-1) + s
    results[1] /= n
end

function largecalc(results::DArray, data::DArray, n::Integer)
    largecalc(localpart(results)[1], localpart(data)[1], n)
end

# Run nsamples serial iterations of the parallel work, logging the time
# per 1000 iterations to fname.
function doiters(nsamples, pids, results::DArray, data::DArray, fname::String, str::String="")
    r = zeros(length(pids))
    tlong = time()
    for isample = 1:nsamples
        @sync for (ipid, pid) in enumerate(pids)
            @async r[ipid] = remotecall_fetch(largecalc, pid, results, data, isample)
        end
        if mod(isample-1, 1000) == 0
            dt = time() - tlong
            tlong = time()
            if isample == 1
                fp = open(fname, "w")
                write(fp, str*"\n")
            else
                fp = open(fname, "a")
            end
            write(fp, "$isample\t$dt\n")
            flush(fp); close(fp)
        end
    end
end

end # Module
```
Here is the script `useTestMPISlowdown_MPI.jl`, which calls this module using mpiexec:
```julia
using MPIClusterManagers, Distributed
import MPI
MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
sz = MPI.Comm_size(MPI.COMM_WORLD)
if rank == 0
    @info "size is $sz"
end
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)
@info "there are $(nworkers()) workers"
@everywhere using Distributed
@everywhere push!(LOAD_PATH, pwd())
@everywhere using TestMPISlowdown
## some settings
nsamples = 20001               # serial iterations of parallel work
nvars = 2^25                   # dimension of random variable x to play with
vstring = "$(pkgversion(MPI))" # MPI.jl version
fname = "testMPI_$vstring.txt" # file to write output in
## do the work and write timings
pids = workers()
r, d = TestMPISlowdown.init_darrays(pids, nvars); # return DArrays
TestMPISlowdown.doiters(nsamples, pids, r, d, fname, "MPI.jl v"*vstring)
## remove workers
MPIClusterManagers.stop_main_loop(manager)
rmprocs(workers())
exit()
```
In zsh, if I now do
```
mpiexec -np 6 julia useTestMPISlowdown_MPI.jl
```
then with MPI.jl v0.20.23 I get the following increase in time per 1000 iterations, consistent with what I see in my code on the HPC cluster:
```
MPI.jl v0.20.23
1       0.15433502197265625
1001    22.368630170822144
2001    22.45356798171997
3001    22.8455069065094
4001    23.444272994995117
5001    23.500241994857788
6001    23.8661470413208
7001    24.7068829536438
8001    25.006561040878296
9001    26.50983691215515
10001   26.9981791973114
```
With plain Distributed.jl we can still run the `## some settings` and `## do the work and write timings` parts of the calling script above, after doing
```julia
cd(@__DIR__)
using Distributed
addprocs(5)
@everywhere push!(LOAD_PATH, pwd())
@everywhere using TestMPISlowdown
```
to get (for 20_001 samples now, just to check) no increase in time per 1000 iterations:
```
Plain Distributed.jl
1       0.07457900047302246
1001    20.80730700492859
2001    20.711559057235718
3001    20.714218139648438
...
10001   21.25517702102661
...
17001   20.81500005722046
18001   20.852991104125977
19001   20.885419845581055
20001   20.884173154830933
```
This closely matches MPI.jl after switching back to v0.19.2 (version pinning sketched after these timings) and running `mpiexec -np 6 julia useTestMPISlowdown_MPI.jl`:
```
MPI.jl v0.19.2
1       0.14057302474975586
1001    21.441424131393433
2001    21.38768696784973
3001    21.708193063735962
4001    21.66358494758606
5001    21.376691102981567
6001    21.391920804977417
7001    21.380391120910645
8001    21.381767988204956
9001    21.376633882522583
10001   21.417943000793457
```
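For completeness, switching MPI.jl releases for these comparisons is just a matter of pinning the package in the project environment, along these lines (a sketch; adjust the version string as needed):

```julia
# Sketch: pin MPI.jl to a specific release to A/B the timings.
using Pkg
Pkg.add(name="MPI", version="0.19.2")   # or "0.20.23"
Pkg.pin("MPI")
```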
I can see similar behaviour on the cluster as well, this time with intel-mpi/2021.10.0: time blows up with MPI.jl v0.20.23 but not with v0.19.2. On my Mac, switching over to the jll didn't help either.
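In case it is useful for debugging, something like the following should confirm which library MPI.jl has actually picked up in a given environment (a sketch; as far as I know `MPI.versioninfo()` is only available on the 0.20.x releases):

```julia
# Check which MPI binary/ABI MPI.jl is using (MPI.jl >= 0.20).
using MPI
MPI.versioninfo()   # prints the MPIPreferences binary choice and library version
```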
Any ideas as to what might be going on? Many thanks.