Combining distributed MPI outputs is slow #4669
-
cc @navidcy
-
I was thinking we could either find a way to not loop through time, or somehow set the FTS directly without going through the per-time-step `set!` call. I have put this in Oceananigans because, even though my use case is in ClimaOcean, it would also apply to distributed Oceananigans simulations over long timescales.
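For the second option, here is a minimal sketch of what setting the FTS directly might look like, assuming an in-memory `FieldTimeSeries` for which `fts[idx]` returns the `Field` at time index `idx`, and reusing `prefix`, `ranks`, `ny`, and `iters` as they appear in the loop shown further down the thread:

```julia
using JLD2, Oceananigans

# Sketch only (untested): copy each rank's y-slab straight into the
# time-series snapshot, skipping the intermediate scratch Field and set!.
# Assumes `prefix`, `ranks`, `ny`, `iters`, and `fts` as defined elsewhere
# in this thread.
files = [jldopen(prefix * "_iteration0_rank$(rank).jld2") for rank in ranks]

for (idx, iter) in enumerate(iters)
    for (rank, file) in zip(ranks, files)
        irange = ny * rank + 1 : ny * (rank + 1)                # y-range owned by this rank
        data = file["timeseries/$(fts.name)/$(iter)"][:, :, 1]  # surface slice for this snapshot
        interior(fts[idx], :, irange, 1) .= data                # write directly into the series
    end
end

close.(files)
```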
-
Shouldn't you swap the loop over ranks with the loop over time, so that `set!` is called only once per time index?

```julia
# Open one file per rank up front, assemble the full surface field for
# each time index, then call set! once per snapshot.
files = [jldopen(prefix * "_iteration0_rank$(rank).jld2") for rank in ranks]

for (idx, iter) in enumerate(iters)
    for (rank, file) in zip(ranks, files)
        irange = ny * rank + 1 : ny * (rank + 1)
        data = file["timeseries/$(fts.name)/$(iter)"][:, :, 1]
        interior(field, :, irange, 1) .= data
    end
    @time set!(fts, field, idx)
end

close.(files)
```

Also, why is the file called
-
Won't this often blow up memory on the CPU? (Just curious.) If you need to distribute the simulation over many GPUs, it is probably quite a large simulation, so this algorithm will fail in many cases, right? But there is a niche of cases that work, where a modestly sized simulation is distributed among a small number of GPUs?
-
Another comment: I think that you should not combine all of the data in a single step. I think you should combine the data as needed, on the fly, and use
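One way to read the on-the-fly suggestion (a sketch under my own assumptions, not necessarily what was meant here): a small helper that assembles a single global snapshot from the per-rank files only when it is needed, so the full series is never materialized at once. `prefix`, `ranks`, and `ny` follow the naming used elsewhere in the thread; `name` and `field` are hypothetical arguments (a variable name such as "T" and a preallocated global surface `Field`).

```julia
using JLD2, Oceananigans

# Sketch only: assemble one global surface snapshot on demand,
# instead of building the entire FieldTimeSeries up front.
function load_global_snapshot!(field, prefix, name, ranks, ny, iter)
    for rank in ranks
        jldopen(prefix * "_iteration0_rank$(rank).jld2") do file
            irange = ny * rank + 1 : ny * (rank + 1)
            interior(field, :, irange, 1) .= file["timeseries/$name/$iter"][:, :, 1]
        end
    end
    return field
end

# Usage: only the snapshots you actually analyze are ever read, e.g.
# load_global_snapshot!(field, prefix, "T", ranks, ny, iters[42])
```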
-
Hi everyone,
I am running a parallel GPU setup of ClimaOcean using CUDA and MPI. I am outputting a set of daily surface property fields (say `T`, `S`, `u`, `v`, `w`, `e`) for 15 years = 5475 days. For three GPUs, the output files are something like `global_surface_fields_onedeg_iteration0_rank0.jld2`, `global_surface_fields_onedeg_iteration0_rank1.jld2`, and `global_surface_fields_onedeg_iteration0_rank2.jld2`, and the files are split along the y-axis.

I have written a function that combines the ranks of these output files into a `FieldTimeSeries` by saving a `Field` at a given time step, then using `set!(FTS, field, time_index)` to set the field into the `FieldTimeSeries` (thanks @simone-silvestri for the code):
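The combining loop has roughly this shape (a sketch inferred from the description and the reply above, not the exact code; `prefix`, `ranks`, `ny`, `iters`, `field`, and `fts` follow the naming used there):

```julia
# Sketch of the combining approach described above (not the exact code):
# for each rank, loop over time, copy that rank's y-slab into a scratch
# `field`, then call set! on the time series.
for rank in ranks
    file = jldopen(prefix * "_iteration0_rank$(rank).jld2")
    irange = ny * rank + 1 : ny * (rank + 1)
    for (idx, iter) in enumerate(iters)
        data = file["timeseries/$(fts.name)/$(iter)"][:, :, 1]
        interior(field, :, irange, 1) .= data
        set!(fts, field, idx)   # ~0.03 s per call: the slow step
    end
    close(file)
end
```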
Unfortunately, the `set!(fts, field, idx)` line takes about 0.03 s per time step per rank per variable. So, combining 6 variables over 15 years of daily data across 3 GPUs comes to roughly 50 minutes (5475 × 3 × 6 × 0.03 s ≈ 49 minutes). The more files, the longer this takes.
Is there a way to speed this step up?