Combining distributed MPI outputs is slow #4669
-
cc @navidcy
-
I was thinking we could either find a way to not loop through time, or somehow set the FTS directly without going through the per-time-step `set!` call. I have put this in Oceananigans because, even though my use case is in ClimaOcean, it would also apply to distributed Oceananigans simulations over long timescales.
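For the second option, here is a minimal sketch of what setting the FTS directly might look like, assuming an in-memory `FieldTimeSeries` for which `fts[idx]` returns the `Field` at time index `idx`, and reusing `prefix`, `ranks`, `ny`, and `iters` as they appear in the loop shown further down the thread:

```julia
using JLD2, Oceananigans

# Sketch only (untested): copy each rank's y-slab straight into the
# time-series snapshot, skipping the intermediate scratch Field and set!.
# Assumes `prefix`, `ranks`, `ny`, `iters`, and `fts` as defined elsewhere
# in this thread.
files = [jldopen(prefix * "_iteration0_rank$(rank).jld2") for rank in ranks]

for (idx, iter) in enumerate(iters)
    for (rank, file) in zip(ranks, files)
        irange = ny * rank + 1 : ny * (rank + 1)                # y-range owned by this rank
        data = file["timeseries/$(fts.name)/$(iter)"][:, :, 1]  # surface slice for this snapshot
        interior(fts[idx], :, irange, 1) .= data                # write directly into the series
    end
end

close.(files)
```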
-
Shouldn't you swap the loop over ranks with the loop over time, so that `set!` is called only once per time index?

```julia
# Open one file per rank up front, assemble the full surface field for
# each time index, then call set! once per snapshot.
files = [jldopen(prefix * "_iteration0_rank$(rank).jld2") for rank in ranks]

for (idx, iter) in enumerate(iters)
    for (rank, file) in zip(ranks, files)
        irange = ny * rank + 1 : ny * (rank + 1)
        data = file["timeseries/$(fts.name)/$(iter)"][:, :, 1]
        interior(field, :, irange, 1) .= data
    end
    @time set!(fts, field, idx)
end

close.(files)
```

Also, why is the file called
-
Won't this often blow up memory on the CPU? (Just curious.) If you need to distribute the simulation over many GPUs, it is probably quite a large simulation, so this algorithm will fail in many cases, right? But there is a niche of cases that work, where a modestly sized simulation is distributed among a small number of GPUs?
-
Another comment: I think that you should not combine all of the data in a single step. I think you should combine the data as needed, on the fly, and use
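One way to read the on-the-fly suggestion (a sketch under my own assumptions, not necessarily what was meant here): a small helper that assembles a single global snapshot from the per-rank files only when it is needed, so the full series is never materialized at once. `prefix`, `ranks`, and `ny` follow the naming used elsewhere in the thread; `name` and `field` are hypothetical arguments (a variable name such as "T" and a preallocated global surface `Field`).

```julia
using JLD2, Oceananigans

# Sketch only: assemble one global surface snapshot on demand,
# instead of building the entire FieldTimeSeries up front.
function load_global_snapshot!(field, prefix, name, ranks, ny, iter)
    for rank in ranks
        jldopen(prefix * "_iteration0_rank$(rank).jld2") do file
            irange = ny * rank + 1 : ny * (rank + 1)
            interior(field, :, irange, 1) .= file["timeseries/$name/$iter"][:, :, 1]
        end
    end
    return field
end

# Usage: only the snapshots you actually analyze are ever read, e.g.
# load_global_snapshot!(field, prefix, "T", ranks, ny, iters[42])
```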
-
Hi everyone,
I am running a parallel GPU setup of ClimaOcean using CUDA and MPI. I am outputting a set of daily surface property fields (say `T`, `S`, `u`, `v`, `w`, `e`) for 15 years = 5475 days. For three GPUs, the output files are something like `global_surface_fields_onedeg_iteration0_rank0.jld2`, `global_surface_fields_onedeg_iteration0_rank1.jld2`, and `global_surface_fields_onedeg_iteration0_rank2.jld2`, and the files are split along the y-axis.

I have written a function that combines the ranks of these output files into a `FieldTimeSeries` by saving a `Field` at a given time step, then using `set!(FTS, field, time_index)` to set the field into the `FieldTimeSeries` (thanks @simone-silvestri for the code):
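The combining loop has roughly this shape (a sketch inferred from the description and the reply above, not the exact code; `prefix`, `ranks`, `ny`, `iters`, `field`, and `fts` follow the naming used there):

```julia
# Sketch of the combining approach described above (not the exact code):
# for each rank, loop over time, copy that rank's y-slab into a scratch
# `field`, then call set! on the time series.
for rank in ranks
    file = jldopen(prefix * "_iteration0_rank$(rank).jld2")
    irange = ny * rank + 1 : ny * (rank + 1)
    for (idx, iter) in enumerate(iters)
        data = file["timeseries/$(fts.name)/$(iter)"][:, :, 1]
        interior(field, :, irange, 1) .= data
        set!(fts, field, idx)   # ~0.03 s per call: the slow step
    end
    close(file)
end
```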
Unfortunately, the `set!(fts, field, idx)` line takes about 0.03 s per time step per rank per variable. So, combining 6 variables over 15 years of daily data across 3 GPUs comes to roughly 50 minutes (5475 × 3 × 6 × 0.03 s ≈ 49 minutes). The more files, the longer this takes.
Is there a way to speed this step up?