MPI error with `kernel_launching.jl` #3981

francispoulin · 2024-12-12T15:50:14Z

francispoulin
Dec 12, 2024
Collaborator

We (@jakob-braga and @francispoulin) are trying to run `distributed_nonhydrostatic_turbulence.jl` and are getting an error.

We have tried two different servers and get the same error on both; it originates in `kernel_launching.jl`.

@glwagner @simone-silvestri ?

ERROR: LoadError: ERROR: LoadError: MethodError: no method matching interior_work_layout(::RectilinearGrid{Float64, Periodic, Periodic, Flat, Oceananigans.Grids.StaticVerticalCoordinate{Nothing, Float64}, Float64, Float64, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, Distributed{CPU, false, Partition{Nothing, Nothing, Nothing}, Tuple{Int64, Int64, Int64}, Int64, Tuple{Int64, Int64, Int64}, Oceananigans.DistributedComputations.RankConnectivity{Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, MPI.Comm, Vector{MPI.Request}, Base.RefValue{Int64}}}, ::Tuple{}, ::Tuple{DataType, DataType, DataType})

Closest candidates are:
  interior_work_layout(::Any, !Matched::Symbol, ::Any)
   @ Oceananigans ~/software/Oceananigans.jl/src/Utils/kernel_launching.jl:133

simone-silvestri · 2024-12-12T16:55:17Z

simone-silvestri
Dec 12, 2024
Maintainer

Weird, I ran it on main and I don't see this problem:

(base) simonesilvestri@Simones-MacBook-Pro Oceananigans.jl % mpiexecjl -np 4 julia --project validation/distributed_simulations/distributed_nonhydrostatic_turbulence.jl
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
grid = grid = 64×256×1 RectilinearGrid{Float64, Oceananigans.Grids.FullyConnected, Periodic, Flat} on Distributed{CPU} with 3×3×0 halo
├── FullyConnected x ∈ [3.14159, 4.71239) regularly spaced with Δx=0.0245437
├── Periodic y ∈ [-1.91418e-18, 6.28319)  regularly spaced with Δy=0.0245437
└── Flat z                                64×256×1 RectilinearGrid{Float64, Oceananigans.Grids.FullyConnected, Periodic, Flat} on Distributed{CPU} with 3×3×0 halo
├── FullyConnected x ∈ [4.71239, 6.28319) regularly spaced with Δx=0.0245437
├── Periodic y ∈ [-1.91418e-18, 6.28319)  regularly spaced with Δy=0.0245437
└── Flat z

grid = 64×256×1 RectilinearGrid{Float64, Oceananigans.Grids.FullyConnected, Periodic, Flat} on Distributed{CPU} with 3×3×0 halo
├── FullyConnected x ∈ [2.41353e-18, 1.5708) regularly spaced with Δx=0.0245437
├── Periodic y ∈ [-1.91418e-18, 6.28319)     regularly spaced with Δy=0.0245437
└── Flat z
grid = 64×256×1 RectilinearGrid{Float64, Oceananigans.Grids.FullyConnected, Periodic, Flat} on Distributed{CPU} with 3×3×0 halo
├── FullyConnected x ∈ [1.5708, 3.14159) regularly spaced with Δx=0.0245437
├── Periodic y ∈ [-1.91418e-18, 6.28319) regularly spaced with Δy=0.0245437
└── Flat z
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Iteration: 0, time: 0 seconds
[ Info: Rank 1: max|ζ|: 7.58e+01, max(e): 2.33e-01
[ Info: Rank 3: max|ζ|: 7.49e+01, max(e): 2.17e-01
[ Info: Rank 0: max|ζ|: 7.60e+01, max(e): 2.39e-01
[ Info: Rank 2: max|ζ|: 7.54e+01, max(e): 2.52e-01
[ Info:     ... simulation initialization complete (14.299 seconds)
[ Info:     ... simulation initialization complete (14.139 seconds)
[ Info: Executing initial time step...
[ Info: Executing initial time step...
[ Info:     ... simulation initialization complete (14.232 seconds)
[ Info: Executing initial time step...
[ Info:     ... simulation initialization complete (14.290 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (5.912 seconds).
[ Info:     ... initial time step complete (5.913 seconds).
[ Info:     ... initial time step complete (5.913 seconds).
[ Info:     ... initial time step complete (5.883 seconds).
[ Info: Iteration: 10, time: 100.000 ms
[ Info: Rank 2: max|ζ|: 4.50e+01, max(e): 9.66e-02
[ Info: Rank 3: max|ζ|: 4.45e+01, max(e): 9.25e-02
[ Info: Rank 0: max|ζ|: 4.28e+01, max(e): 9.77e-02
[ Info: Rank 1: max|ζ|: 4.48e+01, max(e): 9.63e-02
[ Info: Iteration: 20, time: 190.000 ms
[ Info: Rank 3: max|ζ|: 3.31e+01, max(e): 7.02e-02
[ Info: Rank 1: max|ζ|: 3.33e+01, max(e): 7.47e-02
[ Info: Rank 2: max|ζ|: 3.47e+01, max(e): 6.17e-02
[ Info: Rank 0: max|ζ|: 3.41e+01, max(e): 7.20e-02
[ Info: Iteration: 30, time: 280.000 ms
...
[ Info: Simulation is stopping after running for 47.977 seconds.
[ Info: Simulation is stopping after running for 47.941 seconds.
[ Info: Model iteration 1000 equals or exceeds stop iteration 1000.
[ Info: Simulation is stopping after running for 47.910 seconds.
[ Info: Model iteration 1000 equals or exceeds stop iteration 1000.
[ Info: Model iteration 1000 equals or exceeds stop iteration 1000.
[ Info: Simulation is stopping after running for 47.821 seconds.
[ Info: Model iteration 1000 equals or exceeds stop iteration 1000.
[ Info: Iteration: 1000, time: 9.170 seconds
[ Info: Rank 0: max|ζ|: 3.39e+00, max(e): 8.75e-03
[ Info: Rank 1: max|ζ|: 3.44e+00, max(e): 7.93e-03
[ Info: Rank 2: max|ζ|: 3.63e+00, max(e): 9.52e-03
[ Info: Rank 3: max|ζ|: 3.65e+00, max(e): 7.15e-03
(base) simonesilvestri@Simones-MacBook-Pro Oceananigans.jl %

Is your MPI configured correctly? Are you maybe using an old version of Oceananigans?

6 replies

simone-silvestri Dec 12, 2024
Maintainer

You need to make sure that Julia "sees" the correct MPI, which is either the one you have installed on your system or the one provided through the JLLs.
You can look at this documentation for more info: https://juliaparallel.org/MPI.jl/v0.20/configuration/

As far as I know, there are two ways to configure MPI (at least, these are the ones I find easiest):

- Use mpiexecjl: open Julia and run `using MPI; MPI.install_mpiexecjl()`, then launch with the mpiexecjl executable that is placed in the .julia/bin folder.
- Configure the preferences: run `using MPIPreferences; MPIPreferences.use_system_binary()`, then launch with the plain old mpirun or mpiexec (I have found that this does not always work).
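For reference, the two options above can be written out as a Julia session. This is only a sketch: the `MPI.versioninfo()` call is added here as a sanity check and was not part of the original suggestion, and the run commands in the comments assume the default install location and a hypothetical script name `script.jl`.

```julia
# Option 1: install the mpiexecjl wrapper (by default into ~/.julia/bin).
using MPI
MPI.install_mpiexecjl()
# Then, from a shell:
#   ~/.julia/bin/mpiexecjl -np 4 julia --project script.jl

# Option 2: point MPI.jl at the MPI library installed on the system.
using MPIPreferences
MPIPreferences.use_system_binary()
# Restart Julia so the new preference takes effect, then, from a shell:
#   mpirun -np 4 julia --project script.jl

# Sanity check (an addition, not from the thread): print which MPI binary
# and library version MPI.jl is currently configured to use.
MPI.versioninfo()
```

Only one of the two options is needed. Whichever you pick, the launcher you call from the shell has to belong to the same MPI that Julia loads.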

To make sure that the configuration is correct, test your MPI with a simple allreduce operation:

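using MPI
MPI.Init()

# This process's rank (0-based) and the total number of ranks
rank = MPI.Comm_rank(MPI.COMM_WORLD)
nranks = MPI.Comm_size(MPI.COMM_WORLD)

# Each rank fills only its own slot; all other entries stay zero
ranks = zeros(Int, nranks)
ranks[rank+1] = rank + 1

# Sum across ranks, so every rank ends up with [1, 2, ..., nranks]
MPI.Allreduce!(ranks, +, MPI.COMM_WORLD)

@info rank ranks

MPI.Finalize()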

If your result is different from

(base) simonesilvestri@Simones-MacBook-Pro Oceananigans.jl % mpiexecjl -np 3 julia test.jl
┌ Info: 1
│   ranks =
│    3-element Vector{Int64}:
│     1
│     2
└     3
┌ Info: 2
│   ranks =
│    3-element Vector{Int64}:
│     1
│     2
└     3
┌ Info: 0
│   ranks =
│    3-element Vector{Int64}:
│     1
│     2
└     3

it means your MPI is not configured correctly.

simone-silvestri Dec 12, 2024
Maintainer

Looking at your output, it seems that you are not actually partitioned correctly, especially here:

Partition{Nothing, Nothing, Nothing}

This suggests that your MPI.Comm_size(MPI.COMM_WORLD) == 1
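A quick way to check this from inside a script is to compare the communicator size with the number of ranks you asked the launcher for. The sketch below uses a hypothetical helper, `diagnose_launch`, which is not part of MPI.jl or Oceananigans; the interpretation of a size of 1 as a launcher/library mismatch is the most common cause but is an assumption, not a certainty.

```julia
# Hypothetical helper (not part of MPI.jl or Oceananigans): interpret the
# communicator size reported by a process, given the rank count requested
# from the launcher with `-np`.
function diagnose_launch(nranks::Integer, requested::Integer)
    if nranks == requested
        return :ok
    elseif nranks == 1
        # Every process believes it is alone. This typically happens when the
        # launcher belongs to a different MPI than the one Julia has loaded.
        return :launcher_mismatch
    else
        return :unexpected_size
    end
end

# Usage in a script launched with `mpiexecjl -np 4 julia --project script.jl`:
#   using MPI
#   MPI.Init()
#   status = diagnose_launch(MPI.Comm_size(MPI.COMM_WORLD), 4)
#   @info "MPI launch check" rank = MPI.Comm_rank(MPI.COMM_WORLD) status
```

If every process reports `:launcher_mismatch`, the run is effectively N independent serial jobs, which is consistent with the `Partition{Nothing, Nothing, Nothing}` in the error message.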

simone-silvestri Dec 12, 2024
Maintainer

However, there does seem to be an issue: if I run the validation with mpiexecjl -np 1, I can reproduce the error.
I will create an issue from this discussion, because this looks like a bug.

francispoulin Dec 12, 2024
Collaborator Author

Thank you @simone-silvestri. You were correct, our MPI was not set up correctly.

We installed mpiexecjl and that worked on one of the two servers we tried. We will contact the admin people to get it working on the second.

francispoulin Dec 12, 2024
Collaborator Author

I'm glad this was useful for finding a bug and happy to try something if that would help.
