Update CI Julia version to 1.12.0 #4836
Conversation
I am very interested in this. Let's hope it works and we can move on from Julia 1.10.
    
I am disabling the reactant tests for the moment to check if the rest works.
    
If docs still break on the …
    
Seems that we are hitting the same NaN issue on the internal tide example.

The ghosts of the past still haunt us…
    
Apparently also …
    
If I run the example locally, it works. Why would it error on CI? Do we have a way to reproduce this error locally?

One thing to try might be to run the example locally and on CI using the exact same Manifest.toml, if possible. We can commit a Manifest.toml to this branch for debugging. I can't think of which dependency would lead to such a big difference, but it's one thing we can control for.
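A minimal sketch of how that could look locally, assuming the debugging Manifest.toml is committed next to the Project.toml at the root of this branch (path and layout are assumptions for illustration):

```julia
# Sketch: reproduce the CI environment locally from a committed Manifest.toml.
# Assumes the Manifest.toml sits next to the Project.toml at the repository root.
using Pkg

Pkg.activate(".")    # activate the project carrying the committed Manifest.toml
Pkg.instantiate()    # install exactly the versions recorded in the Manifest
Pkg.status()         # print the resolved versions to compare against the CI log
```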
    
From the Julia v1.11 chat I recall that the error was showing up only on Linux, not on macOS?
    
With this environment (manifests for v1.11 and v1.12 both included) …
    
I can make the simulation error early with

```diff
diff --git a/src/Diagnostics/nan_checker.jl b/src/Diagnostics/nan_checker.jl
index 57945c5dc..893a9e283 100644
--- a/src/Diagnostics/nan_checker.jl
+++ b/src/Diagnostics/nan_checker.jl
@@ -5,7 +5,7 @@ mutable struct NaNChecker{F}
     erroring :: Bool
 end
 
-NaNChecker(fields) = NaNChecker(fields, false) # default
+NaNChecker(fields) = NaNChecker(fields, true) # default
 default_nan_checker(model) = nothing
 
 function Base.summary(nc::NaNChecker)
@@ -28,7 +28,7 @@ a container with key-value pairs like a dictionary or `NamedTuple`.
 
 If `erroring=true`, the `NaNChecker` will throw an error on NaN detection.
 """
-NaNChecker(; fields, erroring=false) = NaNChecker(fields, erroring)
+NaNChecker(; fields, erroring=true) = NaNChecker(fields, erroring)
 
 hasnan(field::AbstractArray) = any(isnan, parent(field))
 hasnan(model) = hasnan(first(fields(model)))
```

I presume there's also a way to set the … Can we use a callback to print out to file all the steps, so that we can compare 1:1 the progress on different machines? Presumably we're initially interested in the field …
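For the callback idea, a hedged sketch (the file name, callback name, and the particular statistics logged are made up for illustration; it assumes `simulation` is built as in examples/internal_tide.jl):

```julia
using Printf
using Oceananigans  # for Callback, IterationInterval, interior

# Append a few summary statistics of u every iteration, so runs on different
# machines can be diffed line by line.
function log_u_stats(sim)
    u = interior(sim.model.velocities.u)
    open("u_stats.txt", "a") do io
        @printf(io, "%d %.17g %.17g %.17g\n",
                sim.model.clock.iteration, minimum(u), maximum(u), sum(u) / length(u))
    end
    return nothing
end

simulation.callbacks[:u_stats] = Callback(log_u_stats, IterationInterval(1))
```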
    
Before Oceananigans.jl/examples/internal_tide.jl line 175 (at ea25179):

```julia
julia> simulation.model.velocities.u
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.281029, min=0.281029, mean=0.281029
```

on both machines. If I'm looking at the right field and this display says enough about it, then they're the same at the beginning, but then on macOS I have

```julia
julia> time_step!(simulation); simulation.model.velocities.u
[ Info: Initializing simulation...
[ Info: Iter: 0, time: 0 seconds, wall time: 2.256 minutes, max|w|: 2.089e-03, m s⁻¹
[ Info:     ... simulation initialization complete (887.307 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (128.489 ms).
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.31715, min=0.265116, mean=0.280967

julia> time_step!(simulation); simulation.model.velocities.u
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.335864, min=0.264486, mean=0.280859
```

and on Ubuntu

```julia
julia> time_step!(simulation); simulation.model.velocities.u
[ Info: Initializing simulation...
[ Info: Iter: 0, time: 0 seconds, wall time: 2.391 minutes, max|w|: 2.089e-03, m s⁻¹
[ Info:     ... simulation initialization complete (1.130 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (20.645 ms).
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.31715, min=0.265116, mean=0.280967

julia> time_step!(simulation); simulation.model.velocities.u
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.333478, min=0.264645, mean=0.280863
```

so there's a significant divergence already after two timesteps.

Update:

```julia
julia> time_step!(simulation); simulation.model.velocities.u
[ Info: Initializing simulation...
[ Info: Iter: 0, time: 0 seconds, wall time: 2.269 minutes, max|w|: 2.089e-03, m s⁻¹
[ Info:     ... simulation initialization complete (11.788 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (12.640 seconds).
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.31715, min=0.265116, mean=0.280967

julia> time_step!(simulation); simulation.model.velocities.u
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.335864, min=0.264486, mean=0.280859
```

is also what I see on Ubuntu with Julia v1.10, which is consistent with all versions of Julia on macOS.
    
The plot thickens: it works correctly in Julia v1.12 on Ampere eMAG (aarch64) with AlmaLinux 8.10 as operating system, which rules out an operating system difference. aarch64 is also the architecture on macOS, so I'm starting to suspect there's an architecture dependence. Can someone point me to the operation performed on the …
    
          
Nice work so far though!! The entire time-step is a complex chain of operations. I do think it is a good start to save down all fields every time-step. We may find that differences arise in one field versus another. Note that the NaNChecker checks …
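If both machines save all fields, something along these lines could locate the first differing output (a sketch only: the file names and the output name "u" are placeholders, and it assumes both runs wrote the same JLD2 outputs):

```julia
using Oceananigans  # for FieldTimeSeries and interior

# Walk through two saved runs and report the first output index where a field differs.
function first_divergence(file_a, file_b, name)
    a = FieldTimeSeries(file_a, name)
    b = FieldTimeSeries(file_b, name)
    for n in 1:length(a.times)
        Δ = maximum(abs, interior(a[n]) .- interior(b[n]))
        Δ > 0 && return (output_index = n, maxdiff = Δ)
    end
    return nothing
end

first_divergence("internal_tide_macos.jld2", "internal_tide_ubuntu.jld2", "u")
```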
    
To save every iteration, change this line (Oceananigans.jl/examples/internal_tide.jl line 170 at ea25179) to … (see the sketch below). The difference should arise in the very first time-step? We could compare those. It seems annoyingly laborious to do this across architectures, but maybe @giordano you have good ideas on how to do this efficiently.
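A hedged sketch of that change, assuming the example's output writer is the JLD2 one (the constructor is JLD2OutputWriter in the versions in use here; the writer name and filename below are placeholders, and `model`/`simulation` are assumed to be set up as in the example):

```julia
# Save every time step instead of the example's coarser schedule, so the very
# first steps can be compared one-to-one across machines.
simulation.output_writers[:debug_fields] =
    JLD2OutputWriter(model, merge(model.velocities, model.tracers);
                     filename = "internal_tide_debug.jld2",
                     schedule = IterationInterval(1),
                     overwrite_existing = true)
```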
    
We made some progress during a pair-debugging session with @simone-silvestri (during which we discovered the typo fixed by #4901): we found that already after the first step the pressure is different. The pressure is updated by Oceananigans.jl/src/Models/NonhydrostaticModels/update_hydrostatic_pressure.jl lines 12 to 20 (at 4265add). Replacing z_dot_g_bᶜᶜᶠ with

```julia
@inline my_z_dot_g_bᶜᶜᶠ(i, j, k, grid::Oceananigans.Grids.AbstractGrid{FT}, bf, C) where FT = FT(0.5) * (C.b[i, j, k] + C.b[i, j, k-1])
```

seems to be a workaround, but that's puzzling because it's pretty much the same as the original definition, just without the `@inbounds`. Putting the `@inbounds` in the definition of `my_z_dot_g_bᶜᶜᶠ` makes the pressure diverge and the NaNs pop up again. I still don't have a full explanation of what's happening, nor of why this happens only on some systems.
    
The issue with Julia v1.11+ seems to be fixed by JuliaGPU/KernelAbstractions.jl#653, resolved by @vchuravy independently of the Oceananigans troubles, but very timely nonetheless 😁
    
I will switch to a local accumulator in that kernel like we did in the …
    
It worked! The simulation does not produce NaNs anymore. Now we have other problems to solve, but they all seem easier.
    
The local-accumulator change in update_hydrostatic_pressure.jl suggested in review:

```diff
-@inbounds pHY′[i, j, grid.Nz] = - z_dot_g_bᶜᶜᶠ(i, j, grid.Nz+1, grid, buoyancy, C) * Δzᶜᶜᶠ(i, j, grid.Nz+1, grid)
+pᵏ = - z_dot_g_bᶜᶜᶠ(i, j, grid.Nz+1, grid, buoyancy, C) * Δzᶜᶜᶠ(i, j, grid.Nz+1, grid)
+@inbounds pHY′[i, j, grid.Nz] = pᵏ

 for k in grid.Nz-1 : -1 : 1
-    @inbounds pHY′[i, j, k] = pHY′[i, j, k+1] - z_dot_g_bᶜᶜᶠ(i, j, k+1, grid, buoyancy, C) * Δzᶜᶜᶠ(i, j, k+1, grid)
+    pᵏ -= z_dot_g_bᶜᶜᶠ(i, j, k+1, grid, buoyancy, C) * Δzᶜᶜᶠ(i, j, k+1, grid)
+    @inbounds pHY′[i, j, k] = pᵏ
```
If we require KernelAbstractions.jl v0.9.39 we don't need the workaround anymore.
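A minimal sketch of what requiring it could look like, using Pkg's compat helper (whether to express the bound this way, or edit Project.toml by hand alongside the existing entries, is a judgment call for the PR):

```julia
# Sketch: raise the KernelAbstractions compat bound in Oceananigans' Project.toml
# so the fixed release (>= 0.9.39) is required and the kernel workaround can go.
using Pkg
Pkg.activate(".")                              # run from the Oceananigans.jl repo root
Pkg.compat("KernelAbstractions", "0.9.39")     # writes KernelAbstractions = "0.9.39" to [compat]
```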
Doctest failures are due to #4840 (comment).