Hi all,

I have been trying to work out how to make my simulations more performant. On relatively medium-sized grids (2^19 points) with a few tracers (8 to 12), the GPU I was using (an A100) is not limited by compute (~30% utilisation), memory (a few GB), or bandwidth. I came across the suggestion that this happens when you launch a lot of very small kernels, so the GPU has to wait for the CPU to give it new instructions. I'm sure there's a proper way to confirm this with Nsight, but I haven't got that working yet.
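One way to check the launch-overhead hypothesis without a full Nsight setup is CUDA.jl's integrated profiler. This is a hedged sketch: it assumes a CUDA-backed `model` and a `Δt` have already been constructed elsewhere.

```julia
using CUDA, Oceananigans

# CUDA.@profile prints a summary of host-side activity and device-side
# kernel time for the wrapped call. Many kernel launches with only
# microsecond-scale device times, alongside a large host total, suggest
# the step is launch-latency bound rather than compute or bandwidth bound.
CUDA.@profile time_step!(model, Δt)
```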
Looking first at the `NonhydrostaticModel` code, the obvious place to start is that we launch separate kernels for each velocity component and each tracer for a) tendency computation, b) timestepping, c) each flux BC, and probably more. As a proof of concept, I've put all the tracer time-step operations into one kernel (we can't fold the velocity components in too, since they have to `exclude_periphery`): see `Oceananigans.jl/src/Models/NonhydrostaticModels/nonhydrostatic_rk3_substep.jl`, lines 46 to 54 in d295c1b, vs lines 39 to 53 in 1f0d6e7 on main. I also put all the tracer tendency computations into one kernel (`Oceananigans.jl/src/Models/NonhydrostaticModels/compute_nonhydrostatic_tendencies.jl`, lines 157 to 187 in d295c1b). This was a little more complicated due to the indexing into a bunch of different places, but this `@generated` solution seems to work; maybe there is a simpler way.
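To illustrate the kind of `@generated` fusion described above, here is a minimal CPU-only sketch (the names are mine, not the actual branch code): the tuple of tracer fields is unrolled at compile time, so one kernel body updates every tracer without a dynamic loop over heterogeneous fields.

```julia
# Minimal sketch of compile-time unrolling over a tuple of tracer fields.
# `tracers` and `tendencies` are tuples of arrays; `i` is a linear index
# standing in for the (i, j, k) index a real kernel would receive.
@generated function substep_all_tracers!(tracers::NTuple{N, Any},
                                         tendencies::NTuple{N, Any},
                                         i, Δt) where N
    # One update expression per tracer, with the tuple indexed by constants
    # so each field's concrete type is known at compile time.
    updates = [:(@inbounds tracers[$n][i] += Δt * tendencies[$n][i]) for n in 1:N]
    return quote
        $(updates...)
        return nothing
    end
end

# CPU usage example with two "tracers":
c1, c2 = [1.0], [2.0]
G1, G2 = [0.5], [1.0]
substep_all_tracers!((c1, c2), (G1, G2), 1, 0.1)
# c1[1] == 1.05, c2[1] == 2.1
```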
On an A4500 these are the changes I get:

| Main/branch | Grid | Forcing | Time per timestep (`@btime`) |
|---|---|---|---|
| Main | 32^3 | no | 5.295 ms |
| Branch | 32^3 | no | 3.640 ms |
| Main | 256^3 | no | 5.243 s |
| Branch | 256^3 | no | 5.786 s |
| Main | 32^3 | yes | 5.827 ms |
| Branch | 32^3 | yes | 3.714 ms |
| Main | 256x256x8 | yes | 8.755 ms |
| Branch | 256x256x8 | yes | 4.052 ms |
The changes also reduced allocations in all cases but I forgot to write them all down.
The forcing I used to make the tendency more expensive was `(x, y, z, t) -> begin; for n in 1:100; sin(x) * cos(y) * tan(z) * sinh(t); end; return zero(x); end`. The final case was meant to be representative of the model I was originally trying to run.
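For readability, the one-liner forcing can be written out as a named function (a hedged transcription; `expensive_forcing` is my name for it, not something from the source):

```julia
# The loop's result is discarded, so the forcing contributes nothing to
# the tendency; it exists only to add floating-point work per grid point.
function expensive_forcing(x, y, z, t)
    for n in 1:100
        sin(x) * cos(y) * tan(z) * sinh(t)
    end
    return zero(x)
end

expensive_forcing(1.0, 2.0, 3.0, 4.0)  # returns 0.0
```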
I'm not sure there are any downsides to this except, sometimes, code readability, but I wonder if I'm missing something, or whether others would support these kinds of changes? I think there are cases where this won't make much performance difference, but I don't think it's going to make anything slower.