Hi all,

I have been trying to work out how to make my simulations more performant. On relatively medium-sized grids (2^19 points) with a few tracers (8 to 12), the GPU I was using (an A100) is not limited by compute (~30% utilisation), memory (a few GB), or bandwidth. I came across the suggestion that this happens when you launch a lot of very small kernels, so the GPU has to wait for the CPU to give it new instructions. I'm sure there's a proper way to confirm this with Nsight, but I haven't got that working yet.
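One way to check the launch-overhead hypothesis without a full Nsight setup is CUDA.jl's integrated profiler. This is a hedged sketch: it assumes a CUDA-backed `model` and a `Δt` have already been constructed elsewhere.

```julia
using CUDA, Oceananigans

# CUDA.@profile prints a summary of host-side activity and device-side
# kernel time for the wrapped call. Many kernel launches with only
# microsecond-scale device times, alongside a large host total, suggest
# the step is launch-latency bound rather than compute or bandwidth bound.
CUDA.@profile time_step!(model, Δt)
```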
Looking first at the `NonhydrostaticModel` code, the obvious place to start is that we launch separate kernels for each velocity component and each tracer for a) tendency computation, b) timestepping, c) each flux BC, and probably more. As a proof of concept, I've put all the tracer time-step operations into one kernel (we can't fold the velocity components in too, since they have to `exclude_periphery`): see `Oceananigans.jl/src/Models/NonhydrostaticModels/nonhydrostatic_rk3_substep.jl`, lines 46 to 54 in d295c1b, vs lines 39 to 53 in 1f0d6e7 on main. I also put all the tracer tendency computations into one kernel (`Oceananigans.jl/src/Models/NonhydrostaticModels/compute_nonhydrostatic_tendencies.jl`, lines 157 to 187 in d295c1b). This was a little more complicated due to the indexing into a bunch of different places, but this `@generated` solution seems to work; maybe there is a simpler way.
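To illustrate the kind of `@generated` fusion described above, here is a minimal CPU-only sketch (the names are mine, not the actual branch code): the tuple of tracer fields is unrolled at compile time, so one kernel body updates every tracer without a dynamic loop over heterogeneous fields.

```julia
# Minimal sketch of compile-time unrolling over a tuple of tracer fields.
# `tracers` and `tendencies` are tuples of arrays; `i` is a linear index
# standing in for the (i, j, k) index a real kernel would receive.
@generated function substep_all_tracers!(tracers::NTuple{N, Any},
                                         tendencies::NTuple{N, Any},
                                         i, Δt) where N
    # One update expression per tracer, with the tuple indexed by constants
    # so each field's concrete type is known at compile time.
    updates = [:(@inbounds tracers[$n][i] += Δt * tendencies[$n][i]) for n in 1:N]
    return quote
        $(updates...)
        return nothing
    end
end

# CPU usage example with two "tracers":
c1, c2 = [1.0], [2.0]
G1, G2 = [0.5], [1.0]
substep_all_tracers!((c1, c2), (G1, G2), 1, 0.1)
# c1[1] == 1.05, c2[1] == 2.1
```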
On an A4500 these are the changes I get:

| Main/branch | Grid | Forcing | Time per timestep (`@btime`) |
|---|---|---|---|
| Main | 32^3 | no | 5.295 ms |
| Branch | 32^3 | no | 3.640 ms |
| Main | 256^3 | no | 5.243 s |
| Branch | 256^3 | no | 5.786 s |
| Main | 32^3 | yes | 5.827 ms |
| Branch | 32^3 | yes | 3.714 ms |
| Main | 256x256x8 | yes | 8.755 ms |
| Branch | 256x256x8 | yes | 4.052 ms |
The changes also reduced allocations in all cases but I forgot to write them all down.
The forcing I used to make the tendency more expensive was `(x, y, z, t) -> begin; for n in 1:100; sin(x) * cos(y) * tan(z) * sinh(t); end; return zero(x); end`. The final case was meant to be representative of the model I was originally trying to run.
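For readability, the one-liner forcing can be written out as a named function (a hedged transcription; `expensive_forcing` is my name for it, not something from the source):

```julia
# The loop's result is discarded, so the forcing contributes nothing to
# the tendency; it exists only to add floating-point work per grid point.
function expensive_forcing(x, y, z, t)
    for n in 1:100
        sin(x) * cos(y) * tan(z) * sinh(t)
    end
    return zero(x)
end

expensive_forcing(1.0, 2.0, 3.0, 4.0)  # returns 0.0
```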
I'm not sure there are any downsides to this except, sometimes, code readability, but I wonder if I'm missing something, or whether others would support these kinds of changes? I think there are cases where this won't make much performance difference, but I don't think it's going to make anything slower.