Skip to content

Conversation

charleskawczynski
Copy link
Member

@charleskawczynski charleskawczynski commented Apr 18, 2025

After reviewing performance in the land model, I've come to accept that performance characteristics using multi-dimensional launch configurations are quite sharp. For example, we see good performance with high resolution simulations, but excessively poor if the vertical resolution is low.

All three parts of the launch config and index management:

  • Reasoning about the launch configuration
  • Computing the universal index from the thread / block IDs
  • and checking for valid indices

are entangled, and increasing complexity of the launch configuration requires increasing complexity of the other two.

So, I took a second look at the previous design that I had put in place: always use linear launch configurations and use CartesianIndices. The issue with that is that CartesianIndices(...)[::Int] results in integer division, which is slow and can hurt performance by as much as 2x.

Fortunately, I managed to get something that seems to be working, and I started the registration process for ClimaCartesianIndices.jl, which defines FastCartesianIndices, a drop-in replacement for CartesianIndices. The difference is that FastCartesianIndices(...)[::Int] avoids integer division by using bit tricks.

See the ClimaCartesianIndices.jl documentation for more information. We could reach near linear indexing speeds by passing FastCartesianIndices through to kernels via Val, but that will allocate because Nh is not in the type domain (we removed it because it, for some unknown reason, spiked compilation times). For now, I think it's still a better trade-off to use CartesianIndices and have more robust performance.

As a result, I'm going to revert most of our multi-dimensional launch configurations to linear ones, and start using FastCartesianIndices.

For now, I'm going to skip the kernels that use shared memory (SEM and FD shmem), since we never had linear launch configurations for those to begin with, and I'm not yet sure how to make that work. That will likely be next on the ticket.

Overall, these changes should yield good performance improvements for the land model, and recover performance losses for our lower resolution experiments, across the board.

@charleskawczynski charleskawczynski force-pushed the ck/cartesian_indices branch 4 times, most recently from 8c4f04c to 71704ec Compare April 18, 2025 15:07
@charleskawczynski
Copy link
Member Author

This PR should fix the catastrophic performance of the surface kernel that we saw in the land (cc @kmdeck). And disabling shared memory will actually fix the other issue, since that will also result in a linear launch configuration. By setting Operators.use_fd_shmem() = false right after loading ClimaCore in the driver (i.e., in global scope).

We should figure out how to make the shared memory kernels more robust for different resolutions, but I think this PR at least addresses fixes for every other case.

@charleskawczynski charleskawczynski marked this pull request as ready for review April 18, 2025 15:51
@charleskawczynski
Copy link
Member Author

When I said sharp, I meant sharp 🔪. Here are results for the benchmark added in #2294 (look at time distributions):

shmem launch config (very, very bad for this resolution)

Device-side activity: GPU was busy for 171.26 ms (0.37% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name                                                                                          
├──────────┼────────────┼───────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────
│    0.37%171.26 ms │     442.82 ms ± 1.22   ( 42.1744.64) │ copyto_stencil_kernel_shmem_(Field<VIJFH<BandMatrixRow<-1, 3, Float64>, 10, 4, CuDeviceArray< 
└──────────┴────────────┴───────┴─────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────

datalayout launch config (better for this resolution) (done with main branch + disabling shmem)

Device-side activity: GPU was busy for 3.28 ms (0.01% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                                                         
├──────────┼────────────┼───────┼──────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────
│    0.01%3.28 ms │     4820.46 µs ± 4.44   (813.96823.97) │ copyto_stencil_kernel_(Field<VIJFH<BandMatrixRow<-1, 3, Float64>, 10, 4, CuDeviceArray<Float 
└──────────┴────────────┴───────┴──────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────

This PR + disabling shmem:

Device-side activity: GPU was busy for 1.8 ms (0.01% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                                                         
├──────────┼────────────┼───────┼──────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────
│    0.01%1.8 ms │     4449.66 µs ± 2.91   (447.03452.52) │ copyto_stencil_kernel_(Field<VIJFH<BandMatrixRow<-1, 3, Float64>, 10, 4, CuDeviceArray<Float 
└──────────┴────────────┴───────┴──────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────

I'm not sure if it will make a difference for this kernel in particular, but this is with FastCartesianIndices(x) = CartesianIndices(x), as ClimaCartesianIndices.jl is not yet registered.

@charleskawczynski charleskawczynski merged commit 81645c9 into main Apr 18, 2025
33 of 35 checks passed
@charleskawczynski charleskawczynski deleted the ck/cartesian_indices branch April 18, 2025 18:23
@kmdeck
Copy link
Member

kmdeck commented Apr 23, 2025

This is exciting, @charleskawczynski ! Nice job!

I'll try it out in land today

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants