Tune kernels for use with FastCartesianIndices #2296

charleskawczynski · 2025-04-18T13:18:41Z

After reviewing performance in the land model, I've come to accept that the performance characteristics of multi-dimensional launch configurations are quite sharp. For example, we see good performance with high-resolution simulations, but very poor performance when the vertical resolution is low.

The three parts of launch configuration and index management:

Reasoning about the launch configuration
Computing the universal index from the thread / block IDs
Checking for valid indices

are entangled: increasing the complexity of the launch configuration requires increasing the complexity of the other two.
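To make the three pieces concrete, here is a minimal CPU-side sketch of a linear (1D) launch configuration. It is not ClimaCore's kernel code: `emulate_linear_launch` and its `threads` keyword are hypothetical names for illustration, and the nested loop merely stands in for the GPU's block and thread IDs.

```julia
# Hypothetical CPU emulation of a linear GPU launch (illustration only).
function emulate_linear_launch(dims::NTuple{N,Int}; threads::Int = 256) where {N}
    n = prod(dims)
    # 1. Launch configuration: enough blocks to cover all n points.
    blocks = cld(n, threads)
    CI = CartesianIndices(dims)
    visited = zeros(Int, dims)
    for block in 1:blocks, thread in 1:threads
        # 2. Universal (linear) index from the block / thread IDs.
        i = (block - 1) * threads + thread
        # 3. Validity check: the last block may overshoot n.
        i <= n || continue
        visited[CI[i]] += 1   # CI[i] maps the linear index to (i1, i2, i3)
    end
    return blocks, visited
end

blocks, visited = emulate_linear_launch((4, 10, 3); threads = 32)
```

With a linear launch, all three steps stay the same regardless of the array's shape; only the conversion `CI[i]` depends on the dimensions.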

So, I took a second look at the previous design that I had put in place: always use linear launch configurations and use CartesianIndices. The issue with that is that CartesianIndices(...)[::Int] results in integer division, which is slow and can hurt performance by as much as 2x.
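As a rough illustration of where the integer division comes from, converting a 1-based linear index into a column-major Cartesian index amounts to a chain of `divrem` calls. The helper `unflatten_divrem` below is a hypothetical name written for this sketch; it is not how Base or ClimaCore spells the conversion, only an equivalent computation for the 3D case.

```julia
# Hypothetical helper: the arithmetic behind CartesianIndices(dims)[i] in 3D.
function unflatten_divrem(i::Int, dims::NTuple{3,Int})
    r = i - 1                        # switch to 0-based
    r, i1 = divrem(r, dims[1])       # one integer division per leading dimension
    r, i2 = divrem(r, dims[2])
    i3 = r                           # what is left indexes the last dimension
    return (i1 + 1, i2 + 1, i3 + 1)  # back to 1-based
end

dims = (4, 10, 3)
same = all(unflatten_divrem(i, dims) == Tuple(CartesianIndices(dims)[i])
           for i in 1:prod(dims))
```

Each `divrem` has a divisor known only at run time, which is the expensive operation on a GPU.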

Fortunately, I managed to get something that seems to be working, and I started the registration process for ClimaCartesianIndices.jl, which defines FastCartesianIndices, a drop-in replacement for CartesianIndices. The difference is that FastCartesianIndices(...)[::Int] avoids integer division by using bit tricks.
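The exact bit tricks used by ClimaCartesianIndices.jl are not shown in this thread. One well-known way to avoid a hardware division when the divisor is fixed ahead of time is to precompute a "magic" multiplier and replace the division with a multiply and a shift. The sketch below assumes that general technique; `MagicDivisor`, `fastdiv`, and `fast_unflatten` are hypothetical names, not the package's API, and the CPU-side loop allocates, which a real kernel would avoid.

```julia
# Hypothetical sketch of division by a precomputed multiplier and shift.
# Valid for 1 <= d < 2^31 and 0 <= n < 2^31 (so n * multiplier fits in UInt64).
struct MagicDivisor
    divisor::UInt64
    multiplier::UInt64
    shift::Int
end

function MagicDivisor(d::Integer)
    1 <= d < 2^31 || throw(ArgumentError("divisor must satisfy 1 <= d < 2^31"))
    l = 8 * sizeof(UInt64) - leading_zeros(UInt64(d - 1))  # ceil(log2(d))
    shift = 31 + l
    multiplier = cld(UInt64(1) << shift, UInt64(d))        # ceil(2^shift / d)
    return MagicDivisor(UInt64(d), multiplier, shift)
end

# floor(n / d) using one multiplication and one shift, no division.
fastdiv(n::UInt64, md::MagicDivisor) = (n * md.multiplier) >> md.shift

# Linear index -> column-major Cartesian tuple, one MagicDivisor per dimension.
function fast_unflatten(i::Integer, divs)
    r = UInt64(i - 1)
    idx = Int[]
    for md in divs
        q = fastdiv(r, md)
        push!(idx, Int(r - q * md.divisor) + 1)  # remainder, back to 1-based
        r = q
    end
    return Tuple(idx)
end

dims = (4, 10, 3)
divs = map(MagicDivisor, dims)   # the division cost is paid once, up front
same = all(fast_unflatten(i, divs) == Tuple(CartesianIndices(dims)[i])
           for i in 1:prod(dims))
```

The design point is that the multipliers are computed once on the host when the index object is built, so the per-thread work inside a kernel is only multiplies, shifts, and subtractions.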

See the ClimaCartesianIndices.jl documentation for more information. We could reach speeds close to those of linear indexing by passing FastCartesianIndices through to kernels via Val, but that will allocate because Nh is not in the type domain (we removed it because, for some unknown reason, it spiked compilation times). For now, I think it's still a better trade-off to use CartesianIndices and have more robust performance.

As a result, I'm going to revert most of our multi-dimensional launch configurations to linear ones, and start using FastCartesianIndices.

For now, I'm going to skip the kernels that use shared memory (SEM and FD shmem), since we never had linear launch configurations for those to begin with, and I'm not yet sure how to make that work. That will likely be next on the ticket.

Overall, these changes should yield good performance improvements for the land model, and recover performance losses for our lower resolution experiments, across the board.

charleskawczynski · 2025-04-18T15:51:02Z

This PR should fix the catastrophic performance of the surface kernel that we saw in the land model (cc @kmdeck). Disabling shared memory will fix the other issue, since that also results in a linear launch configuration. This can be done by setting Operators.use_fd_shmem() = false right after loading ClimaCore in the driver (i.e., in global scope).
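In a driver script, that override might look like the following fragment. It assumes ClimaCore is installed and that `Operators` is brought into scope with an `import`; only the `Operators.use_fd_shmem() = false` line comes from the comment above.

```julia
using ClimaCore
import ClimaCore: Operators   # assumption: bring the submodule name into scope

# Defined in global scope, right after loading ClimaCore, as described above.
Operators.use_fd_shmem() = false
```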

We should figure out how to make the shared memory kernels more robust across resolutions, but I think this PR at least fixes every other case.

charleskawczynski · 2025-04-18T16:11:12Z

When I said sharp, I meant sharp 🔪. Here are results for the benchmark added in #2294 (look at time distributions):

shmem launch config (very, very bad for this resolution)

Device-side activity: GPU was busy for 171.26 ms (0.37% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name                                                                                          ⋯
├──────────┼────────────┼───────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────
│    0.37% │  171.26 ms │     4 │  42.82 ms ± 1.22   ( 42.17 ‥ 44.64) │ copyto_stencil_kernel_shmem_(Field<VIJFH<BandMatrixRow<-1, 3, Float64>, 10, 4, CuDeviceArray< ⋯
└──────────┴────────────┴───────┴─────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────

datalayout launch config (better for this resolution) (done with main branch + disabling shmem)

Device-side activity: GPU was busy for 3.28 ms (0.01% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                                                         ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────
│    0.01% │    3.28 ms │     4 │ 820.46 µs ± 4.44   (813.96 ‥ 823.97) │ copyto_stencil_kernel_(Field<VIJFH<BandMatrixRow<-1, 3, Float64>, 10, 4, CuDeviceArray<Float ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────

This PR + disabling shmem:

Device-side activity: GPU was busy for 1.8 ms (0.01% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                                                         ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────
│    0.01% │     1.8 ms │     4 │ 449.66 µs ± 2.91   (447.03 ‥ 452.52) │ copyto_stencil_kernel_(Field<VIJFH<BandMatrixRow<-1, 3, Float64>, 10, 4, CuDeviceArray<Float ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────

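For this kernel, the mean time per call drops from 42.82 ms (shmem) to 820.46 µs (datalayout), about 52x, and then to 449.66 µs with this PR, a further ~1.8x. I'm not sure if it will make a difference for this kernel in particular, but this is with FastCartesianIndices(x) = CartesianIndices(x), as ClimaCartesianIndices.jl is not yet registered.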

kmdeck · 2025-04-23T16:42:35Z

This is exciting, @charleskawczynski ! Nice job!

I'll try it out in land today

charleskawczynski added performance GPU labels Apr 18, 2025

charleskawczynski force-pushed the ck/cartesian_indices branch 4 times, most recently from 8c4f04c to 71704ec Compare April 18, 2025 15:07

charleskawczynski marked this pull request as ready for review April 18, 2025 15:51

Tune kernels for use with FastCartesianIndices

3aa0fd0

charleskawczynski force-pushed the ck/cartesian_indices branch from 71704ec to 3aa0fd0 Compare April 18, 2025 17:09

charleskawczynski merged commit 81645c9 into main Apr 18, 2025
33 of 35 checks passed

charleskawczynski deleted the ck/cartesian_indices branch April 18, 2025 18:23

charleskawczynski mentioned this pull request Apr 21, 2025

FD shmem thread-block configuration needs tuned #2305

Open


3 participants
