Name CUDA kernels based on stack trace in `auto_launch` #2376

petebachant · 2025-10-13T12:37:38Z

This PR adds stack trace-based CUDA kernel naming to make performance benchmarks more informative. Kernel names are computed once per session and cached in a module-level Dict.

Code follows the style guidelines OR N/A.
Unit tests are included OR N/A.
Code is exercised in an integration test OR N/A.
Documentation has been added/updated OR N/A.

TODO:

Confirm we get useful results, i.e., that we don't spend too much time computing the kernel name and wash out the timeline.

Example results from ClimaAtmos Buildkite

From here

petebachant · 2025-10-14T16:55:08Z

Added some example results, which look suspiciously different. Reminds me of Simon's comment about the CUDA API not doing much in the previous implementation that computed the name every time a kernel was created. Do we need to somehow hydrate the kernel name cache initially to get a useful result here?

imreddyTeja · 2025-10-15T17:16:18Z

> Added some example results, which look suspiciously different. Reminds me of Simon's comment about the CUDA API not doing much in the previous implementation that computed the name every time a kernel was created. Do we need to somehow hydrate the kernel name cache initially to get a useful result here?

Are these profile results from the first step, or a later step? The kernel renaming example takes significantly longer, which makes me think stacktrace is getting called.

petebachant · 2025-10-20T17:21:37Z

~~I think I figured this out. The compile step was running before kernel renaming was enabled.~~ Nope, still taking way too long when this feature is enabled.

imreddyTeja · 2025-10-20T21:50:17Z

ext/cuda/cuda_utils.jl

+                    "_" *
+                    string(frame.linfo.def.file) *
+                    "_" *
+                    string(frame.line)


It might be a good idea to either handle the case where frame.linfo.def.file is inside NVTX.jl specially, or to leave a note here that filenames and line numbers of kernels will be incorrect if inside an NVTX annotation.

What do you think about the format below? If it's an NVTX annotation, it's easy to see that. Likewise, it's easy to tell if the name is a function defined in a file:

I like this a lot. If a lot of kernels are getting caught in the NVTX annotations, it might make sense to reduce the number of NVTX annotations in downstream packages

imreddyTeja · 2025-10-24T19:40:04Z

ext/cuda/cuda_utils.jl

+        kernel_name_exists = key in keys(kernel_names)
+        if !kernel_name_exists


These syntax changes may be less clear, so feel free to ignore

    kernel_name_exists = haskey(kernel_names, key)
    if !kernel_name_exists

or

    kernel_name = get!(kernel_names, key) do
        # calculate_name_here
    end
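As a minimal sketch of the `get!` variant (not the PR's actual code): `kernel_names`, `compute_name`, `cached_name`, and the counter below are hypothetical stand-ins, and the key is a plain `UInt` rather than a real method-instance id. The point is that the do-block runs only on a cache miss, so the expensive stack-trace work would happen once per key.

```julia
# Hypothetical module-level cache: key -> kernel name.
const kernel_names = Dict{UInt, String}()

# Counts how often the (stand-in) expensive computation runs.
const n_computed = Ref(0)

# Stand-in for building a name from a stack trace.
function compute_name(key::UInt)
    n_computed[] += 1
    return "kernel_" * string(key; base = 16)
end

# `get!` evaluates the do-block only when `key` is missing, then stores the result.
function cached_name(key::UInt)
    return get!(kernel_names, key) do
        compute_name(key)
    end
end

first_name = cached_name(UInt(0x2a))
second_name = cached_name(UInt(0x2a))  # cache hit: compute_name is not called again
```

Compared with the explicit `haskey` check, this does a single lookup-or-insert and keeps the name computation in one place.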

imreddyTeja · 2025-10-24T20:23:23Z

I just tried using this in ClimaLand, and it worked. This is so much easier than `dev`ing GPUCompiler.jl.

petebachant · 2025-10-24T20:57:30Z

The files don't make much sense in ClimaAtmos's buildkite runs since I guess they handle dev installs differently:

Maybe I should just split on src as a last resort.
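A rough sketch of what splitting on `src` could look like, assuming a hypothetical helper `short_file` (the PR's real function name and token rules are not shown in this thread): split the path with `splitpath`, keep everything from the last `src` component onward, and fall back to the bare file name if there is no `src`.

```julia
# Hypothetical helper: shorten a source path for use in a kernel name.
function short_file(path::AbstractString)
    parts = splitpath(path)
    # Index of the last path component that is exactly "src", if any.
    i = findlast(==("src"), parts)
    # No "src" component: fall back to the file name alone.
    i === nothing && return basename(path)
    # Keep "src" and everything after it.
    return joinpath(parts[i:end]...)
end

# Made-up depot-style path, built with joinpath so it works on any platform.
short = short_file(
    joinpath("depot", "packages", "SomePkg", "AbCdE", "src", "solver", "step.jl"),
)
```

Using the last `src` rather than the first is a design guess: it avoids picking up an unrelated `src` directory higher up in the depot or checkout path.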

petebachant · 2025-10-25T02:02:29Z

Breaking on src if we can't find any other tokens:

imreddyTeja · 2025-10-27T17:08:36Z

ext/cuda/cuda_utils.jl

+    kernel_name = nothing
+    if name_kernels_from_stack_trace()
+        # Create a key from the method instance and types of the args
+        key = objectid(methodinstance(typeof(f!), typeof(args)))


Suggested change

-        key = objectid(methodinstance(typeof(f!), typeof(args)))
+        key = objectid(methodinstance(F!, typeof(args)))

The function signature could also be changed to

    function auto_launch!(
        f!::F!,
        args::ARGS,
        nitems::Union{Integer, Nothing} = nothing;
        auto = false,
        threads_s = nothing,
        blocks_s = nothing,
        always_inline = true,
        caller = :unknown,
    ) where {F!, ARGS}
        . . .
        key = objectid(methodinstance(F!, ARGS))

but I think that forces specialization of the method (I'm not sure whether that is desirable here).
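To illustrate the two spellings (a sketch only: `key_from_values` and `key_from_params` are hypothetical names, and plain tuples of types stand in for `methodinstance`): with `where {F, ARGS}` in the signature, the type parameters are bound to the same types that `typeof` returns at run time. As far as I understand Julia's documented specialization heuristics, naming a type parameter this way also makes the method specialize on that argument, which the compiler may otherwise skip for a function argument that is only passed through.

```julia
# Uses typeof on the values at run time.
key_from_values(f, args) = (typeof(f), typeof(args))

# Binds the same types as static parameters of the method.
key_from_params(f::F, args::ARGS) where {F, ARGS} = (F, ARGS)

k1 = key_from_values(sin, (1, 2.0))
k2 = key_from_params(sin, (1, 2.0))
```

Both calls produce the same pair of types, so the cache key would be unchanged; the only difference is how aggressively the enclosing method is specialized.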

imreddyTeja · 2025-10-27T17:10:55Z

ext/cuda/cuda_utils.jl

 import ClimaCore.DataLayouts
 import ClimaCore.DataLayouts: empty_kernel_stats
+import ClimaCore.DebugOnly: name_kernels_from_stack_trace
+import CUDA.GPUCompiler: methodinstance


Not sure if importing a package through another package is recommended or not, and I can't find anything when I look online.

It does feel wrong. Would it be smarter to make GPUCompiler an explicit dependency of ClimaCore? Is there a nice way to make it optional for just the CUDA extension?

Reading a little more about this, I guess it would go in weakdeps?

I think ClimaLand is a good example of using extensions/weak deps. I'm not sure if there is a way to manage extensions with the package manager though (I never found a way for Julia 1.10), so you might need to edit the Project.toml by hand.

I was able to add the weak dep to ClimaCore with `add --weak` in Julia v1.11, but I don't seem to be able to add it to the extension.

petebachant · 2025-10-27T23:56:15Z

ext/cuda/cuda_utils.jl

+# Import from CUDA.GPUCompiler, since we can't depend on GPUCompiler directly
+# in an extension package
+import CUDA.GPUCompiler: methodinstance


This would probably be possible with the correct definitions in Project.toml, but is it worth adding all over the place?

It looks like CUDA.jl imports methodinstance internally, so you can just write CUDA.methodinstance to be consistent with the other CUDA._ function calls in this file.

I suppose this isn't a great practice in general, as it involves making use of another package's internal implementation details, but in this case I think it's fine. GPUCompiler.jl is a fundamental part of CUDA.jl, so that internal import is unlikely to go away any time soon. And yes, Julia doesn't allow us to add dependencies for extensions, at least until this issue is fixed.

dennisYatunin · 2025-10-31T01:27:12Z

> One hangup I have is the redefine-a-function-to-activate way of using this versus an environmental variable. The latter seems nicer since you can do it without changing your code (driver script?), but the former seems to be the typical pattern. What is the upside of the redefined function approach?

Both of these approaches (redefining a method and changing an environment variable) involve recompiling ClimaCore's CUDA extension to activate this feature, since the feature is disabled in the default version of ClimaCore we precompile for the CI depot. The difference is in how this recompilation is achieved:

- When a method is redefined, all other methods that depend on it (either directly or indirectly) are immediately recompiled. The compiler achieves this by tracing through the "backedges" that connect different MethodInstance objects, starting from the MethodInstance that was redefined.
- When an environment variable is changed, Julia does not have any mechanism for automatically detecting that change. So, if ClimaCore has a constant flag determined from an environment variable, the variable needs to be changed before ClimaCore is compiled in the current Julia session. Because the compiler is not able to redefine constant variables without starting a new Julia session, the flag will always remain unchanged after ClimaCore is loaded for the first time.

Since starting a new Julia session requires recompiling the Base library, the environment variable approach can end up needing more recompilation than the other approach. So, in the interest of minimizing total compilation time, we typically use method redefinitions to activate optional package features.

On the other hand, if your setup requires you to start a new Julia session anyway, it doesn't really matter which option you choose. Also, the environment variable approach I described assumes that your flag needs to be a compile-time constant; if you're okay with performing an environment lookup every time you access the flag, then the environment variable approach will probably be simpler to implement.
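A small sketch of the two activation styles, outside of ClimaCore and with made-up names throughout (`name_kernels`, `label`, and the environment variable `HYPOTHETICAL_NAME_KERNELS` are assumptions for illustration, not the PR's API). The first half toggles the feature by redefining a method at top level; the second half does a dynamic environment lookup on every call.

```julia
# Approach 1: the flag is a method. Redefining it invalidates its callers,
# so they are recompiled the next time they run.
name_kernels() = false
label() = name_kernels() ? "named" : "default"

before = label()
name_kernels() = true  # "activate" the feature by redefining the method
after = label()

# Approach 2: look the flag up in the environment on every call.
# "HYPOTHETICAL_NAME_KERNELS" is a made-up variable name.
name_kernels_env() = get(ENV, "HYPOTHETICAL_NAME_KERNELS", "false") == "true"
label_env() = name_kernels_env() ? "named" : "default"

# withenv temporarily sets (or, with `nothing`, unsets) the variable.
without_flag = withenv(label_env, "HYPOTHETICAL_NAME_KERNELS" => nothing)
with_flag = withenv(label_env, "HYPOTHETICAL_NAME_KERNELS" => "true")
```

The dynamic lookup needs no recompilation, at the cost of an `ENV` read on each call; the method-based flag can be constant-folded by the compiler, which is presumably why it is the more common pattern for hot paths.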

petebachant · 2025-11-04T16:22:31Z

Thanks @dennisYatunin. What do you think of the current implementation with the const? Is that acceptable for now?

petebachant added 6 commits October 7, 2025 11:34

Add a cache of CUDA kernel names created from the stack trace

78043ba

Create kernel name keys with arg types

687365e

Update kernel naming

26e4ad5

Update naming

3a94f7c

Disable default kernel naming

f3542d4

Put kernel renaming option into DebugOnly

cf830bf

petebachant requested review from daverumph, dennisYatunin and imreddyTeja October 13, 2025 12:38

petebachant mentioned this pull request Oct 14, 2025

Update benchmark_step.jl for CUDA benchmarking with useful kernel names CliMA/ClimaAtmos.jl#4055

Open

4 tasks

Switch back to using object ID alone for kernel name

e1285ae

imreddyTeja reviewed Oct 20, 2025

petebachant and others added 5 commits October 20, 2025 14:58

Use methodinstance for kernel name key

06c44e2

Improve readability of kernel names

7411a8d

Merge branch 'main' of github.com:CliMA/ClimaCore.jl into pb/kernel-naming

9dd3116

Handle packages without Clima in their name

93fcf68

Handle packages without Clima in their name

25f23ee

imreddyTeja reviewed Oct 24, 2025

Use src for file as fallback

9f0fa87

imreddyTeja reviewed Oct 27, 2025

petebachant added 3 commits October 27, 2025 14:35

Add GPUCompiler as a weakdep

31ec263

Use GPUCompiler method directly

eafbe54

Switch to pure env var based stack trace naming

52ff458

petebachant added 2 commits October 27, 2025 15:32

Use a constant set at compile time

46c1f6c

Remove GPUCompiler weakdep

fca1ad1

petebachant commented Oct 27, 2025

petebachant added 2 commits November 3, 2025 04:22

Use methodinstance from CUDA

05a5d5b

Merge branch 'main' of https://github.com/CliMA/ClimaCore.jl into pb/kernel-naming

e7bca7d

petebachant and others added 3 commits November 4, 2025 11:20

Use splitpath to split path

a541746

Co-authored-by: Teja Reddy <[email protected]>

Make kernel naming purely dynamic based on env var reading

7af8d16

Pass args into get_kernel_name

cec2ef7
