@petebachant
Member

@petebachant petebachant commented Oct 13, 2025

This PR adds stack trace-based CUDA kernel naming to make performance benchmarks more informative. Kernel names are computed once per session and cached in a module-level Dict.

  • Code follows the style guidelines OR N/A.
  • Unit tests are included OR N/A.
  • Code is exercised in an integration test OR N/A.
  • Documentation has been added/updated OR N/A.

TODO:

  • Confirm we get useful results, i.e., that we don't spend too much time computing the kernel name and wash out the timeline.

Example results from ClimaAtmos Buildkite

From here

image

@petebachant
Member Author

Added some example results, which look suspiciously different. Reminds me of Simon's comment about the CUDA API not doing much in the previous implementation that computed the name every time a kernel was created. Do we need to somehow hydrate the kernel name cache initially to get a useful result here?

@imreddyTeja
Member

"...look suspiciously different. Reminds me of Simon's comment about the CUDA API not doing much in the previous implementation that computed the name every time a kernel was created. Do we need to somehow hydrate the kernel name cache initially to get a useful result here?"

Are these profile results from the first step, or a later step? The kernel renaming example takes significantly longer, which makes me think stacktrace is getting called.

@petebachant
Member Author

petebachant commented Oct 20, 2025

I think I figured this out. The compile step was running before kernel renaming was enabled. Nope, still taking way too long when this feature is enabled.

"_" *
string(frame.linfo.def.file) *
"_" *
string(frame.line)

Member

It might be a good idea to either handle the case where frame.linfo.def.file is inside NVTX.jl specially, or to leave a note here that filenames and line numbers of kernels will be incorrect if inside an NVTX annotation.
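
One possible way to handle the NVTX case is sketched below: walk the frames and skip any whose file path contains `NVTX`. The helper name `kernel_name_from_frames`, the `skip` keyword, the fallback name, and the example paths are all hypothetical; this is not the code in this PR, and the name format only loosely follows the snippet above.

```julia
# Hypothetical helper: build a kernel name from the first stack frame that
# does not come from a skipped package (here, any path containing "NVTX").
# `frames` can be the result of `stacktrace()` or any iterable of objects
# with `func`, `file`, and `line` properties.
function kernel_name_from_frames(frames; skip = ("NVTX",))
    for frame in frames
        file = string(frame.file)
        # Skip frames whose source file lives inside a skipped package
        any(token -> occursin(token, file), skip) && continue
        return string(frame.func, "_", basename(file), "_", frame.line)
    end
    return "unknown_kernel"  # assumed fallback when every frame is skipped
end

# Usage with hand-written frames (NamedTuples stand in for StackFrame objects)
example_frames = [
    (func = :annotate, file = Symbol("/depot/packages/NVTX/abc/src/macro.jl"), line = 12),
    (func = :copyto!, file = Symbol("/repo/src/DataLayouts/copyto.jl"), line = 34),
]
example_name = kernel_name_from_frames(example_frames)
```

Whether skipping to the next frame yields the right caller depends on how the annotation macro expands, so this would need checking against real traces.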

Member Author

What do you think about the format below? If it's an NVTX annotation, it's easy to see that. Likewise, it's easy to tell if the name is a function defined in a file:

image

Member

I like this a lot. If a lot of kernels are getting caught in the NVTX annotations, it might make sense to reduce the number of NVTX annotations in downstream packages.

Comment on lines +51 to +52
kernel_name_exists = key in keys(kernel_names)
if !kernel_name_exists

Member

These syntax changes may be less clear, so feel free to ignore

kernel_name_exists = haskey(kernel_names, key)
if !kernel_name_exists

or

kernel_name = get!(kernel_names, key) do
    # calculate_name_here
end
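
To make the `get!` suggestion concrete, here is a minimal, self-contained sketch of the caching pattern. The names `name_cache`, `compute_kernel_name`, and `cached_kernel_name` are hypothetical stand-ins, and the "expensive" computation is just a placeholder for the stack trace work.

```julia
# Module-level cache mapping a key (e.g. an objectid) to a computed name
const name_cache = Dict{UInt, String}()

# Counter used only to show how often the expensive path runs
const n_computed = Ref(0)

# Placeholder for the expensive stack-trace-based computation (hypothetical)
function compute_kernel_name(key::UInt)
    n_computed[] += 1
    return "kernel_" * string(key, base = 16)
end

# `get!` runs the do-block only when `key` is missing, then stores the result
function cached_kernel_name(key::UInt)
    return get!(name_cache, key) do
        compute_kernel_name(key)
    end
end

first_name = cached_kernel_name(UInt(42))
second_name = cached_kernel_name(UInt(42))  # served from the cache
```

Compared with a separate `haskey` check followed by an insertion, `get!` keeps the check and the store in one call.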

@imreddyTeja
Member

I just tried using this in ClimaLand, and it worked. This is so much easier than dev'ing GPUCompiler.jl.

@petebachant
Member Author

petebachant commented Oct 24, 2025

The file paths don't make much sense in ClimaAtmos's Buildkite runs, since I guess they handle dev installs differently:

image

Maybe I should just split on src as a last resort.

@petebachant
Member Author

Breaking on src if we can't find any other tokens:

image
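
The "split on src" fallback could look roughly like the sketch below. The helper name `path_after_src` and the example paths are hypothetical, and the actual PR may tokenize paths differently.

```julia
# Hypothetical helper: shorten an absolute file path to the part after the
# last "src" directory, falling back to the file name when "src" is absent.
function path_after_src(path::AbstractString)
    parts = split(path, r"[/\\]")  # accept both forward and backward slashes
    idx = findlast(==("src"), parts)
    if idx === nothing || idx == length(parts)
        return String(last(parts))
    end
    return join(parts[(idx + 1):end], "/")
end

short_path = path_after_src("/var/lib/buildkite/depot/abc123/src/DataLayouts/copyto.jl")
no_src_path = path_after_src("/tmp/script.jl")
```

Using the last `src` component rather than the first keeps the result short when a depot path itself contains a `src` directory.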

kernel_name = nothing
if name_kernels_from_stack_trace()
# Create a key from the method instance and types of the args
key = objectid(methodinstance(typeof(f!), typeof(args)))

Member

Suggested change
- key = objectid(methodinstance(typeof(f!), typeof(args)))
+ key = objectid(methodinstance(F!, typeof(args)))

The function signature could also be changed to

function auto_launch!(
    f!::F!,
    args::ARGS,
    nitems::Union{Integer, Nothing} = nothing;
    auto = false,
    threads_s = nothing,
    blocks_s = nothing,
    always_inline = true,
    caller = :unknown,
) where {F!, ARGS}
.
.
.

 key = objectid(methodinstance(F!, ARGS))

but I think that forces specialization of the method (I'm not sure whether that is desirable here).
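
As a small illustration of the `where {F!, ARGS}` idea, the sketch below reads the function and argument types from static parameters instead of calling `typeof`. The function `launch_key` is hypothetical, and `hash((F, ARGS))` is only a stand-in for `objectid(methodinstance(F!, ARGS))`, which needs GPUCompiler. Julia's performance tips note that the compiler may avoid specializing on `Function`-typed arguments that are only passed through, and that adding a type parameter forces specialization, which is consistent with the concern above.

```julia
# Sketch: with `where {F, ARGS}`, the types are available as static
# parameters, so no `typeof` calls are needed inside the body.
function launch_key(f!::F, args::ARGS) where {F, ARGS}
    # Stand-in for `objectid(methodinstance(F, ARGS))` (assumption)
    return hash((F, ARGS))
end

# A toy in-place function to use as the "kernel"
add_one!(x) = (x .+= 1; x)

key_a = launch_key(add_one!, ([1.0, 2.0],))
key_b = launch_key(add_one!, ([3.0, 4.0],))     # same types, so the same key
key_c = launch_key(add_one!, (Float32[1, 2],))  # different argument types
```

The key depends only on the types, not the values, so repeated launches with the same signature map to one cache entry.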

import ClimaCore.DataLayouts
import ClimaCore.DataLayouts: empty_kernel_stats
import ClimaCore.DebugOnly: name_kernels_from_stack_trace
import CUDA.GPUCompiler: methodinstance

Member

Not sure if importing a package through another package is recommended or not, and I can't find anything when I look online.

Member Author

It does feel wrong. Would it be smarter to make GPUCompiler an explicit dependency of ClimaCore? Is there a nice way to make it optional for just the CUDA extension?

Member Author

Reading a little more about this, I guess it would go in weakdeps?

Member

I think ClimaLand is a good example of using extensions/weak deps. I'm not sure if there is a way to manage extensions with the package manager though (I never found a way for Julia 1.10), so you might need to edit the Project.toml by hand.

Member Author

@petebachant petebachant Oct 27, 2025

I was able to add the weak dep to ClimaCore with add --weak in Julia v1.11, but I don't seem to be able to add it to the extension.

Comment on lines 5 to 7
# Import from CUDA.GPUCompiler, since we can't depend on GPUCompiler directly
# in an extension package
import CUDA.GPUCompiler: methodinstance

Member Author

This would probably be possible with the correct definitions in Project.toml, but is it worth adding all over the place?

Member

@dennisYatunin dennisYatunin Oct 31, 2025

It looks like CUDA.jl imports methodinstance internally, so you can just write CUDA.methodinstance to be consistent with the other CUDA.* function calls in this file.

I suppose this isn't a great practice in general, as it involves making use of another package's internal implementation details, but in this case I think it's fine. GPUCompiler.jl is a fundamental part of CUDA.jl, so that internal import is unlikely to go away any time soon. And yes, Julia doesn't allow us to add dependencies for extensions, at least until this issue is fixed.

@dennisYatunin
Member

dennisYatunin commented Oct 31, 2025

One hangup I have is the redefine-a-function-to-activate way of using this versus an environmental variable. The latter seems nicer since you can do it without changing your code (driver script?), but the former seems to be the typical pattern. What is the upside of the redefined function approach?

Both of these approaches, redefining a method and changing an environment variable, involve recompiling ClimaCore's CUDA extension to activate this feature, since the feature is disabled in the default version of ClimaCore we precompile for the CI depot. The difference is in how this recompilation is achieved:

  • When a method is redefined, all compiled code that depends on it (either directly or indirectly) is invalidated and then recompiled the next time it is called. The compiler achieves this by tracing through the "backedges" that connect different MethodInstance objects, starting from the MethodInstance that was redefined.
  • When an environment variable is changed, Julia does not have any mechanism for automatically detecting that change. So, if ClimaCore has a constant flag determined from an environment variable, the variable needs to be changed before ClimaCore is compiled in the current Julia session. Because the compiler is not able to redefine constant variables without starting a new Julia session, the flag will always remain unchanged after ClimaCore is loaded for the first time.

Since starting a new Julia session requires recompiling the Base library, the environment variable approach can end up needing more recompilation than the other approach. So, in the interest of minimizing total compilation time, we typically use method redefinitions to activate optional package features.

On the other hand, if your setup requires you to start a new Julia session anyway, it doesn't really matter which option you choose. Also, the environment variable approach I described assumes that your flag needs to be a compile-time constant; if you're okay with performing an environment lookup every time you access the flag, then the environment variable approach will probably be simpler to implement.
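
The trade-offs above can be sketched in a toy module. Everything here is hypothetical: `FlagDemo`, its functions, and the `FLAGDEMO_NAME_KERNELS` variable are illustrative names, not ClimaCore's API.

```julia
module FlagDemo

# Approach 1: a zero-argument function that users redefine to opt in
name_kernels() = false
label_via_method() = name_kernels() ? "named" : "default"

# Approach 2a: a constant read from the environment once, at module load
const NAME_KERNELS_AT_LOAD =
    get(ENV, "FLAGDEMO_NAME_KERNELS", "false") == "true"
label_via_const() = NAME_KERNELS_AT_LOAD ? "named" : "default"

# Approach 2b: look the variable up on every call (not a compile-time constant)
label_via_lookup() =
    get(ENV, "FLAGDEMO_NAME_KERNELS", "false") == "true" ? "named" : "default"

end # module

# Changing the environment after loading does not affect the constant,
# but it is picked up by the per-call lookup
ENV["FLAGDEMO_NAME_KERNELS"] = "true"
const_label = FlagDemo.label_via_const()
lookup_label = FlagDemo.label_via_lookup()

# Redefining the method changes the result of code that depends on it
label_before = FlagDemo.label_via_method()
FlagDemo.name_kernels() = true
label_after = FlagDemo.label_via_method()
```

The per-call lookup is the simplest to toggle, at the cost of an environment read on every access; the redefinable method keeps the flag visible to the compiler while still allowing it to change within a session.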

@petebachant
Member Author

Thanks @dennisYatunin. What do you think of the current implementation with the const? Is that acceptable for now?
