Name CUDA kernels based on stack trace in auto_launch #2376
Conversation
Added some example results, which look suspiciously different. Reminds me of Simon's comment about the CUDA API not doing much in the previous implementation that computed the name every time a kernel was created. Do we need to somehow hydrate the kernel name cache initially to get a useful result here?
Are these profile results from the first step, or a later step? The kernel renaming example takes significantly longer, which makes me think these might be first-step results that include compilation time.
ext/cuda/cuda_utils.jl
Outdated
| "_" * | ||
| string(frame.linfo.def.file) * | ||
| "_" * | ||
| string(frame.line) |
It might be a good idea to either handle the case where frame.linfo.def.file is inside NVTX.jl specially, or to leave a note here that filenames and line numbers of kernels will be incorrect if inside an NVTX annotation.
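As a hedged sketch of what the special-case handling could look like (the frames variable and the helper name here are illustrative, not part of the PR):

# Pick the first stack frame that has a method instance and whose source
# file is not inside NVTX.jl, so the NVTX annotation machinery never
# supplies the kernel's name. `frames` is assumed to be the result of
# `stacktrace()`, i.e. a Vector of StackFrames.
is_usable_frame(frame) =
    frame.linfo isa Core.MethodInstance &&
    frame.linfo.def isa Method &&
    !occursin("NVTX", string(frame.linfo.def.file))

frame = first(Iterators.filter(is_usable_frame, frames))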
I like this a lot. If a lot of kernels are getting caught in the NVTX annotations, it might make sense to reduce the number of NVTX annotations in downstream packages.
kernel_name_exists = key in keys(kernel_names)
if !kernel_name_exists
These syntax changes may be less clear, so feel free to ignore:
kernel_name_exists = haskey(kernel_names, key)
if !kernel_name_exists
or
kernel_name = get!(kernel_names, key) do
    # calculate the name here
end
I just tried using this in ClimaLand, and it worked. This is so much easier than `dev`ing GPUCompiler.jl.
ext/cuda/cuda_utils.jl
Outdated
kernel_name = nothing
if name_kernels_from_stack_trace()
    # Create a key from the method instance and types of the args
    key = objectid(methodinstance(typeof(f!), typeof(args)))
Suggested change:
-key = objectid(methodinstance(typeof(f!), typeof(args)))
+key = objectid(methodinstance(F!, typeof(args)))
The function signature could also be changed to
function auto_launch!(
    f!::F!,
    args::ARGS,
    nitems::Union{Integer, Nothing} = nothing;
    auto = false,
    threads_s = nothing,
    blocks_s = nothing,
    always_inline = true,
    caller = :unknown,
) where {F!, ARGS}
    ...
    key = objectid(methodinstance(F!, ARGS))
but I think that would force specialization of the method (I'm not sure whether that is desirable here or not).
ext/cuda/cuda_utils.jl
Outdated
import ClimaCore.DataLayouts
import ClimaCore.DataLayouts: empty_kernel_stats
import ClimaCore.DebugOnly: name_kernels_from_stack_trace
import CUDA.GPUCompiler: methodinstance
Not sure if importing a package through another package is recommended or not, and I can't find anything when I look online.
It does feel wrong. Would it be smarter to make GPUCompiler an explicit dependency of ClimaCore? Is there a nice way to make it optional for just the CUDA extension?
Reading a little more about this, I guess it would go in weakdeps?
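For reference, a rough sketch of what those Project.toml entries could look like (the extension name ClimaCoreCUDAExt and both UUIDs are assumptions here and should be checked against the General registry):

[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
GPUCompiler = "61eb1bfa-7361-4325-ad38-22787b887f55"

[extensions]
ClimaCoreCUDAExt = ["CUDA", "GPUCompiler"]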
I think ClimaLand is a good example of using extensions/weak deps. I'm not sure if there is a way to manage extensions with the package manager, though (I never found a way for Julia 1.10), so you might need to edit the Project.toml by hand.
I was able to add the weak dep to ClimaCore with `add --weak` in Julia v1.11, but I don't seem to be able to add it to the extension.
ext/cuda/cuda_utils.jl
Outdated
# Import from CUDA.GPUCompiler, since we can't depend on GPUCompiler directly
# in an extension package
import CUDA.GPUCompiler: methodinstance
This would probably be possible with the correct definitions in Project.toml, but is it worth adding all over the place?
It looks like CUDA.jl imports methodinstance internally, so you can just write CUDA.methodinstance to be consistent with the other `CUDA.`-qualified function calls in this file.
I suppose this isn't a great practice in general, as it involves making use of another package's internal implementation details, but in this case I think it's fine. GPUCompiler.jl is a fundamental part of CUDA.jl, so that internal import is unlikely to go away any time soon. And yes, Julia doesn't allow us to add dependencies for extensions, at least until this issue is fixed.
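Concretely, the call site from the diff above would then become something like:

# Reuse CUDA.jl's internal import of GPUCompiler.methodinstance instead of
# importing GPUCompiler through CUDA in this file:
key = objectid(CUDA.methodinstance(typeof(f!), typeof(args)))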
Both of these approaches (redefining a method and changing an environment variable) involve recompiling ClimaCore's CUDA extension to activate this feature, since the feature is disabled in the default version of ClimaCore we precompile for the CI depot. The difference is in how this recompilation is achieved. Since starting a new Julia session requires recompiling the Base library, the environment variable approach can end up needing more recompilation than the other approach. So, in the interest of minimizing total compilation time, we typically use method redefinitions to activate optional package features. On the other hand, if your setup requires you to start a new Julia session anyway, it doesn't really matter which option you choose. Also, the environment variable approach I described assumes that your flag needs to be a compile-time constant; if you're okay with performing an environment lookup every time you access the flag, then the environment variable approach will probably be simpler to implement.
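A minimal sketch of the method-redefinition pattern described above, assuming the flag function lives in ClimaCore.DebugOnly as in the imports earlier in this thread (the exact definitions are assumptions):

# In ClimaCore.DebugOnly (the precompiled default; assumed definition):
name_kernels_from_stack_trace() = false

# In the user's session or driver script, before any kernels launch.
# Redefining the method invalidates compiled code that depends on it,
# so the extension is recompiled with the feature enabled:
import ClimaCore
ClimaCore.DebugOnly.name_kernels_from_stack_trace() = true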
Thanks @dennisYatunin. What do you think of the current implementation with the



This PR adds stack trace-based CUDA kernel naming to make performance benchmarks more informative. Kernel names are computed once per session and cached in a module-level Dict.
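A rough sketch of that once-per-session caching pattern (illustrative names, not the exact implementation):

# Module-level cache: method instance object id => computed kernel name.
const kernel_names = Dict{UInt, String}()

# `get!` computes and stores the name only for the first launch of each
# unique kernel; subsequent launches reuse the cached entry.
cached_kernel_name(key::UInt, frame) =
    get!(kernel_names, key) do
        string(frame.func) * "_" *
            string(frame.linfo.def.file) * "_" *
            string(frame.line)
    end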
TODO:
Example results from ClimaAtmos Buildkite
From here