-
Notifications
You must be signed in to change notification settings - Fork 150
Deduplicate calc_chunk_indices_kernel
#1657
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deduplicate calc_chunk_indices_kernel
#1657
Conversation
|
@robertmaynard can you review this PR as well? I was under the impression that launching a kernel across TUs would not work in CUDA whole compilation mode but here it seems to be working. Aren't kernels supposed to have their symbols hidden too? |
robertmaynard
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach is invalid and goes against the guidance in https://developer.nvidia.com/blog/cuda-c-compiler-updates-impacting-elf-visibility-and-linkage/
This currently only work in cuvs as we have failed to remove: https://github.com/rapidsai/cuvs/blob/main/cpp/cmake/modules/ConfigureCUDA.cmake#L38
|
Ohh I see okay 🥲 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kernel needs to be launched in the same TU as which it is defined. We can (but should ideally avoid) pass the pointer around to other TUs but they shouldn't be attempting to launch the kernel.
|
Hi @robertmaynard , can you check this PR? I've added the changes to follow the guidelines |
|
/merge |
Closes #1577
Reduces binary size by deduplicating
calc_chunk_indices_kernel. This PR reduces instantiations from 62 -> 1 for each template (BlockDim=32, 64, ..., 1024)Binary Size Changes
CUDA 12.9: 1096.15MB ->
CUDA 13: 432.98 MB->