CUDA Copy Kernel for Contiguous Tensors for GGML CPY OP #16471
Conversation
ggml/src/ggml-cuda/cpy.cu
Outdated
const int elements_per_thread = 4;
const int threads_needed = (ne_elements + elements_per_thread - 1) / elements_per_thread;
const int num_blocks = max(1, min(65535, (threads_needed + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE));
I think this won't work if ne_elements is larger than a certain amount (I think 16726016). We can add an assert here or check whether the num_blocks limit can be higher than 65535.
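For illustration, a guard along these lines could make the limit explicit (a sketch, not the PR's code; it reuses threads_needed and CUDA_CPY_BLOCK_SIZE from the snippet above):

// sketch: fail loudly instead of silently truncating the grid at 65535 blocks
GGML_ASSERT((int64_t) threads_needed <= (int64_t) 65535 * CUDA_CPY_BLOCK_SIZE);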
ggml/src/ggml-cuda/cpy.cu
Outdated
const int64_t remaining = ne_elements - base_idx;

if (remaining >= elements_per_thread) {
    if (base_idx % 4 == 0) {
From what I understand, base_idx is always a multiple of elements_per_thread, which is 4, so this check is not necessary?
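(Assuming base_idx is derived from the global thread index in the usual way, which is consistent with the launch code quoted earlier but not shown here, the divisibility indeed holds:)

// assumed indexing scheme, for illustration only
const int64_t tid      = (int64_t) blockIdx.x*blockDim.x + threadIdx.x;
const int64_t base_idx = tid * elements_per_thread; // always a multiple of 4 when elements_per_thread == 4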
ggml/src/ggml-cuda/cpy.cu
Outdated
T * cdst = (cdst_indirect != nullptr) ? cdst_indirect[graph_cpynode_index] : cdst_direct;

const int elements_per_thread = 4;
This is declared both where the kernel is launched and here; perhaps make it a constant like CUDA_CPY_BLOCK_SIZE.
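For example, a shared constant next to CUDA_CPY_BLOCK_SIZE could remove the duplication (the name below is only illustrative):

// e.g. alongside CUDA_CPY_BLOCK_SIZE in cpy.cuh (hypothetical name)
#define CUDA_CPY_ELEMENTS_PER_THREAD 4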
ggml/src/ggml-cuda/cpy.cu
Outdated
        }
    }
} else {
    for (int j = 0; j < remaining; ++j) {
Since here remaining is < 4, we can do an unroll like below, but I doubt it will have any effect on performance:
#pragma unroll
for (int j = 0; j < 4; ++j) {
    size_t i = base + (size_t)j;
    if (i < ne_elements) cdst[i] = cx[i];
}
As it is, this kernel does not properly check for memory alignment. When you copy a float4, this is done as a single 16-byte transfer. However, if the pointer is not aligned to 16 bytes, this will result in a crash. I would suggest you look at:
llama.cpp/ggml/src/ggml-cuda/common.cuh, lines 301 to 312 (2c0d875):
// Maximum number of bytes that can be copied in a single instruction.
static constexpr __device__ int ggml_cuda_get_max_cpy_bytes() {
#ifdef GGML_USE_HIP
    return 16;
#else
#if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
    return 16;
#else
    return 8;
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
#endif // GGML_USE_HIP
}
llama.cpp/ggml/src/ggml-cuda/common.cuh, lines 573 to 597 (2c0d875):
// Aligned memory transfers of 8/16 bytes can be faster than 2 transfers with 4 bytes, especially on AMD.
template <int nbytes, int alignment = 0>
static __device__ __forceinline__ void ggml_cuda_memcpy_1(void * __restrict__ dst, const void * __restrict__ src) {
    if constexpr (alignment != 0) {
        static_assert(nbytes % alignment == 0, "bad alignment");
    }
    constexpr int nb_per_cpy = alignment == 0 ? nbytes : alignment;

#pragma unroll
    for (int i = 0; i < nbytes/nb_per_cpy; ++i) {
        if constexpr (nb_per_cpy == 1) {
            ((char *) dst)[i] = ((const char *) src)[i];
        } else if constexpr (nb_per_cpy == 2) {
            ((short *) dst)[i] = ((const short *) src)[i];
        } else if constexpr (nb_per_cpy == 4) {
            ((int *) dst)[i] = ((const int *) src)[i];
        } else if constexpr (nb_per_cpy == 8) {
            ((int2 *) dst)[i] = ((const int2 *) src)[i];
        } else if constexpr (nb_per_cpy == 16) {
            ((int4 *) dst)[i] = ((const int4 *) src)[i];
        } else {
            static_assert(nbytes == 0 && nbytes == -1, "bad nbytes");
        }
    }
}
- Add a template parameter for the alignment of the copy.
- At runtime, check the alignment of the tensors and run the template specialization with the maximum memory alignment supported by the hardware (I think this will need a utility function to fetch this property in host code); a sketch of such a dispatch follows this list.
- In the kernel, use ggml_cuda_memcpy_1 exactly once per thread to get optimal memory alignment. Avoid using the function with an alignment smaller than the copy size, since this will result in suboptimal memory bandwidth.
- (Maybe change the kernel to use char * to make it more generally applicable.)
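A minimal sketch of what that dispatch could look like, under assumed names: only ggml_cuda_memcpy_1 comes from common.cuh above; the kernel, launcher, and launch parameters here are hypothetical and not the PR's code.

// hypothetical sketch of the suggested alignment dispatch (assumes common.cuh is included)
template <int alignment>
static __global__ void cpy_contiguous_sketch(const char * __restrict__ cx, char * __restrict__ cdst, const int64_t nbytes) {
    const int64_t i = ((int64_t) blockIdx.x*blockDim.x + threadIdx.x) * alignment;
    if (i + alignment <= nbytes) {
        ggml_cuda_memcpy_1<alignment>(cdst + i, cx + i); // exactly one aligned copy per thread
    }
}

static void launch_cpy_contiguous_sketch(const char * cx, char * cdst, const int64_t nbytes, cudaStream_t stream) {
    // pick the widest copy that both pointers and the total size are aligned to
    const uint64_t align_all = (uint64_t)(uintptr_t) cx | (uint64_t)(uintptr_t) cdst | (uint64_t) nbytes;
    const int      block     = 256; // illustrative block size
    if (align_all % 16 == 0) {
        const int64_t nblocks = (nbytes/16 + block - 1) / block;
        cpy_contiguous_sketch<16><<<nblocks, block, 0, stream>>>(cx, cdst, nbytes);
    } else if (align_all % 4 == 0) {
        const int64_t nblocks = (nbytes/4 + block - 1) / block;
        cpy_contiguous_sketch<4><<<nblocks, block, 0, stream>>>(cx, cdst, nbytes);
    } else {
        const int64_t nblocks = (nbytes + block - 1) / block;
        cpy_contiguous_sketch<1><<<nblocks, block, 0, stream>>>(cx, cdst, nbytes);
    }
}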
Is this kernel faster than the previous …
It's still used: llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu, lines 2698 to 2704 (2c0d875).
The return value is completely ignored. Even if it wasn't, the reason it was necessary in the first place is because we used …
Right, you meant removing the function call...
If the in-code comment regarding CUDA graph support is outdated, then my opinion is that we should simply use …
The pointers in mamba should be the same on every token, so I don't think the indirection is necessary.
Thanks! I wasn't aware that pointer indirection wasn't required here, appreciate the insight. I tested this locally by deleting the following section, llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu, lines 2691 to 2704 (2c0d875), and modifying these lines, llama.cpp/ggml/src/ggml-cuda/cpy.cu, lines 332 to 336 (2c0d875), to always use …
With those changes, CUDA Graph execution ran without any issues, and performance (for Nemotron Nano v2) was as follows:
Testing as per the contribution guidelines also didn't raise any new issues. Based on this, it seems safe to remove the copy op pointer indirection code and revert to using …
In PR 16328, CUDA Graph support for the Nemotron Nano v2 (NemotronH) model was enabled by replacing the use of cudaMemcpyAsync with an existing CUDA copy kernel for copies of contiguous tensors. However, that kernel is optimized for non-contiguous tensors.
This PR introduces a CUDA copy kernel for contiguous GGML tensors, which provides a performance improvement of ~3.7% for Nemotron Nano v2 on an RTX 5090.
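For context, a minimal sketch of the general shape of such a contiguous copy (a flat, vectorized byte copy with a grid-stride loop); this is illustrative only, assumes 16-byte-aligned pointers, and is not the exact kernel in cpy.cu:

// illustrative only: contiguous copy as a grid-stride loop over 16-byte chunks, plus a scalar tail
static __global__ void cpy_contiguous_bytes_sketch(const char * __restrict__ cx, char * __restrict__ cdst, const int64_t nbytes) {
    const int64_t nchunks = nbytes / 16;
    for (int64_t i = blockIdx.x*blockDim.x + threadIdx.x; i < nchunks; i += (int64_t) gridDim.x*blockDim.x) {
        ((int4 *) cdst)[i] = ((const int4 *) cx)[i]; // one 16-byte transfer per iteration
    }
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int64_t i = nchunks*16; i < nbytes; ++i) {
            cdst[i] = cx[i]; // tail of nbytes % 16 bytes
        }
    }
}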
Results (RTX 5090):
Weights: bartowski/nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF
Quantization: Q4_K_M
Performance before:
Performance after: