[Refactor] Add global helper to deduplicate vectorized memory ops #35105
LopezCastroRoberto wants to merge 3 commits into vllm-project:main
Conversation
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request is a great refactoring that centralizes vectorized memory operations into a new csrc/cuda_vec_utils.cuh header. This significantly reduces code duplication and improves maintainability across various CUDA kernels. The new utilities are well-designed and make the code more readable and generic. I have one suggestion to further improve the robustness of the new TypeConverter utility by ensuring it fails at compile-time for unspecialized types.
```cpp
template <typename T>
struct TypeConverter {
  using Type = half2;
};
```
The default implementation for TypeConverter is risky as it silently defaults to half2 for any type T that doesn't have an explicit specialization. This could lead to subtle and hard-to-debug errors if a new scalar type is used with a kernel that relies on TypeConverter and a specialization is forgotten. It would be safer to enforce specialization by triggering a compile-time error for unhandled types.
```cpp
template <typename T>
struct TypeConverter {
  // This template must be specialized for each type.
  static_assert(sizeof(T) == 0, "TypeConverter is not specialized for this type.");
};
```
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
PR #35210 must be merged first.
Summary
- Adds a new `csrc/cuda_vec_utils.cuh` header.
- Adds a `CUDA_VERSION >= 12090` guard to the activation kernel (`csrc/activation_kernels.cu`) launch macros so the 256-bit path is only selected when the toolkit supports it.
- Centralizes the shared utilities (`PackedTraits`, wrapper functions).
- New PRs that introduce 256-bit instructions (e.g., #32957, #34917) should use the `cuda_vec_utils.cuh` helper to prevent code duplication and improve long-term maintainability.

Tested on SM120 + CUDA 12.8 and SM103 + CUDA 13.0.
No performance regressions detected.