[Refactor] Add global helper to deduplicate vectorized memory ops#35105

Open
LopezCastroRoberto wants to merge 3 commits into vllm-project:main from LopezCastroRoberto:refactor/256b_instr

Conversation


LopezCastroRoberto (Contributor) commented Feb 23, 2026

PR #35210 must be merged first.

Summary

  • Extract duplicated 256-bit PTX load/store primitives, type converters, and vector traits into a shared csrc/cuda_vec_utils.cuh header.
  • Add compile-time CUDA_VERSION >= 12090 guard to activation kernel (csrc/activation_kernels.cu) launch macros so the 256-bit path is only selected when the toolkit supports it.
  • Remove dead code (PackedTraits, wrapper functions).

New PRs that introduce 256-bit instructions (e.g., #32957, #34917) should use the cuda_vec_utils.cuh helper to prevent code duplication and improve long-term maintainability.


Tested on SM120 with CUDA 12.8 and SM103 with CUDA 13.0:

  • tests/kernels/quantization/test_nvfp4_quant.py -- all tests passed
  • tests/kernels/quantization/test_silu_mul_nvfp4_quant.py -- all tests passed
  • tests/kernels/core/test_activation.py -- all tests passed

No performance regressions were detected.

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
LopezCastroRoberto marked this pull request as draft February 23, 2026 14:50
mergify bot added the nvidia label Feb 23, 2026

mergify bot commented Feb 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LopezCastroRoberto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request is a great refactoring that centralizes vectorized memory operations into a new csrc/cuda_vec_utils.cuh header. This significantly reduces code duplication and improves maintainability across various CUDA kernels. The new utilities are well-designed and make the code more readable and generic. I have one suggestion to further improve the robustness of the new TypeConverter utility by ensuring it fails at compile-time for unspecialized types.

Comment on lines +54 to +57
template <typename T>
struct TypeConverter {
  using Type = half2;
};

Severity: high

The default implementation for TypeConverter is risky as it silently defaults to half2 for any type T that doesn't have an explicit specialization. This could lead to subtle and hard-to-debug errors if a new scalar type is used with a kernel that relies on TypeConverter and a specialization is forgotten. It would be safer to enforce specialization by triggering a compile-time error for unhandled types.

template <typename T>
struct TypeConverter {
  // This template must be specialized for each type.
  static_assert(sizeof(T) == 0, "TypeConverter is not specialized for this type.");
};

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
LopezCastroRoberto marked this pull request as ready for review February 24, 2026 20:15
