[Discussion] Unify CUDA and HIP kernel sources via compat.cuh portability layer #1877
Abdennacer-Badaoui wants to merge 4 commits into bitsandbytes-foundation:main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
TimDettmers
left a comment
PR Review: [Discussion] Unify CUDA and HIP kernel sources via compat.cuh portability layer
Classification: Refactoring / RFC (discussion-only, not intended to merge as-is)
Author: @Abdennacer-Badaoui (known contributor — authored the blocksize-32/64 kernels in #1854/#1856 that this unifies)
Risk level: Low (all files are additions in csrc/examples/, no existing code is modified)
Summary
This PR proposes a design for merging the duplicated CUDA and HIP kernel source files into a unified codebase using two new portability headers: compat.cuh (host-safe) and compat_device.cuh (device-only). The current codebase maintains near-identical copies of 5 pairs of files (~6500 LOC of duplication). The proposed approach would eliminate 5 files and ~3300 lines of duplication while introducing 2 new portability headers.
The 8 example files demonstrate the full approach. This is a well-structured RFC that shows rather than tells.
CI Status
- Lint: FAIL (expected — clang-format likely hasn't been run on the new files)
- build-wheels: FAIL (unrelated — dependency on lint)
- All CUDA/HIP/CPU build & test jobs: PASS (these don't compile `csrc/examples/`)
The lint failure is expected for an RFC and is not a concern at this stage.
Design Assessment
The two-header split (compat.cuh for host-safe code, compat_device.cuh for device-only CUB/MMA) is a clean design. The rationale is solid: .cpp files compiled by gcc/g++ cannot parse CUDA device headers, so the split is necessary.
Strengths:
- Namespace aliasing for CUB/hipCUB (`namespace bnb_cub = cub` / `hipcub`) eliminates ~90% of the mechanical `cub::` vs `hipcub::` differences with a single line. Elegant.
- Compile-time `BNB_WARP_SIZE` in `common_unified.cuh` correctly handles the GFX9 (CDNA) 64-wide warps vs the RDNA/CUDA 32-wide warps. The `#ifdef __GFX9__` guard is correct for current ROCm architectures.
- `kQuantizeBlockwiseSmall` successfully unifies `kQuantizeBlockwise32` (CUDA) and `kQuantizeBlockwise64` (HIP) by parameterizing on `BNB_WARP_SIZE`. The kernel logic is structurally identical to both originals — I verified that the codebook values, reduction ops, quantization packing, and store patterns match.
- `#if BNB_HIP` guards are used sparingly and only where genuinely needed:
  - `atomicMax` (CUDA CAS loop vs HIP native)
  - `Context` class (cuBLAS vs rocBLAS handle creation)
  - `gemmex` / `strided_gemmex` (different BLAS APIs)
  - `igemmlt` (hipBLAS requires explicit heuristic algo selection)
  - the `blocksize == 64` dispatch path in `ops_unified.cu` (only HIP with 64-wide warps needs the small-block kernel for blocksize=64)
- The CMakeLists change is minimal and correct: a single `GPU_FILES` list replaces the separate `CUDA_FILES`/`HIP_FILES` lists, with `set_source_files_properties(${GPU_FILES} PROPERTIES LANGUAGE HIP)` for HIP builds. The `<<<>>>` launch syntax works natively on HIP, so no `hipLaunchKernelGGL` wrappers are needed.
Technical concerns (for discussion):
- `BNB_WARP_SIZE` and blocksize=64 dispatch: In `ops_unified.cu` lines 50-61, the `blocksize == 64` path has a `#if BNB_HIP` guard to dispatch to `kQuantizeBlockwiseSmall` for 4-bit types on HIP. However, `BNB_WARP_SIZE` is a device-side macro (`__GFX9__` is only defined in device code), while this dispatch decision is made in host code. How will the host-side code know whether to use the warp-64 path? The current approach uses `#if BNB_HIP` as a proxy, which is correct if the library is compiled separately for each target architecture, but could be wrong if a single HIP binary targets both CDNA (warp64) and RDNA (warp32) architectures simultaneously. This probably needs a runtime check or separate compilation per arch, or at least a comment explaining the assumption.
- `kQuantizeBlockwiseSmall` name: The kernel is called "Small", but on HIP with warp=64 it handles blocksize=64 (not small at all). Consider `kQuantizeBlockwiseWarp` or similar to reflect that it processes warp-sized blocks. Minor naming nit.
- `compat.cuh` includes `rocblas/rocblas.h` and `hipblas/hipblas.h` unconditionally on HIP: These are heavyweight headers. If `compat.cuh` is meant to be "host-safe and lightweight", consider whether these BLAS includes belong here or in a separate BLAS compat header. Currently the `Context` class in `ops_unified.cuh` needs them, but other files including `compat.cuh` may not.
- `BNB_BLASLT_PTR_MODE_ALPHA_VEC` asymmetry: On CUDA this maps to `CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO`; on HIP, to `HIPBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST`. The `BETA_ZERO` vs `BETA_HOST` difference is notable — is this an intentional difference in how the two backends handle beta, or should it be `BETA_ZERO` on both? The discrepancy exists in the current code, so it is not introduced by this PR, but the unification is a good opportunity to document why.
- Missing `bnb_blasLtPrefCreate` / `bnb_blasLtPrefSetAttr` / `bnb_blasLtAlgoGetHeuristic` macros for CUDA: These are defined for HIP in `compat.cuh` but not for CUDA, because CUDA does not need the heuristic path. They are only used inside a `#if BNB_HIP` block in `ops_unified.cu`, so there is no build failure — but it means the compat header is incomplete if someone tried to use these macros on CUDA. Add a comment or `#ifdef` guard noting they are HIP-only.
- `CUDA_CHECK_RETURN` backward-compat macro: Good that `compat.cuh` defines `#define CUDA_CHECK_RETURN(value) BNB_CHECK_RETURN(value)` for migration purposes. This should be documented as deprecated and removed after the full migration.
Security Review
- No network access, command execution, or dynamic code execution introduced
- No new dependencies added
- No changes to `pyproject.toml`, CI workflows, or agent configuration files
- No invisible Unicode characters detected in any file
- Codebook values (FP4 and NF4 lookup tables) are byte-identical to the existing `kernels.cu`
- CMakeLists changes are limited to file-list unification — no new `execute_process`, `FetchContent`, or custom commands
- Build flags unchanged
No security concerns.
Numerical Correctness
All quantization/dequantization kernel code is mechanically equivalent to the existing CUDA and HIP kernels. Specifically verified:
- `fp4_dequantization_lut` and `nf4_dequantization_lut` values are identical
- `dQuantizeFP4`, `dQuantizeNF4`, `dDequantizeFP4Tree`, `dDequantizeNF4` logic is identical
- The `atomicMax` CAS loop is correctly guarded with `#if !BNB_HIP`
- The `kQuantizeBlockwise` template uses `bnb_cub::` and `BNB_MAX_OP` as 1:1 replacements
- `kQuantizeBlockwiseSmall` logic matches both `kQuantizeBlockwise32` (CUDA) and `kQuantizeBlockwise64` (HIP)
- `igemmlt` preserves the HIP heuristic path and the CUDA direct path
No numerical correctness concerns.
Downstream Impact
None. This PR adds files to csrc/examples/ — it does not modify any compiled source, public API, or serialization format. No downstream impact.
Cross-PR Conflicts
PR #1858 (k-bit blockwise quantization kernels) adds new CUDA kernels. If this RFC proceeds to full migration, the new kernels from #1858 would need to be written using the compat.cuh abstractions rather than raw CUDA APIs. Worth noting for sequencing.
Verdict: APPROVE (as RFC)
This is a well-designed RFC. The portability layer approach is sound, the #if BNB_HIP guards are minimal and limited to genuinely divergent code, and the unified kernel code is a faithful merge of the existing CUDA and HIP sources. The concerns listed above are discussion points for the design, not blockers.
For the full migration, I'd recommend:
- Resolve the warp-size host/device detection question (concern #1 above)
- Add compilation tests that verify the unified files build correctly for both CUDA and HIP
- Run the full test suite on both CUDA and ROCm hardware to verify numerical equivalence
- Sequence this after or coordinate with #1858 to avoid rework
@Abdennacer-Badaoui Thanks! This is essentially what I was expecting we could do. I think this is a good way forward. Most of the review comments above make sense as well!
MI300X Runtime Validation of Unified CUDA/HIP Kernels

I validated this PR's unified kernel approach on actual AMD MI300X hardware. The unified files were activated from

Test Environment

Test Results

Full test suite passes with zero failures.

Build Fixes Required

The unified files from

Branch

All fixes are on

Summary

The unified kernel approach works well. The compat headers successfully abstract CUDA/HIP differences, and the conditional compilation produces correct code for MI300X. The three issues found above are all fixable within the compat-header pattern — no fundamental problems with the unification design.
RFC — Not intended to be merged as-is
This PR proposes a design for merging the duplicated CUDA and HIP kernel sources into a single codebase. The `csrc/examples/` directory contains the full unified files demonstrating the approach. This is meant for discussion and feedback before we proceed with a full migration.

Problem

We maintain near-identical copies of every GPU kernel:
- `kernels.cu` / `kernels.hip`
- `kernels.cuh` / `kernels_hip.cuh`
- `ops.cu` / `ops.hip`
- `ops.cuh` / `ops_hip.cuh`
- `common.cuh` / `common_hip.cuh`

The HIP files were originally auto-generated by hipify and manually patched. Every bug fix or new feature must be applied to both copies, and they inevitably drift apart.
Proposed design
Introduce two portability headers:
- `compat.cuh` — Host-safe types and macros (safe to include from `.cpp` files)
- `compat_device.cuh` — Device-only layer: CUB/hipCUB, reduction ops, MMA (include from `.cu` files only)

These resolve all mechanical CUDA/HIP differences via macros, type aliases, and namespace aliases:
- `bnb_cub::` → `cub::` on CUDA, `hipcub::` on HIP
- `bnb_bfloat16` → `__nv_bfloat16` on CUDA, `hip_bfloat16` on HIP
- `bnb_stream_t` → `cudaStream_t` / `hipStream_t`
- `BNB_MAX_OP` → `cub::Max()` / `hipcub::Max()`
- `BNB_CHECK_RETURN()` → unified error checking
- `bnb_blasLt*`, `bnb_sparse*` → cuBLAS/hipBLAS and cuSPARSE/hipSPARSE
Kernel code uses these abstractions and compiles unmodified on both platforms. The `<<<grid, block>>>` launch syntax works natively on HIP, so no `hipLaunchKernelGGL` wrappers are needed. For HIP builds, CMake simply sets `LANGUAGE HIP` on the `.cu` files.

`#if BNB_HIP` guards are only needed for genuinely divergent code (~10% of changes):
- `atomicMax` (CUDA needs a CAS loop, HIP has a native version)
- `Context` class (cuBLAS vs rocBLAS handle creation)
- `igemmlt` (hipBLAS requires explicit heuristic algo selection)
- warp-size differences (`BNB_WARP_SIZE` compile-time constants)

The split into two headers is necessary because `.cpp` files (like `pythonInterface.cpp`) are compiled by the host compiler (gcc/g++), which cannot parse CUB/device headers. Only `.cu` files compiled by nvcc/hipcc include `compat_device.cuh`.

Example files in
`csrc/examples/`:
- `compat.cuh`
- `compat_device.cuh`
- `common_unified.cuh` (replaces `common.cuh` + `common_hip.cuh`)
- `kernels_unified.cu` (replaces `kernels.cu` + `kernels.hip`)
- `ops_unified.cuh`
- `ops_unified.cu` (`#if BNB_HIP` for divergent APIs)
- `pythonInterface_unified.cpp`
- `CMakeLists_unified.txt` (`GPU_FILES` list)
End state after full migration:
- Removed: `common_hip.cuh`, `kernels.hip`, `kernels_hip.cuh`, `ops.hip`, `ops_hip.cuh`
- Kept: `common.cuh`, `kernels.cu`, `kernels.cuh`, `ops.cu`, `ops.cuh` (now unified)
- Added: `compat.cuh`, `compat_device.cuh`