Skip to content

Conversation

@jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Mar 5, 2025

Summary:
These require +ptx features to be set even though they're guarded by
the __nvvm_reflect. Rather than figure out how to hack around that
with the target attribute I'm just going to disable it for 'generic'
builds and use the slow version for now.

@jhuber6 jhuber6 requested review from Artem-B, arsenm and shiltian March 5, 2025 20:06
@llvmbot llvmbot added clang Clang issues not falling into any other category backend:X86 clang:headers Headers provided by Clang, e.g. for intrinsics labels Mar 5, 2025
@llvmbot
Copy link
Member

llvmbot commented Mar 5, 2025

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-x86

Author: Joseph Huber (jhuber6)

Changes

Summary:
These require +ptx features to be set even though they're guarded by
the __nvvm_reflect. Rather than figure out how to hack around that
with the target attribute I'm just going to disable it for 'generic'
builds and use the slow version for now.


Full diff: https://github.com/llvm/llvm-project/pull/129927.diff

1 Files Affected:

  • (modified) clang/lib/Headers/nvptxintrin.h (+9-1)
diff --git a/clang/lib/Headers/nvptxintrin.h b/clang/lib/Headers/nvptxintrin.h
index 29d0adcabc82f..b2c3097a464fe 100644
--- a/clang/lib/Headers/nvptxintrin.h
+++ b/clang/lib/Headers/nvptxintrin.h
@@ -179,8 +179,10 @@ __gpu_shuffle_idx_u64(uint64_t __lane_mask, uint32_t __idx, uint64_t __x,
 _DEFAULT_FN_ATTRS static __inline__ uint64_t
 __gpu_match_any_u32(uint64_t __lane_mask, uint32_t __x) {
   // Newer targets can use the dedicated CUDA support.
-  if (__CUDA_ARCH__ >= 700 || __nvvm_reflect("__CUDA_ARCH") >= 700)
+#if __CUDA_ARCH__ >= 700
+  if (__nvvm_reflect("__CUDA_ARCH") >= 700)
     return __nvvm_match_any_sync_i32(__lane_mask, __x);
+#endif
 
   uint32_t __match_mask = 0;
   bool __done = 0;
@@ -200,8 +202,10 @@ __gpu_match_any_u32(uint64_t __lane_mask, uint32_t __x) {
 _DEFAULT_FN_ATTRS static __inline__ uint64_t
 __gpu_match_any_u64(uint64_t __lane_mask, uint64_t __x) {
   // Newer targets can use the dedicated CUDA support.
+#if __CUDA_ARCH__ >= 700
   if (__CUDA_ARCH__ >= 700 || __nvvm_reflect("__CUDA_ARCH") >= 700)
     return __nvvm_match_any_sync_i64(__lane_mask, __x);
+#endif
 
   uint64_t __match_mask = 0;
 
@@ -223,9 +227,11 @@ __gpu_match_any_u64(uint64_t __lane_mask, uint64_t __x) {
 _DEFAULT_FN_ATTRS static __inline__ uint64_t
 __gpu_match_all_u32(uint64_t __lane_mask, uint32_t __x) {
   // Newer targets can use the dedicated CUDA support.
+#if __CUDA_ARCH__ >= 700
   int predicate;
   if (__CUDA_ARCH__ >= 700 || __nvvm_reflect("__CUDA_ARCH") >= 700)
     return __nvvm_match_all_sync_i32p(__lane_mask, __x, &predicate);
+#endif
 
   uint32_t __first = __gpu_read_first_lane_u64(__lane_mask, __x);
   uint64_t __ballot = __gpu_ballot(__lane_mask, __x == __first);
@@ -236,9 +242,11 @@ __gpu_match_all_u32(uint64_t __lane_mask, uint32_t __x) {
 _DEFAULT_FN_ATTRS static __inline__ uint64_t
 __gpu_match_all_u64(uint64_t __lane_mask, uint64_t __x) {
   // Newer targets can use the dedicated CUDA support.
+#if __CUDA_ARCH__ >= 700
   int predicate;
   if (__CUDA_ARCH__ >= 700 || __nvvm_reflect("__CUDA_ARCH") >= 700)
     return __nvvm_match_all_sync_i64p(__lane_mask, __x, &predicate);
+#endif
 
   uint64_t __first = __gpu_read_first_lane_u64(__lane_mask, __x);
   uint64_t __ballot = __gpu_ballot(__lane_mask, __x == __first);

// Newer targets can use the dedicated CUDA support.
if (__CUDA_ARCH__ >= 700 || __nvvm_reflect("__CUDA_ARCH") >= 700)
#if __CUDA_ARCH__ >= 700
if (__nvvm_reflect("__CUDA_ARCH") >= 700)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason that __CUDA_ARCH__ got removed from if here but not in other places?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if we're already checking for __CUDA_ARCH__ on preprocessor level, is there any point to use __nvvm_reflect() at all?

IIUIC, the original code was intended to be compiled into "generic" IR, and it had to rely on __nvvm_reflect() to do things differently if/when it eventually ends up targeting a newer GPU, but now that the check is made up front, I'm not quite sure that we need it any more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably just because I forgot to but it was a no-op so I didn't notice.

And yeah, I wish there was a way to let nvvm_reflect call these without the incurring the wrath of the compiler. You can spoof some target attributes but that's always broken when inlined.

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should drop the nvvm reflect here. Really shouldn't have any subtarget dependent code here. Injecting implementation details into the source program is part of the fundamental issue with device lib linking

Summary:
These require `+ptx` features to be set even though they're guarded by
the `__nvvm_reflect`. Rather than figure out how to hack around that
with the `target` attribute I'm just going to disable it for 'generic'
builds and use the slow version for now.
@jhuber6 jhuber6 merged commit ca39a02 into llvm:main Mar 7, 2025
11 checks passed
@jhuber6 jhuber6 deleted the AsmMatch branch March 7, 2025 01:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend:X86 clang:headers Headers provided by Clang, e.g. for intrinsics clang Clang issues not falling into any other category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants