
Conversation

@jhuber6
Contributor

@jhuber6 jhuber6 commented Dec 4, 2024

Summary:
This is consistent with other intrinsic headers like the SSE/AVX
intrinsics. I don't think these function names need to be specifically
reserved, because we are not natively including this header in any TUs.
The main reason for this change is that LSP providers like clangd
intentionally refuse to autocomplete __-prefixed names, since they are
considered internal. This makes using this header really, really
annoying.
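
For context, here is a minimal sketch of the kind of kernel code this header is used for, written against the single-underscore names proposed in this patch. The kernel, the grid-stride loop, and the surrounding setup are purely illustrative assumptions; only the _gpu_* entry points and the __gpu_kernel macro come from the headers touched in the diff below.

```c
// Illustrative sketch only: a hypothetical grid-stride fill kernel using the
// renamed _gpu_* intrinsics from <gpuintrin.h>. The kernel itself is not part
// of this patch.
#include <gpuintrin.h>
#include <stdint.h>

__gpu_kernel void fill(uint32_t *out, uint32_t n, uint32_t value) {
  // Dimension 0 selects the 'x' dimension for the block/thread queries.
  uint32_t stride = _gpu_num_threads(0) * _gpu_num_blocks(0);
  for (uint32_t i = _gpu_block_id(0) * _gpu_num_threads(0) + _gpu_thread_id(0);
       i < n; i += stride)
    out[i] = value;
}
```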

@llvmbot llvmbot added the clang, backend:AMDGPU, backend:X86, clang:headers, and libc labels on Dec 4, 2024
@llvmbot
Member

llvmbot commented Dec 4, 2024

@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-libc

@llvm/pr-subscribers-backend-amdgpu

Author: Joseph Huber (jhuber6)

Changes

Summary:
This is consistent with other intrinsic headers like the SSE/AVX
intrinsics. I don't think these function names need to be specifically
reserved, because we are not natively including this header in any TUs.
The main reason for this change is that LSP providers like clangd
intentionally refuse to autocomplete __-prefixed names, since they are
considered internal. This makes using this header really, really
annoying.


Patch is 39.69 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/118674.diff

6 Files Affected:

  • (modified) clang/lib/Headers/amdgpuintrin.h (+27-27)
  • (modified) clang/lib/Headers/gpuintrin.h (+49-49)
  • (modified) clang/lib/Headers/nvptxintrin.h (+34-34)
  • (modified) clang/test/Headers/gpuintrin.c (+82-82)
  • (modified) clang/test/Headers/gpuintrin_lang.c (+2-2)
  • (modified) libc/shared/rpc_util.h (+7-7)
diff --git a/clang/lib/Headers/amdgpuintrin.h b/clang/lib/Headers/amdgpuintrin.h
index 720674a85f52cf..07330061647915 100644
--- a/clang/lib/Headers/amdgpuintrin.h
+++ b/clang/lib/Headers/amdgpuintrin.h
@@ -34,90 +34,90 @@ _Pragma("omp begin declare variant match(device = {arch(amdgcn)})");
 #define __gpu_kernel __attribute__((amdgpu_kernel, visibility("protected")))
 
 // Returns the number of workgroups in the 'x' dimension of the grid.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_blocks_x(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_blocks_x(void) {
   return __builtin_amdgcn_grid_size_x() / __builtin_amdgcn_workgroup_size_x();
 }
 
 // Returns the number of workgroups in the 'y' dimension of the grid.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_blocks_y(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_blocks_y(void) {
   return __builtin_amdgcn_grid_size_y() / __builtin_amdgcn_workgroup_size_y();
 }
 
 // Returns the number of workgroups in the 'z' dimension of the grid.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_blocks_z(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_blocks_z(void) {
   return __builtin_amdgcn_grid_size_z() / __builtin_amdgcn_workgroup_size_z();
 }
 
 // Returns the 'x' dimension of the current AMD workgroup's id.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_block_id_x(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_block_id_x(void) {
   return __builtin_amdgcn_workgroup_id_x();
 }
 
 // Returns the 'y' dimension of the current AMD workgroup's id.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_block_id_y(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_block_id_y(void) {
   return __builtin_amdgcn_workgroup_id_y();
 }
 
 // Returns the 'z' dimension of the current AMD workgroup's id.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_block_id_z(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_block_id_z(void) {
   return __builtin_amdgcn_workgroup_id_z();
 }
 
 // Returns the number of workitems in the 'x' dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_threads_x(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_threads_x(void) {
   return __builtin_amdgcn_workgroup_size_x();
 }
 
 // Returns the number of workitems in the 'y' dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_threads_y(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_threads_y(void) {
   return __builtin_amdgcn_workgroup_size_y();
 }
 
 // Returns the number of workitems in the 'z' dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_threads_z(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_threads_z(void) {
   return __builtin_amdgcn_workgroup_size_z();
 }
 
 // Returns the 'x' dimension id of the workitem in the current AMD workgroup.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_thread_id_x(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_thread_id_x(void) {
   return __builtin_amdgcn_workitem_id_x();
 }
 
 // Returns the 'y' dimension id of the workitem in the current AMD workgroup.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_thread_id_y(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_thread_id_y(void) {
   return __builtin_amdgcn_workitem_id_y();
 }
 
 // Returns the 'z' dimension id of the workitem in the current AMD workgroup.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_thread_id_z(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_thread_id_z(void) {
   return __builtin_amdgcn_workitem_id_z();
 }
 
 // Returns the size of an AMD wavefront, either 32 or 64 depending on hardware
 // and compilation options.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_lanes(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_lanes(void) {
   return __builtin_amdgcn_wavefrontsize();
 }
 
 // Returns the id of the thread inside of an AMD wavefront executing together.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_lane_id(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_lane_id(void) {
   return __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u));
 }
 
 // Returns the bit-mask of active threads in the current wavefront.
-_DEFAULT_FN_ATTRS static __inline__ uint64_t __gpu_lane_mask(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint64_t _gpu_lane_mask(void) {
   return __builtin_amdgcn_read_exec();
 }
 
 // Copies the value from the first active thread in the wavefront to the rest.
 _DEFAULT_FN_ATTRS static __inline__ uint32_t
-__gpu_read_first_lane_u32(uint64_t __lane_mask, uint32_t __x) {
+_gpu_read_first_lane_u32(uint64_t __lane_mask, uint32_t __x) {
   return __builtin_amdgcn_readfirstlane(__x);
 }
 
 // Copies the value from the first active thread in the wavefront to the rest.
 _DEFAULT_FN_ATTRS __inline__ uint64_t
-__gpu_read_first_lane_u64(uint64_t __lane_mask, uint64_t __x) {
+_gpu_read_first_lane_u64(uint64_t __lane_mask, uint64_t __x) {
   uint32_t __hi = (uint32_t)(__x >> 32ull);
   uint32_t __lo = (uint32_t)(__x & 0xFFFFFFFF);
   return ((uint64_t)__builtin_amdgcn_readfirstlane(__hi) << 32ull) |
@@ -125,33 +125,33 @@ __gpu_read_first_lane_u64(uint64_t __lane_mask, uint64_t __x) {
 }
 
 // Returns a bitmask of threads in the current lane for which \p x is true.
-_DEFAULT_FN_ATTRS static __inline__ uint64_t __gpu_ballot(uint64_t __lane_mask,
-                                                          bool __x) {
+_DEFAULT_FN_ATTRS static __inline__ uint64_t _gpu_ballot(uint64_t __lane_mask,
+                                                         bool __x) {
   // The lane_mask & gives the nvptx semantics when lane_mask is a subset of
   // the active threads
   return __lane_mask & __builtin_amdgcn_ballot_w64(__x);
 }
 
 // Waits for all the threads in the block to converge and issues a fence.
-_DEFAULT_FN_ATTRS static __inline__ void __gpu_sync_threads(void) {
+_DEFAULT_FN_ATTRS static __inline__ void _gpu_sync_threads(void) {
   __builtin_amdgcn_s_barrier();
   __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup");
 }
 
 // Wait for all threads in the wavefront to converge, this is a noop on AMDGPU.
-_DEFAULT_FN_ATTRS static __inline__ void __gpu_sync_lane(uint64_t __lane_mask) {
+_DEFAULT_FN_ATTRS static __inline__ void _gpu_sync_lane(uint64_t __lane_mask) {
   __builtin_amdgcn_wave_barrier();
 }
 
 // Shuffles the the lanes inside the wavefront according to the given index.
 _DEFAULT_FN_ATTRS static __inline__ uint32_t
-__gpu_shuffle_idx_u32(uint64_t __lane_mask, uint32_t __idx, uint32_t __x) {
+_gpu_shuffle_idx_u32(uint64_t __lane_mask, uint32_t __idx, uint32_t __x) {
   return __builtin_amdgcn_ds_bpermute(__idx << 2, __x);
 }
 
 // Shuffles the the lanes inside the wavefront according to the given index.
 _DEFAULT_FN_ATTRS static __inline__ uint64_t
-__gpu_shuffle_idx_u64(uint64_t __lane_mask, uint32_t __idx, uint64_t __x) {
+_gpu_shuffle_idx_u64(uint64_t __lane_mask, uint32_t __idx, uint64_t __x) {
   uint32_t __hi = (uint32_t)(__x >> 32ull);
   uint32_t __lo = (uint32_t)(__x & 0xFFFFFFFF);
   return ((uint64_t)__builtin_amdgcn_ds_bpermute(__idx << 2, __hi) << 32ull) |
@@ -159,24 +159,24 @@ __gpu_shuffle_idx_u64(uint64_t __lane_mask, uint32_t __idx, uint64_t __x) {
 }
 
 // Returns true if the flat pointer points to CUDA 'shared' memory.
-_DEFAULT_FN_ATTRS static __inline__ bool __gpu_is_ptr_local(void *ptr) {
+_DEFAULT_FN_ATTRS static __inline__ bool _gpu_is_ptr_local(void *ptr) {
   return __builtin_amdgcn_is_shared((void __attribute__((address_space(0))) *)((
       void [[clang::opencl_generic]] *)ptr));
 }
 
 // Returns true if the flat pointer points to CUDA 'local' memory.
-_DEFAULT_FN_ATTRS static __inline__ bool __gpu_is_ptr_private(void *ptr) {
+_DEFAULT_FN_ATTRS static __inline__ bool _gpu_is_ptr_private(void *ptr) {
   return __builtin_amdgcn_is_private((void __attribute__((
       address_space(0))) *)((void [[clang::opencl_generic]] *)ptr));
 }
 
 // Terminates execution of the associated wavefront.
-_DEFAULT_FN_ATTRS [[noreturn]] static __inline__ void __gpu_exit(void) {
+_DEFAULT_FN_ATTRS [[noreturn]] static __inline__ void _gpu_exit(void) {
   __builtin_amdgcn_endpgm();
 }
 
 // Suspend the thread briefly to assist the scheduler during busy loops.
-_DEFAULT_FN_ATTRS static __inline__ void __gpu_thread_suspend(void) {
+_DEFAULT_FN_ATTRS static __inline__ void _gpu_thread_suspend(void) {
   __builtin_amdgcn_s_sleep(2);
 }
 
diff --git a/clang/lib/Headers/gpuintrin.h b/clang/lib/Headers/gpuintrin.h
index 4c463c333308fc..be4ab81f6c961e 100644
--- a/clang/lib/Headers/gpuintrin.h
+++ b/clang/lib/Headers/gpuintrin.h
@@ -48,56 +48,56 @@ _Pragma("omp begin declare variant match(device = {kind(gpu)})");
 #define __GPU_Z_DIM 2
 
 // Returns the number of blocks in the requested dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_blocks(int __dim) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_blocks(int __dim) {
   switch (__dim) {
   case 0:
-    return __gpu_num_blocks_x();
+    return _gpu_num_blocks_x();
   case 1:
-    return __gpu_num_blocks_y();
+    return _gpu_num_blocks_y();
   case 2:
-    return __gpu_num_blocks_z();
+    return _gpu_num_blocks_z();
   default:
     __builtin_unreachable();
   }
 }
 
 // Returns the number of block id in the requested dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_block_id(int __dim) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_block_id(int __dim) {
   switch (__dim) {
   case 0:
-    return __gpu_block_id_x();
+    return _gpu_block_id_x();
   case 1:
-    return __gpu_block_id_y();
+    return _gpu_block_id_y();
   case 2:
-    return __gpu_block_id_z();
+    return _gpu_block_id_z();
   default:
     __builtin_unreachable();
   }
 }
 
 // Returns the number of threads in the requested dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_threads(int __dim) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_threads(int __dim) {
   switch (__dim) {
   case 0:
-    return __gpu_num_threads_x();
+    return _gpu_num_threads_x();
   case 1:
-    return __gpu_num_threads_y();
+    return _gpu_num_threads_y();
   case 2:
-    return __gpu_num_threads_z();
+    return _gpu_num_threads_z();
   default:
     __builtin_unreachable();
   }
 }
 
 // Returns the thread id in the requested dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_thread_id(int __dim) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_thread_id(int __dim) {
   switch (__dim) {
   case 0:
-    return __gpu_thread_id_x();
+    return _gpu_thread_id_x();
   case 1:
-    return __gpu_thread_id_y();
+    return _gpu_thread_id_y();
   case 2:
-    return __gpu_thread_id_z();
+    return _gpu_thread_id_z();
   default:
     __builtin_unreachable();
   }
@@ -105,83 +105,83 @@ _DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_thread_id(int __dim) {
 
 // Get the first active thread inside the lane.
 _DEFAULT_FN_ATTRS static __inline__ uint64_t
-__gpu_first_lane_id(uint64_t __lane_mask) {
+_gpu_first_lane_id(uint64_t __lane_mask) {
   return __builtin_ffsll(__lane_mask) - 1;
 }
 
 // Conditional that is only true for a single thread in a lane.
 _DEFAULT_FN_ATTRS static __inline__ bool
-__gpu_is_first_in_lane(uint64_t __lane_mask) {
-  return __gpu_lane_id() == __gpu_first_lane_id(__lane_mask);
+_gpu_is_first_in_lane(uint64_t __lane_mask) {
+  return _gpu_lane_id() == _gpu_first_lane_id(__lane_mask);
 }
 
 // Gets the first floating point value from the active lanes.
 _DEFAULT_FN_ATTRS static __inline__ float
-__gpu_read_first_lane_f32(uint64_t __lane_mask, float __x) {
+_gpu_read_first_lane_f32(uint64_t __lane_mask, float __x) {
   return __builtin_bit_cast(
-      float, __gpu_read_first_lane_u32(__lane_mask,
-                                       __builtin_bit_cast(uint32_t, __x)));
+      float,
+      _gpu_read_first_lane_u32(__lane_mask, __builtin_bit_cast(uint32_t, __x)));
 }
 
 // Gets the first floating point value from the active lanes.
 _DEFAULT_FN_ATTRS static __inline__ double
-__gpu_read_first_lane_f64(uint64_t __lane_mask, double __x) {
+_gpu_read_first_lane_f64(uint64_t __lane_mask, double __x) {
   return __builtin_bit_cast(
-      double, __gpu_read_first_lane_u64(__lane_mask,
-                                        __builtin_bit_cast(uint64_t, __x)));
+      double,
+      _gpu_read_first_lane_u64(__lane_mask, __builtin_bit_cast(uint64_t, __x)));
 }
 
 // Shuffles the the lanes according to the given index.
 _DEFAULT_FN_ATTRS static __inline__ float
-__gpu_shuffle_idx_f32(uint64_t __lane_mask, uint32_t __idx, float __x) {
+_gpu_shuffle_idx_f32(uint64_t __lane_mask, uint32_t __idx, float __x) {
   return __builtin_bit_cast(
-      float, __gpu_shuffle_idx_u32(__lane_mask, __idx,
-                                   __builtin_bit_cast(uint32_t, __x)));
+      float, _gpu_shuffle_idx_u32(__lane_mask, __idx,
+                                  __builtin_bit_cast(uint32_t, __x)));
 }
 
 // Shuffles the the lanes according to the given index.
 _DEFAULT_FN_ATTRS static __inline__ double
-__gpu_shuffle_idx_f64(uint64_t __lane_mask, uint32_t __idx, double __x) {
+_gpu_shuffle_idx_f64(uint64_t __lane_mask, uint32_t __idx, double __x) {
   return __builtin_bit_cast(
-      double, __gpu_shuffle_idx_u64(__lane_mask, __idx,
-                                    __builtin_bit_cast(uint64_t, __x)));
+      double, _gpu_shuffle_idx_u64(__lane_mask, __idx,
+                                   __builtin_bit_cast(uint64_t, __x)));
 }
 
 // Gets the sum of all lanes inside the warp or wavefront.
 #define __DO_LANE_SUM(__type, __suffix)                                        \
-  _DEFAULT_FN_ATTRS static __inline__ __type __gpu_lane_sum_##__suffix(        \
+  _DEFAULT_FN_ATTRS static __inline__ __type _gpu_lane_sum_##__suffix(         \
       uint64_t __lane_mask, __type __x) {                                      \
-    for (uint32_t __step = __gpu_num_lanes() / 2; __step > 0; __step /= 2) {   \
-      uint32_t __index = __step + __gpu_lane_id();                             \
-      __x += __gpu_shuffle_idx_##__suffix(__lane_mask, __index, __x);          \
+    for (uint32_t __step = _gpu_num_lanes() / 2; __step > 0; __step /= 2) {    \
+      uint32_t __index = __step + _gpu_lane_id();                              \
+      __x += _gpu_shuffle_idx_##__suffix(__lane_mask, __index, __x);           \
     }                                                                          \
-    return __gpu_read_first_lane_##__suffix(__lane_mask, __x);                 \
+    return _gpu_read_first_lane_##__suffix(__lane_mask, __x);                  \
   }
-__DO_LANE_SUM(uint32_t, u32); // uint32_t __gpu_lane_sum_u32(m, x)
-__DO_LANE_SUM(uint64_t, u64); // uint64_t __gpu_lane_sum_u64(m, x)
-__DO_LANE_SUM(float, f32);    // float __gpu_lane_sum_f32(m, x)
-__DO_LANE_SUM(double, f64);   // double __gpu_lane_sum_f64(m, x)
+__DO_LANE_SUM(uint32_t, u32); // uint32_t _gpu_lane_sum_u32(m, x)
+__DO_LANE_SUM(uint64_t, u64); // uint64_t _gpu_lane_sum_u64(m, x)
+__DO_LANE_SUM(float, f32);    // float _gpu_lane_sum_f32(m, x)
+__DO_LANE_SUM(double, f64);   // double _gpu_lane_sum_f64(m, x)
 #undef __DO_LANE_SUM
 
 // Gets the accumulator scan of the threads in the warp or wavefront.
 #define __DO_LANE_SCAN(__type, __bitmask_type, __suffix)                       \
-  _DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_lane_scan_##__suffix(     \
+  _DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_lane_scan_##__suffix(      \
       uint64_t __lane_mask, uint32_t __x) {                                    \
-    for (uint32_t __step = 1; __step < __gpu_num_lanes(); __step *= 2) {       \
-      uint32_t __index = __gpu_lane_id() - __step;                             \
-      __bitmask_type bitmask = __gpu_lane_id() >= __step;                      \
+    for (uint32_t __step = 1; __step < _gpu_num_lanes(); __step *= 2) {        \
+      uint32_t __index = _gpu_lane_id() - __step;                              \
+      __bitmask_type bitmask = _gpu_lane_id() >= __step;                       \
       __x += __builtin_bit_cast(                                               \
           __type,                                                              \
           -bitmask & __builtin_bit_cast(__bitmask_type,                        \
-                                        __gpu_shuffle_idx_##__suffix(          \
+                                        _gpu_shuffle_idx_##__suffix(           \
                                             __lane_mask, __index, __x)));      \
     }                                                                          \
     return __x;                                                                \
   }
-__DO_LANE_SCAN(uint32_t, uint32_t, u32); // uint32_t __gpu_lane_scan_u32(m, x)
-__DO_LANE_SCAN(uint64_t, uint64_t, u64); // uint64_t __gpu_lane_scan_u64(m, x)
-__DO_LANE_SCAN(float, uint32_t, f32);    // float __gpu_lane_scan_f32(m, x)
-__DO_LANE_SCAN(double, uint64_t, f64);   // double __gpu_lane_scan_f64(m, x)
+__DO_LANE_SCAN(uint32_t, uint32_t, u32); // uint32_t _gpu_lane_scan_u32(m, x)
+__DO_LANE_SCAN(uint64_t, uint64_t, u64); // uint64_t _gpu_lane_scan_u64(m, x)
+__DO_LANE_SCAN(float, uint32_t, f32);    // float _gpu_lane_scan_f32(m, x)
+__DO_LANE_SCAN(double, uint64_t, f64);   // double _gpu_lane_scan_f64(m, x)
 #undef __DO_LANE_SCAN
 
 _Pragma("omp end declare variant");
diff --git a/clang/lib/Headers/nvptxintrin.h b/clang/lib/Headers/nvptxintrin.h
index 962dca9cf03126..14ff684cb893a4 100644
--- a/clang/lib/Headers/nvptxintrin.h
+++ b/clang/lib/Headers/nvptxintrin.h
@@ -34,159 +34,159 @@ _Pragma("omp begin declare variant match(device = {arch(nvptx64)})");
 #define __gpu_kernel __attribute__((nvptx_kernel, visibility("protected")))
 
 // Returns the number of CUDA blocks in the 'x' dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_blocks_x(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_blocks_x(void) {
   return __nvvm_read_ptx_sreg_nctaid_x();
 }
 
 // Returns the number of CUDA blocks in the 'y' dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_blocks_y(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_blocks_y(void) {
   return __nvvm_read_ptx_sreg_nctaid_y();
 }
 
 // Returns the number of CUDA blocks in the 'z' dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_blocks_z(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_blocks_z(void) {
   return __nvvm_read_ptx_sreg_nctaid_z();
 }
 
 // Returns the 'x' dimension of the current CUDA block's id.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_block_id_x(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_block_id_x(void) {
   return __nvvm_read_ptx_sreg_ctaid_x();
 }
 
 // Returns the 'y' dimension of the current CUDA block's id.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_block_id_y(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_block_id_y(void) {
   return __nvvm_read_ptx_sreg_ctaid_y();
 }
 
 // Returns the 'z' dimension of the current CUDA block's id.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_block_id_z(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_block_id_z(void) {
   return __nvvm_read_ptx_sreg_ctaid_z();
 }
 
 // Returns the number of CUDA threads in the 'x' dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_threads_x(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_threads_x(void) {
   return __nvvm_read_ptx_sreg_ntid_x();
 }
 
 // Returns the number of CUDA threads in the 'y' dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_threads_y(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_threads_y(void) {
   return __nvvm_read_ptx_sreg_ntid_y();
 }
 
 // Returns the number of CUDA threads in the 'z' dimension.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_num_threads_z(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_num_threads_z(void) {
   return __nvvm_read_ptx_sreg_ntid_z();
 }
 
 // Returns the 'x' dimension id of the thread in the current CUDA block.
-_DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_thread_id_x(void) {
+_DEFAULT_FN_ATTRS static __inline__ uint32_t _gpu_thread_id_x(void) {
   return __nvvm_read_ptx_sreg_tid_x();
 }
 
 // Returns the 'y' dimension id of the thread in the current CU...
[truncated]

@llvmbot
Member

llvmbot commented Dec 4, 2024

@llvm/pr-subscribers-clang

@jhuber6
Contributor Author

jhuber6 commented Dec 4, 2024

I'm fine with this solution, but I'll sit on it for a bit to see if @AaronBallman has an opinion or if the clangd people get back to me on #118684.

@AaronBallman
Collaborator

I'm not opposed, but it doesn't seem particularly well-motivated to me; I think the only real motivation is to work around a frustrating behavior in clangd. I think double underscores are better understood by users to be reserved names.

@jhuber6
Contributor Author

jhuber6 commented Dec 15, 2024

Unnecessary after updating clangd.

@jhuber6 jhuber6 closed this Dec 15, 2024

Labels

  • backend:AMDGPU
  • backend:X86
  • clang:headers (Headers provided by Clang, e.g. for intrinsics)
  • clang (Clang issues not falling into any other category)
  • libc
