[libclc] Use __ocml_cos/sin/tan/exp*/lgamma/log*/fmax/fmin/sqrt for AMDGPU #153328

wenju-he · 2025-08-13T01:27:11Z

Motivation is to upstream corresponding implementations at: https://github.com/intel/llvm/tree/sycl/libclc/clc/lib/amdgcn/math https://github.com/intel/llvm/tree/sycl/libclc/libspirv/lib/amdgcn-amdhsa/math

wenju-he · 2025-08-13T01:28:09Z

Changes of this PR:

cos/sin/tan/exp*/lgamma/log*:
The generic implementation in https://github.com/llvm/llvm-project/tree/main/libclc/clc/lib/generic/math is replaced with __ocml_* functions.
sqrt:

> %15 = tail call float @llvm.sqrt.f32(float %14), !fpmath !4
> %16 = fmul float %15, 0x3A90000000000000
< %15 = tail call float @__ocml_sqrt_f32(float noundef %14) #25
fmax:

> %3 = tail call noundef <2 x float> @llvm.maximumnum.v2f32(<2 x float> %0, <2 x float> %1)
< %5 = tail call float @__ocml_fmax_f32(float noundef %3, float noundef %4) #25
< %9 = tail call float @__ocml_fmax_f32(float noundef %7, float noundef %8) #25
fmin:

> %3 = tail call noundef <2 x float> @llvm.minimumnum.v2f32(<2 x float> %0, <2 x float> %1)
< %5 = tail call float @__ocml_fmin_f32(float noundef %3, float noundef %4) #25
< %9 = tail call float @__ocml_fmin_f32(float noundef %7, float noundef %8) #25

Using __ocml_* functions scalarizes the vector overloads, whereas the generic implementations stay vectorized for vector inputs, so the generic implementation has an advantage there.
My understanding is that the fmax/fmin changes in this PR are not necessary because of this scalarization. However, I do see that fmax/fmin in rocm-6.3.3/lib/llvm/lib/clang/18/lib/amdgcn/bitcode/opencl.bc is implemented with the scalar __ocml_fmax/fmin.
@ArsenArsen please review whether the changes in this PR are necessary. If not, we can remove the downstream implementations. The downstream implementations cause llvm-diff changes for the amdgcn--amdhsa target when refactoring libspirv to use __clc_* functions.
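
To make the scalarization point concrete, here is a minimal plain-C sketch of what a scalarized vector overload looks like. The type `float2_t` and the helpers `scalar_fmax` and `fmax_scalarized` are hypothetical names for illustration only; `scalar_fmax` merely stands in for a scalar routine such as `__ocml_fmax_f32` and is not OCML's implementation.

```c
#include <stddef.h>

/* Hypothetical two-lane vector type standing in for OpenCL's float2. */
typedef struct {
  float lane[2];
} float2_t;

/* Hypothetical stand-in for a scalar library routine such as
 * __ocml_fmax_f32. It only mimics the fmax NaN rule: when exactly one
 * operand is NaN, the other operand is returned. */
float scalar_fmax(float x, float y) {
  if (x != x) /* x is NaN */
    return y;
  if (y != y) /* y is NaN */
    return x;
  return x > y ? x : y;
}

/* Scalarized vector overload: one scalar call per lane, which is the
 * shape of the two __ocml_fmax_f32 calls in the fmax diff above. */
float2_t fmax_scalarized(float2_t a, float2_t b) {
  float2_t r;
  for (size_t i = 0; i < 2; ++i)
    r.lane[i] = scalar_fmax(a.lane[i], b.lane[i]);
  return r;
}
```

In the vectorized form, by contrast, the whole two-lane value goes through a single `@llvm.maximumnum.v2f32` call, as in the first line of the fmax diff.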

arsenm

I'm fully rejecting this as a concept. libclc shall not depend on OCML.

libclc is a fork of OCML. It's dysfunctional that it's come to this, that libclc would end up calling into the original base. If you want to use these implementations, I'd rather merge the OCML content into libclc and migrate over to using it.

At the current moment, most of the f32 and f16 operations here should just use the llvm intrinsics. The intrinsics have inline implementations that are identical to the library content (some cases in OCML are already directly calling the intrinsic, so this is a pointless level of indirection). We do need to use some kind of external call for the trig and lgamma cases.

In the longer term, we should exclusively emit the llvm intrinsics. The choice of how to implement the libm equivalents, involving a call or not, should be a compiler backend decision. At the current time, neither libclc nor OCML is in a state usable as a compiler runtime library. The usage model is backwards; I want to migrate to a state where the actual implementations have a usage model closer to compiler-rt. What we do now for both libraries is more like a precompiled header that cannot be relied on from the compiler.

arsenm · 2025-08-13T01:36:09Z

libclc/clc/lib/amdgcn/math/clc_exp.cl

+float __ocml_exp_f32(float);
+_CLC_OVERLOAD _CLC_DEF float __clc_exp(float x) { return __ocml_exp_f32(x); }
+


This is identical to the llvm intrinsic

arsenm · 2025-08-13T01:36:18Z

libclc/clc/lib/amdgcn/math/clc_exp.cl

+#ifdef cl_khr_fp16
+#pragma OPENCL EXTENSION cl_khr_fp16 : enable
+half __ocml_exp_f16(half);
+_CLC_OVERLOAD _CLC_DEF half __clc_exp(half x) { return __ocml_exp_f16(x); }
+#endif


This is identical to the llvm intrinsic

arsenm · 2025-08-13T01:36:27Z

libclc/clc/lib/amdgcn/math/clc_exp10.cl

+float __ocml_exp10_f32(float);
+_CLC_OVERLOAD _CLC_DEF float __clc_exp10(float x) {
+  return __ocml_exp10_f32(x);
+}


This is identical to the llvm intrinsic

arsenm · 2025-08-13T01:36:38Z

libclc/clc/lib/amdgcn/math/clc_exp10.cl

+#ifdef cl_khr_fp16
+#pragma OPENCL EXTENSION cl_khr_fp16 : enable
+half __ocml_exp10_f16(half);
+_CLC_OVERLOAD _CLC_DEF half __clc_exp10(half x) { return __ocml_exp10_f16(x); }


This is identical to the llvm intrinsic

arsenm · 2025-08-13T01:36:51Z

libclc/clc/lib/amdgcn/math/clc_fmax.cl

+float __ocml_fmax_f32(float, float);
+_CLC_OVERLOAD _CLC_DEF float __clc_fmax(float x, float y) {
+  return __ocml_fmax_f32(x, y);
+}
+
+#ifdef cl_khr_fp64
+#pragma OPENCL EXTENSION cl_khr_fp64 : enable
+double __ocml_fmax_f64(double, double);
+_CLC_OVERLOAD _CLC_DEF double __clc_fmax(double x, double y) {
+  return __ocml_fmax_f64(x, y);
+}
+#endif
+
+#ifdef cl_khr_fp16
+#pragma OPENCL EXTENSION cl_khr_fp16 : enable
+half __ocml_fmax_f16(half, half);
+_CLC_OVERLOAD _CLC_DEF half __clc_fmax(half x, half y) {
+  return __ocml_fmax_f16(x, y);
+}
+#endif


This should directly use llvm intrinsics

arsenm · 2025-08-13T01:37:04Z

libclc/clc/lib/amdgcn/math/clc_sqrt_fp64.cl

+#ifdef cl_khr_fp64
+#pragma OPENCL EXTENSION cl_khr_fp64 : enable
+double __ocml_sqrt_f64(double);
+_CLC_OVERLOAD _CLC_DEF double __clc_sqrt(double x) {
+  return __ocml_sqrt_f64(x);
+}
+#endif


This should just use the llvm intrinsic

arsenm · 2025-08-13T01:37:20Z

libclc/clc/lib/amdgcn/math/clc_exp2.cl

+float __ocml_exp2_f32(float);
+_CLC_OVERLOAD _CLC_DEF float __clc_exp2(float x) { return __ocml_exp2_f32(x); }


This is identical to the llvm intrinsic

arsenm · 2025-08-13T01:37:31Z

libclc/clc/lib/amdgcn/math/clc_exp2.cl

+#ifdef cl_khr_fp16
+#pragma OPENCL EXTENSION cl_khr_fp16 : enable
+half __ocml_exp2_f16(half);
+_CLC_OVERLOAD _CLC_DEF half __clc_exp2(half x) { return __ocml_exp2_f16(x); }
+#endif


This is identical to the llvm intrinsic

wenju-he · 2025-08-13T01:55:59Z

> I'm fully rejecting this as a concept. libclc shall not depend on OCML.

Thanks @arsenm. I saw a lot of llvm-diff changes caused by the use of OCML while refactoring downstream, so I put this up to consult with you. I'll close this PR.

> In the longer term, we should exclusively emit the llvm intrinsics. The choice of how to implement the libm equivalents, involving a call or not, should be a compiler backend decision. At the current time, neither libclc nor OCML is in a state usable as a compiler runtime library. The usage model is backwards; I want to migrate to a state where the actual implementations have a usage model closer to compiler-rt. What we do now for both libraries is more like a precompiled header that cannot be relied on from the compiler.

Just to confirm: libclc will emit LLVM intrinsics, and this is the state we want for libclc in the future, right?

Also delete double/half type implementations, which are not allowed per spec. llvm-diff shows lots of changes in libspirv-amdgcn--amdhsa.bc and libspirv-nvptx64--nvidiacl.bc, due to use of __ocml_* and __nv_* built-ins in clc and libspirv in intel/llvm repo. Based on review comment in llvm/llvm-project#153328, libclc shouldn't use __ocml_*. So bitcode change in this PR is expected.

wenju-he · 2025-08-14T02:23:13Z

> If you want to use these implementations, I'd rather merge the OCML content into libclc and migrate over to using it.

Tag @frasercrmck
The amdgcn implementation probably needs to be improved to use the LLVM elementwise builtins, e.g. for half/float exp*.


arsenm · 2025-10-11T02:10:22Z

> Just to confirm, libclc will emit llvm intrinsic and this is the state that we want for the future libclc, right?

Kind of yes and kind of no. libclc should have some code splitting and not be responsible for the lowest-level implementation pieces. I'm going to give a talk about this at the upcoming GPU workshop at the conference.

wenju-he · 2025-10-11T02:24:06Z

> > Just to confirm, libclc will emit llvm intrinsic and this is the state that we want for the future libclc, right?
>
> Kind of yes and kind of no. libclc should have some code splitting and not be responsible for the lowest-level implementation pieces. I'm going to give a talk about this at the upcoming GPU workshop at the conference.

That's great. Unfortunately I can't attend the workshop. Will it be recorded and uploaded, e.g. to YouTube? I'd like to watch the talk.

arsenm · 2025-10-11T02:36:40Z

> That's great. Unfortunately I can't attend the workshop. Will it be recorded and uploaded, e.g. to YouTube? I'd like to watch the talk.

No, but I can post the slides after

wenju-he · 2025-10-11T02:49:57Z

> > That's great. Unfortunately I can't attend the workshop. Will it be recorded and uploaded, e.g. to YouTube? I'd like to watch the talk.
>
> No, but I can post the slides after

Thanks, I look forward to it.

wenju-he requested a review from frasercrmck August 13, 2025 01:27

llvmbot added the libclc label Aug 13, 2025

wenju-he requested a review from arsenm August 13, 2025 01:27

arsenm requested changes Aug 13, 2025


wenju-he closed this Aug 13, 2025

wenju-he deleted the libclc-amdgcn-math-ocml branch August 13, 2025 02:01

wenju-he mentioned this pull request Aug 13, 2025

[libspirv] Use clc function in libspirv generic math half_* functions intel/llvm#19779

Merged
