Conversation

@NickGuy-Arm (Contributor) commented Jan 7, 2025

If the streaming mode of the calling function is known, emitting @llvm.assume before calls to __arm_streaming_compatible functions lets other passes, such as the partial inliner, make more informed decisions. In the case of the partial inliner, only the live code paths need to be inlined, leaving the other code paths to be removed as part of dead code elimination.
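
For context, a hedged sketch of the pattern this targets (the function names below are illustrative, not taken from the patch): a streaming-compatible callee that branches on the current mode, called from a function whose mode is fixed.

#include <arm_sme.h>

// Streaming-compatible: behaviour depends on the caller's PSTATE.SM,
// so the function must handle both modes.
int which_path(void) __arm_streaming_compatible {
  if (__arm_in_streaming_mode())
    return 1; // streaming-only path
  return 0;   // non-streaming path
}

// The caller's mode is fixed (streaming), so with llvm.assume emitted
// before the call, the partial inliner can inline just the "return 1"
// path and leave the rest to dead code elimination.
int caller(void) __arm_streaming { return which_path(); }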

@llvmbot added the clang (Clang issues not falling into any other category) and clang:codegen (IR generation bugs: mangling, exceptions, etc.) labels Jan 7, 2025
@llvmbot (Member) commented Jan 7, 2025

@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-clang-codegen

Author: Nicholas Guy (NickGuy-Arm)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/121917.diff

2 Files Affected:

  • (modified) clang/lib/CodeGen/CGBuiltin.cpp (+6)
  • (modified) clang/test/CodeGen/AArch64/sme-intrinsics/acle_sme_state_funs.c (+22-13)
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index dcea32969fb990..3765285c58f6ca 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -11335,6 +11335,12 @@ Value *CodeGenFunction::EmitAArch64SMEBuiltinExpr(unsigned BuiltinID,
       unsigned SMEAttrs = FPT->getAArch64SMEAttributes();
       if (!(SMEAttrs & FunctionType::SME_PStateSMCompatibleMask)) {
         bool IsStreaming = SMEAttrs & FunctionType::SME_PStateSMEnabledMask;
+        // Emit the llvm.assume intrinsic so that called functions can use the
+        // streaming mode information discerned here
+        Value* call = Builder.CreateCall(CGM.getIntrinsic(Builtin->LLVMIntrinsic));
+        if (!IsStreaming)
+          call = Builder.CreateNot(call);
+        Builder.CreateIntrinsic(Intrinsic::assume, {}, {call});
         return ConstantInt::getBool(Builder.getContext(), IsStreaming);
       }
     }
diff --git a/clang/test/CodeGen/AArch64/sme-intrinsics/acle_sme_state_funs.c b/clang/test/CodeGen/AArch64/sme-intrinsics/acle_sme_state_funs.c
index 72f2d17fc6dc11..1e630e196fcb66 100644
--- a/clang/test/CodeGen/AArch64/sme-intrinsics/acle_sme_state_funs.c
+++ b/clang/test/CodeGen/AArch64/sme-intrinsics/acle_sme_state_funs.c
@@ -22,23 +22,32 @@ bool test_in_streaming_mode_streaming_compatible(void) __arm_streaming_compatibl
 
 // CHECK-LABEL: @test_in_streaming_mode_streaming(
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[TMP0:%.*]] = tail call i1 @llvm.aarch64.sme.in.streaming.mode()
+// CHECK-NEXT:    tail call void @llvm.assume(i1 [[TMP0]])
 // CHECK-NEXT:    ret i1 true
 //
 // CPP-CHECK-LABEL: @_Z32test_in_streaming_mode_streamingv(
 // CPP-CHECK-NEXT:  entry:
+// CPP-CHECK-NEXT:    [[TMP0:%.*]] = tail call i1 @llvm.aarch64.sme.in.streaming.mode()
+// CPP-CHECK-NEXT:    tail call void @llvm.assume(i1 [[TMP0]])
 // CPP-CHECK-NEXT:    ret i1 true
 //
 bool test_in_streaming_mode_streaming(void) __arm_streaming {
-//
   return __arm_in_streaming_mode();
 }
 
 // CHECK-LABEL: @test_in_streaming_mode_non_streaming(
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[TMP0:%.*]] = tail call i1 @llvm.aarch64.sme.in.streaming.mode()
+// CHECK-NEXT:    [[TMP1:%.*]] = xor i1 [[TMP0]], true
+// CHECK-NEXT:    tail call void @llvm.assume(i1 [[TMP1]])
 // CHECK-NEXT:    ret i1 false
 //
 // CPP-CHECK-LABEL: @_Z36test_in_streaming_mode_non_streamingv(
 // CPP-CHECK-NEXT:  entry:
+// CPP-CHECK-NEXT:    [[TMP0:%.*]] = tail call i1 @llvm.aarch64.sme.in.streaming.mode()
+// CPP-CHECK-NEXT:    [[TMP1:%.*]] = xor i1 [[TMP0]], true
+// CPP-CHECK-NEXT:    tail call void @llvm.assume(i1 [[TMP1]])
 // CPP-CHECK-NEXT:    ret i1 false
 //
 bool test_in_streaming_mode_non_streaming(void) {
@@ -47,12 +56,12 @@ bool test_in_streaming_mode_non_streaming(void) {
 
 // CHECK-LABEL: @test_za_disable(
 // CHECK-NEXT:  entry:
-// CHECK-NEXT:    tail call void @__arm_za_disable() #[[ATTR7:[0-9]+]]
+// CHECK-NEXT:    tail call void @__arm_za_disable() #[[ATTR8:[0-9]+]]
 // CHECK-NEXT:    ret void
 //
 // CPP-CHECK-LABEL: @_Z15test_za_disablev(
 // CPP-CHECK-NEXT:  entry:
-// CPP-CHECK-NEXT:    tail call void @__arm_za_disable() #[[ATTR7:[0-9]+]]
+// CPP-CHECK-NEXT:    tail call void @__arm_za_disable() #[[ATTR8:[0-9]+]]
 // CPP-CHECK-NEXT:    ret void
 //
 void test_za_disable(void) __arm_streaming_compatible {
@@ -61,14 +70,14 @@ void test_za_disable(void) __arm_streaming_compatible {
 
 // CHECK-LABEL: @test_has_sme(
 // CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = tail call aarch64_sme_preservemost_from_x2 { i64, i64 } @__arm_sme_state() #[[ATTR7]]
+// CHECK-NEXT:    [[TMP0:%.*]] = tail call aarch64_sme_preservemost_from_x2 { i64, i64 } @__arm_sme_state() #[[ATTR8]]
 // CHECK-NEXT:    [[TMP1:%.*]] = extractvalue { i64, i64 } [[TMP0]], 0
 // CHECK-NEXT:    [[TOBOOL_I:%.*]] = icmp slt i64 [[TMP1]], 0
 // CHECK-NEXT:    ret i1 [[TOBOOL_I]]
 //
 // CPP-CHECK-LABEL: @_Z12test_has_smev(
 // CPP-CHECK-NEXT:  entry:
-// CPP-CHECK-NEXT:    [[TMP0:%.*]] = tail call aarch64_sme_preservemost_from_x2 { i64, i64 } @__arm_sme_state() #[[ATTR7]]
+// CPP-CHECK-NEXT:    [[TMP0:%.*]] = tail call aarch64_sme_preservemost_from_x2 { i64, i64 } @__arm_sme_state() #[[ATTR8]]
 // CPP-CHECK-NEXT:    [[TMP1:%.*]] = extractvalue { i64, i64 } [[TMP0]], 0
 // CPP-CHECK-NEXT:    [[TOBOOL_I:%.*]] = icmp slt i64 [[TMP1]], 0
 // CPP-CHECK-NEXT:    ret i1 [[TOBOOL_I]]
@@ -91,12 +100,12 @@ void test_svundef_za(void) __arm_streaming_compatible __arm_out("za") {
 
 // CHECK-LABEL: @test_sc_memcpy(
 // CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memcpy(ptr noundef [[DEST:%.*]], ptr noundef [[SRC:%.*]], i64 noundef [[N:%.*]]) #[[ATTR7]]
+// CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memcpy(ptr noundef [[DEST:%.*]], ptr noundef [[SRC:%.*]], i64 noundef [[N:%.*]]) #[[ATTR8]]
 // CHECK-NEXT:    ret ptr [[CALL]]
 //
 // CPP-CHECK-LABEL: @_Z14test_sc_memcpyPvPKvm(
 // CPP-CHECK-NEXT:  entry:
-// CPP-CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memcpy(ptr noundef [[DEST:%.*]], ptr noundef [[SRC:%.*]], i64 noundef [[N:%.*]]) #[[ATTR7]]
+// CPP-CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memcpy(ptr noundef [[DEST:%.*]], ptr noundef [[SRC:%.*]], i64 noundef [[N:%.*]]) #[[ATTR8]]
 // CPP-CHECK-NEXT:    ret ptr [[CALL]]
 //
 void *test_sc_memcpy(void *dest, const void *src, size_t n) __arm_streaming_compatible {
@@ -105,12 +114,12 @@ void *test_sc_memcpy(void *dest, const void *src, size_t n) __arm_streaming_comp
 
 // CHECK-LABEL: @test_sc_memmove(
 // CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memmove(ptr noundef [[DEST:%.*]], ptr noundef [[SRC:%.*]], i64 noundef [[N:%.*]]) #[[ATTR7]]
+// CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memmove(ptr noundef [[DEST:%.*]], ptr noundef [[SRC:%.*]], i64 noundef [[N:%.*]]) #[[ATTR8]]
 // CHECK-NEXT:    ret ptr [[CALL]]
 //
 // CPP-CHECK-LABEL: @_Z15test_sc_memmovePvPKvm(
 // CPP-CHECK-NEXT:  entry:
-// CPP-CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memmove(ptr noundef [[DEST:%.*]], ptr noundef [[SRC:%.*]], i64 noundef [[N:%.*]]) #[[ATTR7]]
+// CPP-CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memmove(ptr noundef [[DEST:%.*]], ptr noundef [[SRC:%.*]], i64 noundef [[N:%.*]]) #[[ATTR8]]
 // CPP-CHECK-NEXT:    ret ptr [[CALL]]
 //
 void *test_sc_memmove(void *dest, const void *src, size_t n) __arm_streaming_compatible {
@@ -119,12 +128,12 @@ void *test_sc_memmove(void *dest, const void *src, size_t n) __arm_streaming_com
 
 // CHECK-LABEL: @test_sc_memset(
 // CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memset(ptr noundef [[S:%.*]], i32 noundef [[C:%.*]], i64 noundef [[N:%.*]]) #[[ATTR7]]
+// CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memset(ptr noundef [[S:%.*]], i32 noundef [[C:%.*]], i64 noundef [[N:%.*]]) #[[ATTR8]]
 // CHECK-NEXT:    ret ptr [[CALL]]
 //
 // CPP-CHECK-LABEL: @_Z14test_sc_memsetPvim(
 // CPP-CHECK-NEXT:  entry:
-// CPP-CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memset(ptr noundef [[S:%.*]], i32 noundef [[C:%.*]], i64 noundef [[N:%.*]]) #[[ATTR7]]
+// CPP-CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memset(ptr noundef [[S:%.*]], i32 noundef [[C:%.*]], i64 noundef [[N:%.*]]) #[[ATTR8]]
 // CPP-CHECK-NEXT:    ret ptr [[CALL]]
 //
 void *test_sc_memset(void *s, int c, size_t n) __arm_streaming_compatible {
@@ -133,12 +142,12 @@ void *test_sc_memset(void *s, int c, size_t n) __arm_streaming_compatible {
 
 // CHECK-LABEL: @test_sc_memchr(
 // CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memchr(ptr noundef [[S:%.*]], i32 noundef [[C:%.*]], i64 noundef [[N:%.*]]) #[[ATTR7]]
+// CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memchr(ptr noundef [[S:%.*]], i32 noundef [[C:%.*]], i64 noundef [[N:%.*]]) #[[ATTR8]]
 // CHECK-NEXT:    ret ptr [[CALL]]
 //
 // CPP-CHECK-LABEL: @_Z14test_sc_memchrPvim(
 // CPP-CHECK-NEXT:  entry:
-// CPP-CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memchr(ptr noundef [[S:%.*]], i32 noundef [[C:%.*]], i64 noundef [[N:%.*]]) #[[ATTR7]]
+// CPP-CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @__arm_sc_memchr(ptr noundef [[S:%.*]], i32 noundef [[C:%.*]], i64 noundef [[N:%.*]]) #[[ATTR8]]
 // CPP-CHECK-NEXT:    ret ptr [[CALL]]
 //
 void *test_sc_memchr(void *s, int c, size_t n) __arm_streaming_compatible {

@llvmbot (Member) commented Jan 7, 2025

@llvm/pr-subscribers-clang

Author: Nicholas Guy (NickGuy-Arm)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/121917.diff

@github-actions bot commented Jan 7, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

A Collaborator commented on this hunk:

Value* call = Builder.CreateCall(CGM.getIntrinsic(Builtin->LLVMIntrinsic));
if (!IsStreaming)
  call = Builder.CreateNot(call);
Builder.CreateIntrinsic(Intrinsic::assume, {}, {call});
Emitting the llvm.assume only for calls to __arm_in_streaming_mode() has an effect only if it is followed by a call to a streaming-compatible function.

To make this more useful generically, I think it needs to be emitted before any call from a streaming or non-streaming function to a streaming-compatible function. Then the partial inliner, IPSCCP pass, or function specialization pass can use the information to specialize or inline the callee.
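
A minimal sketch of what that suggestion could look like, reusing the emission from this patch's diff; the helper's name, placement, and signature are assumed, and CallerIsStreaming would come from the SME_PStateSMEnabledMask check quoted later in this review:

// Assumed hook, invoked just before CodeGen emits a call to a
// streaming-compatible callee from a function whose mode is known.
static void emitStreamingModeAssume(clang::CodeGen::CodeGenFunction &CGF,
                                    bool CallerIsStreaming) {
  llvm::Value *InSM = CGF.Builder.CreateCall(
      CGF.CGM.getIntrinsic(llvm::Intrinsic::aarch64_sme_in_streaming_mode));
  if (!CallerIsStreaming)
    InSM = CGF.Builder.CreateNot(InSM); // assume we are NOT in streaming mode
  CGF.Builder.CreateIntrinsic(llvm::Intrinsic::assume, {}, {InSM});
}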

@efriedma-quic (Collaborator)

I really don't see why we'd want to implement the optimization this way; can't we just add an instcombine for calls to llvm.aarch64.sme.in.streaming.mode?
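
For reference, a sketch of that alternative, assuming it would live in the AArch64 instCombineIntrinsic hook with LLVM's SMEAttrs helper in scope (the exact placement and spelling here are assumptions):

// Fold llvm.aarch64.sme.in.streaming.mode when the containing function's
// streaming mode is known from its SME attributes.
static std::optional<Instruction *>
foldInStreamingMode(InstCombiner &IC, IntrinsicInst &II) {
  SMEAttrs FnAttrs(*II.getFunction());
  // Known-streaming: streaming interface or locally-streaming body.
  if (FnAttrs.hasStreamingInterface() || FnAttrs.hasStreamingBody())
    return IC.replaceInstUsesWith(II, ConstantInt::getTrue(II.getType()));
  // Known-non-streaming: neither streaming nor streaming-compatible.
  if (!FnAttrs.hasStreamingCompatibleInterface())
    return IC.replaceInstUsesWith(II, ConstantInt::getFalse(II.getType()));
  return std::nullopt; // streaming-compatible: unknown at compile time
}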

@NickGuy-Arm (Contributor, Author)

> I really don't see why we'd want to implement the optimization this way; can't we just add an instcombine for calls to llvm.aarch64.sme.in.streaming.mode?

If what this patch initially did was our actual goal, then that would be an option. However, what we want to achieve with this is to make the state of the streaming mode known to potentially inlineable called functions, allowing for further inlining and identification/removal of dead code. I've fixed the implementation to actually achieve this, and I'll update the description with a bit more context.

@NickGuy-Arm changed the title from "[clang] Emit @llvm.assume when we know the streaming mode of the function" to "[clang] Emit @llvm.assume before streaming_compatible functions when the streaming mode is known" Jan 8, 2025
@sdesmalen-arm (Collaborator)

@efriedma-quic a motivating example is https://godbolt.org/z/aW9zTrdf9

With @NickGuy-Arm's patch, this would compile to:

_Z1xv:
        rdvl    x0, #1
        ret

_Z1yv:
        rdsvl   x0, #1
        ret
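
(The godbolt source itself isn't reproduced in the thread; given the mangled names and the rdvl/rdsvl pair, a plausible shape for it is the one below — a guess, not the actual example — where a streaming-compatible vector-length query is inlined into one non-streaming and one streaming caller:)

#include <arm_sve.h>

// Streaming-compatible: svcntb() reads the vector length in effect at
// the call site, so its result depends on PSTATE.SM.
uint64_t veclen() __arm_streaming_compatible { return svcntb(); }

uint64_t x() { return veclen(); }                 // folds to: rdvl  x0, #1
uint64_t y() __arm_streaming { return veclen(); } // folds to: rdsvl x0, #1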

@sdesmalen-arm (Collaborator) left a comment

@NickGuy-Arm this PR is missing some test-coverage.

@efriedma-quic do you have any thoughts on emitting the llvm.assume as a prologue to the actual call? I guess the alternative would be to analyse the function to see if there's any calls to streaming-compatible functions, and then emit the llvm.assume in the function's prologue or something, but that would be more work (and I'm not sure if that's worth it).

Comment on lines +1285 to +1290
const AArch64ABIInfo &ABIInfo = getABIInfo<AArch64ABIInfo>();
const TargetInfo &TI = ABIInfo.getContext().getTargetInfo();

if (!TI.hasFeature("sme"))
  return;

Is this check really necessary?

if (!TI.hasFeature("sme"))
  return;

if (!Callee || !isStreamingCompatible(Callee))

Can Caller be nullptr too?

if (const auto *FPT = Caller->getType()->getAs<FunctionProtoType>()) {
  unsigned SMEAttrs = FPT->getAArch64SMEAttributes();
  if (!(SMEAttrs & FunctionType::SME_PStateSMCompatibleMask)) {
    bool IsStreaming = SMEAttrs & FunctionType::SME_PStateSMEnabledMask;

nit:

Suggested change:
-bool IsStreaming = SMEAttrs & FunctionType::SME_PStateSMEnabledMask;
+bool CallerIsStreaming = SMEAttrs & FunctionType::SME_PStateSMEnabledMask;

@efriedma-quic (Collaborator)

> However, what we want to achieve with this is to make the state of the streaming mode known to potentially inlineable called functions, allowing for further inlining and identification/removal of dead code

It seems like we're in agreement that the llvm.assume doesn't actually add any information to the IR that the optimizer can't figure out itself. It sounds like the issue here is that you don't just need the call to be folded; you need the InlineCost evaluator to constant-fold the call? That makes things a little more complicated, but it still seems solvable...

llvm.assume handling is relatively expensive, so we don't want to scatter assume calls across emitted code when it isn't necessary.

@efriedma-quic (Collaborator)

Also, I'm not sure it's correct to mark int_aarch64_sme_in_streaming_mode IntrNoMem. That implies the result doesn't depend on any modifiable state, which isn't true because the result depends on whether the caller is in streaming mode. (I think inlining is suppressed for calls between streaming and non-streaming functions, but other interprocedural optimizations could get confused.)

@aemerson (Contributor) commented Jan 8, 2025

> > However, what we want to achieve with this is to make the state of the streaming mode known to potentially inlineable called functions, allowing for further inlining and identification/removal of dead code
>
> It seems like we're in agreement that the llvm.assume doesn't actually add any information to the IR that the optimizer can't figure out itself. It sounds like the issue here is that you don't just need the call to be folded; you need the InlineCost evaluator to constant-fold the call? That makes things a little more complicated, but it still seems solvable...
>
> llvm.assume handling is relatively expensive, so we don't want to scatter assume calls across emitted code when it isn't necessary.

This. Assume calls and emitting code before call sites seem like an awfully big hammer for this kind of problem. We should try to use existing mechanisms first, if we can.

@sdesmalen-arm (Collaborator)

The assumption cache mechanism is used by a number of passes, such as [partial] inlining, function specialization and IPSCCP (interprocedural sparse conditional constant propagation).

The idea behind doing this is to let optimizations iteratively apply knowledge about the streaming mode of the caller when analyzing or optimizing a callee. If we moved this functionality to, e.g., constant folding, we could only apply this knowledge to functions where the streaming mode is already known. If we combined the intrinsic in the instcombine pass, it would be combined only once, instead of iteratively by a pass that needs the information while e.g. inlining.

The idea of using llvm.assume is to give the compiler more knowledge when trying to inline or specialize a function. For example, if the callee has an if (__arm_in_streaming_mode()) { ... } branch, the cost model could infer that this is always/never executed, depending on the mode of the call site, which changes the cost/decision on whether or not to inline. IIRC, when it inlines, the IR cloner also simplifies the code based on the assumption cache.
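
To make the iterative point concrete, a hedged example (the names are invented for illustration): after outer is inlined, the same assume lets the branch inside inner fold as well — something a one-shot fold of the original call would not give you.

#include <arm_sme.h>

int inner(void) __arm_streaming_compatible {
  return __arm_in_streaming_mode() ? 1 : 0;
}

int outer(void) __arm_streaming_compatible {
  return inner() + 1; // mode still unknown here; nothing folds yet
}

// Only at a call site with a known mode does the chain collapse: with
// the assume emitted here, inlining outer (and then inner) lets the
// whole thing fold to "return 2".
int use(void) __arm_streaming { return outer(); }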

I understand that emitting this on a per-call basis is a bit of a hammer. I had expected/hoped that LLVM passes would hoist/CSE the multiple llvm.assume calls to reduce the number of these intrinsics (because the value of the condition is constant inside the function). Am I correct in thinking that emitting an llvm.assume once in the entry block of a function is not a problem?

@NickGuy-Arm (Contributor, Author)

> The assumption cache mechanism is used by a number of passes...

@efriedma-quic @aemerson ping

> I'm not sure it's correct to mark int_aarch64_sme_in_streaming_mode IntrNoMem...

That's fair; I had mistakenly assumed IntrNoMem to mean "I don't touch memory". I'll put up a separate PR to change it to use IntrReadMem shortly.

@efriedma-quic (Collaborator)

The two issues are sort of tied together: if the intrinsic isn't IntrNoMem, I suspect the assumption cache stops cooperating.

But anyway, I guess the question is really whether we want these constructs represented explicitly in IR, to try to leverage the AssumptionCache, or if we want to instead add a TTI hook for transform passes to query for the information in question. My default is not to use assumptions; llvm.assume calls still cause issues for optimizations, and they're not really necessary here. (I think the situation around llvm.assume has improved to some extent, but it's still a bit problematic.)

Even if we do decide we want assumptions, we likely want to add them in an IR pass, instead of in clang, so that other frontends don't need to duplicate this code.
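
Purely for illustration, the TTI hook floated here might look something like the following — no such interface exists in-tree, and the name and signature are invented:

// Hypothetical TargetTransformInfo query: the value of PSTATE.SM inside
// F if it is known at compile time, or std::nullopt when F is
// streaming-compatible and the mode can vary per call site.
std::optional<bool> getKnownStreamingMode(const Function &F) const;

// A transform pass could then fold the intrinsic without IR markers:
if (std::optional<bool> SM = TTI.getKnownStreamingMode(*II.getFunction()))
  II.replaceAllUsesWith(ConstantInt::getBool(II.getType(), *SM));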


Labels

backend:AArch64, clang:codegen (IR generation bugs: mangling, exceptions, etc.), clang (Clang issues not falling into any other category)


5 participants