[AArch64] Unrolling of loops with vector instructions. #147420
Conversation
@llvm/pr-subscribers-backend-aarch64 @llvm/pr-subscribers-llvm-transforms

Author: Ahmad Yasin (ayasin-a)

Changes

This patch permits loops with vector instructions to be unrolled. Today there is an early exit in getUnrollingPreferences() of AArch64 targets if a vector instruction is observed in any of the loop blocks. This patch fixes that so common loops like this one get a chance to be unrolled:

void saxpy(float *dst, const float *src, const float a, const int len) {
  float32x4_t *vdst = (float32x4_t *)dst;
  float32x4_t *vsrc = (float32x4_t *)src;
  float32x4_t vk = vdupq_n_f32(a);
  for (int i = 0; i < (len >> 2); i++) {
    vdst[i] = vaddq_f32(vdst[i], vmulq_f32(vsrc[i], vk));
  }
}

Auto-vectorized loops are still not unrolled, unless they were not interleaved when vectorized.

The provided test case shows the enhancement on top of Apple runtime unroll.

Full diff: https://github.com/llvm/llvm-project/pull/147420.diff

3 Files Affected:
- llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
- llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
- llvm/test/Transforms/LoopUnroll/AArch64/vector.ll (new test)
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 3f10da23b3494..673fb691cd603 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -4872,15 +4872,15 @@ void AArch64TTIImpl::getUnrollingPreferences(
// Disable partial & runtime unrolling on -Os.
UP.PartialOptSizeThreshold = 0;
+ // No need to unroll auto-vectorized loops that were interleaved
+ if (findStringMetadataForLoop(L, "llvm.loop.isvectorized") &&
+ findStringMetadataForLoop(L, "llvm.loop.interleave.count"))
+ return;
+
// Scan the loop: don't unroll loops with calls as this could prevent
- // inlining. Don't unroll vector loops either, as they don't benefit much from
- // unrolling.
+ // inlining.
for (auto *BB : L->getBlocks()) {
for (auto &I : *BB) {
- // Don't unroll vectorised loop.
- if (I.getType()->isVectorTy())
- return;
-
if (isa<CallBase>(I)) {
if (isa<CallInst>(I) || isa<InvokeInst>(I))
if (const Function *F = cast<CallBase>(I).getCalledFunction())
diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
index 0b9fee5727c6f..354633f837d45 100644
--- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
@@ -1172,7 +1172,8 @@ tryToUnrollLoop(Loop *L, DominatorTree &DT, LoopInfo *LI, ScalarEvolution &SE,
LLVM_DEBUG(dbgs() << "Loop Unroll: F["
<< L->getHeader()->getParent()->getName() << "] Loop %"
- << L->getHeader()->getName() << "\n");
+ << L->getHeader()->getName() << " Full=" << OnlyFullUnroll
+ << " Loc=" << L->getLocStr() << "\n");
TransformationMode TM = hasUnrollTransformation(L);
if (TM & TM_Disable)
return LoopUnrollResult::Unmodified;
@@ -1219,6 +1220,8 @@ tryToUnrollLoop(Loop *L, DominatorTree &DT, LoopInfo *LI, ScalarEvolution &SE,
ProvidedFullUnrollMaxCount);
TargetTransformInfo::PeelingPreferences PP = gatherPeelingPreferences(
L, SE, TTI, ProvidedAllowPeeling, ProvidedAllowProfileBasedPeeling, true);
+ LLVM_DEBUG(dbgs() << " UP.Partial=" << UP.Partial
+ << " UP.Runtime=" << UP.Runtime << "\n");
// Exit early if unrolling is disabled. For OptForSize, we pick the loop size
// as threshold later on.
diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/vector.ll b/llvm/test/Transforms/LoopUnroll/AArch64/vector.ll
new file mode 100644
index 0000000000000..dbde0df575472
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/AArch64/vector.ll
@@ -0,0 +1,131 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -p loop-unroll -mtriple=arm64-apple-macosx -mcpu=apple-m1 -S %s | FileCheck --check-prefix=APPLE %s
+define void @reverse(ptr %dst, ptr %src, i32 %len) {
+; APPLE-LABEL: define void @reverse(
+; APPLE-SAME: ptr [[DST:%.*]], ptr [[SRC:%.*]], i32 [[LEN:%.*]]) #[[ATTR0:[0-9]+]] {
+; APPLE-NEXT: [[ENTRY:.*:]]
+; APPLE-NEXT: [[SHR:%.*]] = ashr i32 [[LEN]], 2
+; APPLE-NEXT: [[CMP7:%.*]] = icmp sgt i32 [[SHR]], 0
+; APPLE-NEXT: br i1 [[CMP7]], label %[[FOR_BODY_PREHEADER:.*]], label %[[FOR_COND_CLEANUP:.*]]
+; APPLE: [[FOR_BODY_PREHEADER]]:
+; APPLE-NEXT: [[TMP0:%.*]] = zext nneg i32 [[SHR]] to i64
+; APPLE-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[SHR]] to i64
+; APPLE-NEXT: [[TMP5:%.*]] = add nsw i64 [[WIDE_TRIP_COUNT]], -1
+; APPLE-NEXT: [[XTRAITER:%.*]] = and i64 [[WIDE_TRIP_COUNT]], 7
+; APPLE-NEXT: [[TMP6:%.*]] = icmp ult i64 [[TMP5]], 7
+; APPLE-NEXT: br i1 [[TMP6]], label %[[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA:.*]], label %[[FOR_BODY_PREHEADER_NEW:.*]]
+; APPLE: [[FOR_BODY_PREHEADER_NEW]]:
+; APPLE-NEXT: [[UNROLL_ITER:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[XTRAITER]]
+; APPLE-NEXT: br label %[[FOR_BODY:.*]]
+; APPLE: [[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA_LOOPEXIT:.*]]:
+; APPLE-NEXT: [[INDVARS_IV_UNR_PH:%.*]] = phi i64 [ [[INDVARS_IV_NEXT_7:%.*]], %[[FOR_BODY]] ]
+; APPLE-NEXT: br label %[[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA]]
+; APPLE: [[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA]]:
+; APPLE-NEXT: [[INDVARS_IV_UNR:%.*]] = phi i64 [ 0, %[[FOR_BODY_PREHEADER]] ], [ [[INDVARS_IV_UNR_PH]], %[[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA_LOOPEXIT]] ]
+; APPLE-NEXT: [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; APPLE-NEXT: br i1 [[LCMP_MOD]], label %[[FOR_BODY_EPIL_PREHEADER:.*]], label %[[FOR_COND_CLEANUP_LOOPEXIT:.*]]
+; APPLE: [[FOR_BODY_EPIL_PREHEADER]]:
+; APPLE-NEXT: br label %[[FOR_BODY_EPIL:.*]]
+; APPLE: [[FOR_BODY_EPIL]]:
+; APPLE-NEXT: [[INDVARS_IV_EPIL:%.*]] = phi i64 [ [[INDVARS_IV_UNR]], %[[FOR_BODY_EPIL_PREHEADER]] ], [ [[INDVARS_IV_NEXT_EPIL:%.*]], %[[FOR_BODY_EPIL]] ]
+; APPLE-NEXT: [[EPIL_ITER:%.*]] = phi i64 [ 0, %[[FOR_BODY_EPIL_PREHEADER]] ], [ [[EPIL_ITER_NEXT:%.*]], %[[FOR_BODY_EPIL]] ]
+; APPLE-NEXT: [[TMP3:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_EPIL]]
+; APPLE-NEXT: [[ARRAYIDX_EPIL:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP3]]
+; APPLE-NEXT: [[TMP4:%.*]] = load <4 x float>, ptr [[ARRAYIDX_EPIL]], align 16
+; APPLE-NEXT: [[ARRAYIDX2_EPIL:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_EPIL]]
+; APPLE-NEXT: store <4 x float> [[TMP4]], ptr [[ARRAYIDX2_EPIL]], align 16
+; APPLE-NEXT: [[INDVARS_IV_NEXT_EPIL]] = add nuw nsw i64 [[INDVARS_IV_EPIL]], 1
+; APPLE-NEXT: [[EXITCOND_NOT_EPIL:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT_EPIL]], [[WIDE_TRIP_COUNT]]
+; APPLE-NEXT: [[EPIL_ITER_NEXT]] = add i64 [[EPIL_ITER]], 1
+; APPLE-NEXT: [[EPIL_ITER_CMP:%.*]] = icmp ne i64 [[EPIL_ITER_NEXT]], [[XTRAITER]]
+; APPLE-NEXT: br i1 [[EPIL_ITER_CMP]], label %[[FOR_BODY_EPIL]], label %[[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA:.*]], !llvm.loop [[LOOP0:![0-9]+]]
+; APPLE: [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA]]:
+; APPLE-NEXT: br label %[[FOR_COND_CLEANUP_LOOPEXIT]]
+; APPLE: [[FOR_COND_CLEANUP_LOOPEXIT]]:
+; APPLE-NEXT: br label %[[FOR_COND_CLEANUP]]
+; APPLE: [[FOR_COND_CLEANUP]]:
+; APPLE-NEXT: ret void
+; APPLE: [[FOR_BODY]]:
+; APPLE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, %[[FOR_BODY_PREHEADER_NEW]] ], [ [[INDVARS_IV_NEXT_7]], %[[FOR_BODY]] ]
+; APPLE-NEXT: [[NITER:%.*]] = phi i64 [ 0, %[[FOR_BODY_PREHEADER_NEW]] ], [ [[NITER_NEXT_7:%.*]], %[[FOR_BODY]] ]
+; APPLE-NEXT: [[TMP1:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV]]
+; APPLE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP1]]
+; APPLE-NEXT: [[TMP2:%.*]] = load <4 x float>, ptr [[ARRAYIDX]], align 16
+; APPLE-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV]]
+; APPLE-NEXT: store <4 x float> [[TMP2]], ptr [[ARRAYIDX2]], align 16
+; APPLE-NEXT: [[INDVARS_IV_NEXT:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; APPLE-NEXT: [[TMP7:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT]]
+; APPLE-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP7]]
+; APPLE-NEXT: [[TMP8:%.*]] = load <4 x float>, ptr [[ARRAYIDX_1]], align 16
+; APPLE-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT]]
+; APPLE-NEXT: store <4 x float> [[TMP8]], ptr [[ARRAYIDX2_1]], align 16
+; APPLE-NEXT: [[INDVARS_IV_NEXT_1:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
+; APPLE-NEXT: [[TMP9:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_1]]
+; APPLE-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP9]]
+; APPLE-NEXT: [[TMP10:%.*]] = load <4 x float>, ptr [[ARRAYIDX_2]], align 16
+; APPLE-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_1]]
+; APPLE-NEXT: store <4 x float> [[TMP10]], ptr [[ARRAYIDX2_2]], align 16
+; APPLE-NEXT: [[INDVARS_IV_NEXT_2:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
+; APPLE-NEXT: [[TMP11:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_2]]
+; APPLE-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP11]]
+; APPLE-NEXT: [[TMP12:%.*]] = load <4 x float>, ptr [[ARRAYIDX_3]], align 16
+; APPLE-NEXT: [[ARRAYIDX2_3:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_2]]
+; APPLE-NEXT: store <4 x float> [[TMP12]], ptr [[ARRAYIDX2_3]], align 16
+; APPLE-NEXT: [[INDVARS_IV_NEXT_3:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 4
+; APPLE-NEXT: [[TMP13:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_3]]
+; APPLE-NEXT: [[ARRAYIDX_4:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP13]]
+; APPLE-NEXT: [[TMP14:%.*]] = load <4 x float>, ptr [[ARRAYIDX_4]], align 16
+; APPLE-NEXT: [[ARRAYIDX2_4:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_3]]
+; APPLE-NEXT: store <4 x float> [[TMP14]], ptr [[ARRAYIDX2_4]], align 16
+; APPLE-NEXT: [[INDVARS_IV_NEXT_4:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 5
+; APPLE-NEXT: [[TMP15:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_4]]
+; APPLE-NEXT: [[ARRAYIDX_5:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP15]]
+; APPLE-NEXT: [[TMP16:%.*]] = load <4 x float>, ptr [[ARRAYIDX_5]], align 16
+; APPLE-NEXT: [[ARRAYIDX2_5:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_4]]
+; APPLE-NEXT: store <4 x float> [[TMP16]], ptr [[ARRAYIDX2_5]], align 16
+; APPLE-NEXT: [[INDVARS_IV_NEXT_5:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 6
+; APPLE-NEXT: [[TMP17:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_5]]
+; APPLE-NEXT: [[ARRAYIDX_6:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP17]]
+; APPLE-NEXT: [[TMP18:%.*]] = load <4 x float>, ptr [[ARRAYIDX_6]], align 16
+; APPLE-NEXT: [[ARRAYIDX2_6:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_5]]
+; APPLE-NEXT: store <4 x float> [[TMP18]], ptr [[ARRAYIDX2_6]], align 16
+; APPLE-NEXT: [[INDVARS_IV_NEXT_6:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 7
+; APPLE-NEXT: [[TMP19:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_6]]
+; APPLE-NEXT: [[ARRAYIDX_7:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP19]]
+; APPLE-NEXT: [[TMP20:%.*]] = load <4 x float>, ptr [[ARRAYIDX_7]], align 16
+; APPLE-NEXT: [[ARRAYIDX2_7:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_6]]
+; APPLE-NEXT: store <4 x float> [[TMP20]], ptr [[ARRAYIDX2_7]], align 16
+; APPLE-NEXT: [[INDVARS_IV_NEXT_7]] = add nuw nsw i64 [[INDVARS_IV]], 8
+; APPLE-NEXT: [[NITER_NEXT_7]] = add i64 [[NITER]], 8
+; APPLE-NEXT: [[NITER_NCMP_7:%.*]] = icmp eq i64 [[NITER_NEXT_7]], [[UNROLL_ITER]]
+; APPLE-NEXT: br i1 [[NITER_NCMP_7]], label %[[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA_LOOPEXIT]], label %[[FOR_BODY]]
+;
+entry:
+ %shr = ashr i32 %len, 2
+ %cmp7 = icmp sgt i32 %shr, 0
+ br i1 %cmp7, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader: ; preds = %entry
+ %0 = zext nneg i32 %shr to i64
+ %wide.trip.count = zext nneg i32 %shr to i64
+ br label %for.body
+
+for.cond.cleanup: ; preds = %for.body, %entry
+ ret void
+
+for.body: ; preds = %for.body.preheader, %for.body
+ %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
+ %1 = sub nsw i64 %0, %indvars.iv
+ %arrayidx = getelementptr inbounds <4 x float>, ptr %src, i64 %1
+ %2 = load <4 x float>, ptr %arrayidx, align 16
+ %arrayidx2 = getelementptr inbounds nuw <4 x float>, ptr %dst, i64 %indvars.iv
+ store <4 x float> %2, ptr %arrayidx2, align 16
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
+ br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
+;.
+; APPLE: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]]}
+; APPLE: [[META1]] = !{!"llvm.loop.unroll.disable"}
+;.
+
fhahn left a comment:
Thanks for the patch. As-is it will impact any AArch64 CPU, so I added a few more reviewers.
If it is not beneficial for other CPUs, we can also limit it to specific CPUs only.
It also looks like there are a few test failures in the pre-commit checks, which need checking.
I think this sounds OK. The reason we didn't allow loops with vector instructions to unroll in the past is that they are usually either produced by the loop vectorizer, which has already had the chance to "interleave" them, or they come from the user through intrinsics, and users often unroll manually to fit into a specific number of registers. (That might not always be the case, though.) Provided we don't over-unroll them, it sounds like it should be OK.
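To make the intrinsics case above concrete, here is a minimal sketch (not taken from the patch or this discussion; the function and variable names are hypothetical) of a hand-vectorized NEON loop whose author already unrolled it by hand to keep a fixed number of vector registers in flight:

#include <arm_neon.h>

// Hand-vectorized and manually unrolled by 2: two independent <4 x float>
// values are kept live per iteration, by the author's deliberate choice.
void scale_unrolled(float *dst, const float *src, float a, int len) {
  float32x4_t vk = vdupq_n_f32(a);
  int i = 0;
  for (; i + 8 <= len; i += 8) {
    float32x4_t v0 = vld1q_f32(src + i);
    float32x4_t v1 = vld1q_f32(src + i + 4);
    vst1q_f32(dst + i, vmulq_f32(v0, vk));
    vst1q_f32(dst + i + 4, vmulq_f32(v1, vk));
  }
  for (; i < len; ++i) // scalar remainder
    dst[i] = src[i] * a;
}

Unrolling such a loop much further in the backend can defeat the register budgeting the author had in mind, which is why the comment above asks that we not over-unroll.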
fhahn left a comment:
LGTM, but unused GENERIC check lines still need to be dropped.
Fixed.
[AArch64] Unrolling of loops with vector instructions. (#147420)
This patch permits loops with vector instructions to be unrolled.
Today there is an early exit in `getUnrollingPreferences()` of AArch64
targets if a vector instruction is observed in any of the loop blocks.
This patch fixes that so common loops like this one get a chance to be
unrolled:
void saxpy(float *dst, const float *src, const float a, const int len) {
  float32x4_t *vdst = (float32x4_t *)dst;
  float32x4_t *vsrc = (float32x4_t *)src;
  float32x4_t vk = vdupq_n_f32(a);
  for (int i = 0; i < (len >> 2); i++) {
    vdst[i] = vaddq_f32(vdst[i], vmulq_f32(vsrc[i], vk));
  }
}
Auto-vectorized loops are still not unrolled, unless they were not
interleaved when vectorized.
The provided test case shows the enhancement on top of runtime/partial
unrolling, depending on the CPU.
PR: llvm/llvm-project#147420
It looks like this patch now also prevents the unroll of the original scalar loop when the alias/iter checks fail. Seems like all the variants of the loop share the same llvm.loop.isvectorized metadata. See example: https://godbolt.org/z/nn1fh67ns
Thanks for reporting this @UsmanNadeem. I examined your example on macOS. Here is what happens:
If you are interested in a particular mtriple/mcpu I may be able to have a look. Also it would be nice to expand on why the scalar version of the loop needs to be highly optimized once the primary loop is auto-vectorized.
#147420 changed the unrolling preferences to permit unrolling of non-auto-vectorized loops by checking for the isvectorized attribute. However, when a loop is vectorized this attribute is put on both the vector loop and the scalar epilogue, so that change prevented the scalar epilogue from being unrolled. Restore the previous behaviour of unrolling the scalar epilogue by checking both for the isvectorized attribute and for vector instructions in the loop.
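A minimal sketch of the behaviour described above, reusing only the helpers that already appear in the diff earlier on this page (findStringMetadataForLoop, Loop::getBlocks, Type::isVectorTy); the helper name is hypothetical and the actual follow-up patch may be structured differently:

// Hypothetical helper for getUnrollingPreferences(): skip unrolling only for
// loops that both carry the vectorizer's metadata and actually contain
// vector instructions. The scalar epilogue of a vectorized loop carries the
// metadata but holds no vector instructions, so it can still be unrolled.
static bool skipUnrollForVectorizedLoop(const Loop *L) {
  if (!findStringMetadataForLoop(L, "llvm.loop.isvectorized"))
    return false; // not produced by the loop vectorizer
  for (BasicBlock *BB : L->getBlocks())
    for (Instruction &I : *BB)
      if (I.getType()->isVectorTy())
        return true; // vectorized body: leave interleaving to the vectorizer
  return false; // scalar remainder: give loop unroll a chance
}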
Adjust the unrolling preferences to unroll hand-vectorized code, as well as the scalar remainder of a vectorized loop. Inspired by a similar effort in AArch64: see llvm#147420 and llvm#151164.