
Conversation

@ayasin-a
Contributor

@ayasin-a ayasin-a commented Jul 7, 2025

This patch permits loops with vector instructions to be unrolled.

Today there is an early exit in the AArch64 implementation of getUnrollingPreferences() whenever a vector instruction is observed in any of the loop's blocks. This patch removes that bail-out so that common loops like the one below get a chance to be unrolled:

#include <arm_neon.h>  // for float32x4_t, vdupq_n_f32, vaddq_f32, vmulq_f32

void saxpy (float * dst, const float * src, const float a, const int len) {
    float32x4_t * vdst = (float32x4_t *)dst;
    float32x4_t * vsrc = (float32x4_t *)src;
    float32x4_t vk = vdupq_n_f32(a);
    for (int i = 0; i < (len >> 2); i++)
    {
        vdst[i] = vaddq_f32(vdst[i], vmulq_f32(vsrc[i], vk));
    }
}

Auto-vectorized loops are still not unrolled unless they were vectorized without interleaving.

The provided test case shows the enhancement on top of Apple runtime unroll.

@github-actions

github-actions bot commented Jul 7, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment, using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR with a comment saying "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Jul 7, 2025

@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-llvm-transforms

Author: Ahmad Yasin (ayasin-a)


Full diff: https://github.com/llvm/llvm-project/pull/147420.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+6-6)
  • (modified) llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (+4-1)
  • (added) llvm/test/Transforms/LoopUnroll/AArch64/vector.ll (+131)
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 3f10da23b3494..673fb691cd603 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -4872,15 +4872,15 @@ void AArch64TTIImpl::getUnrollingPreferences(
   // Disable partial & runtime unrolling on -Os.
   UP.PartialOptSizeThreshold = 0;
 
+  // No need to unroll auto-vectorized loops that were interleaved
+  if (findStringMetadataForLoop(L, "llvm.loop.isvectorized") &&
+      findStringMetadataForLoop(L, "llvm.loop.interleave.count"))
+    return;
+
   // Scan the loop: don't unroll loops with calls as this could prevent
-  // inlining. Don't unroll vector loops either, as they don't benefit much from
-  // unrolling.
+  // inlining.
   for (auto *BB : L->getBlocks()) {
     for (auto &I : *BB) {
-      // Don't unroll vectorised loop.
-      if (I.getType()->isVectorTy())
-        return;
-
       if (isa<CallBase>(I)) {
         if (isa<CallInst>(I) || isa<InvokeInst>(I))
           if (const Function *F = cast<CallBase>(I).getCalledFunction())
diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
index 0b9fee5727c6f..354633f837d45 100644
--- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
@@ -1172,7 +1172,8 @@ tryToUnrollLoop(Loop *L, DominatorTree &DT, LoopInfo *LI, ScalarEvolution &SE,
 
   LLVM_DEBUG(dbgs() << "Loop Unroll: F["
                     << L->getHeader()->getParent()->getName() << "] Loop %"
-                    << L->getHeader()->getName() << "\n");
+                    << L->getHeader()->getName() << " Full=" << OnlyFullUnroll
+                    << " Loc=" << L->getLocStr() << "\n");
   TransformationMode TM = hasUnrollTransformation(L);
   if (TM & TM_Disable)
     return LoopUnrollResult::Unmodified;
@@ -1219,6 +1220,8 @@ tryToUnrollLoop(Loop *L, DominatorTree &DT, LoopInfo *LI, ScalarEvolution &SE,
       ProvidedFullUnrollMaxCount);
   TargetTransformInfo::PeelingPreferences PP = gatherPeelingPreferences(
       L, SE, TTI, ProvidedAllowPeeling, ProvidedAllowProfileBasedPeeling, true);
+  LLVM_DEBUG(dbgs() << "  UP.Partial=" << UP.Partial
+                    << " UP.Runtime=" << UP.Runtime << "\n");
 
   // Exit early if unrolling is disabled. For OptForSize, we pick the loop size
   // as threshold later on.
diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/vector.ll b/llvm/test/Transforms/LoopUnroll/AArch64/vector.ll
new file mode 100644
index 0000000000000..dbde0df575472
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/AArch64/vector.ll
@@ -0,0 +1,131 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -p loop-unroll -mtriple=arm64-apple-macosx -mcpu=apple-m1 -S %s | FileCheck --check-prefix=APPLE %s
+define void @reverse(ptr %dst, ptr %src, i32 %len) {
+; APPLE-LABEL: define void @reverse(
+; APPLE-SAME: ptr [[DST:%.*]], ptr [[SRC:%.*]], i32 [[LEN:%.*]]) #[[ATTR0:[0-9]+]] {
+; APPLE-NEXT:  [[ENTRY:.*:]]
+; APPLE-NEXT:    [[SHR:%.*]] = ashr i32 [[LEN]], 2
+; APPLE-NEXT:    [[CMP7:%.*]] = icmp sgt i32 [[SHR]], 0
+; APPLE-NEXT:    br i1 [[CMP7]], label %[[FOR_BODY_PREHEADER:.*]], label %[[FOR_COND_CLEANUP:.*]]
+; APPLE:       [[FOR_BODY_PREHEADER]]:
+; APPLE-NEXT:    [[TMP0:%.*]] = zext nneg i32 [[SHR]] to i64
+; APPLE-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[SHR]] to i64
+; APPLE-NEXT:    [[TMP5:%.*]] = add nsw i64 [[WIDE_TRIP_COUNT]], -1
+; APPLE-NEXT:    [[XTRAITER:%.*]] = and i64 [[WIDE_TRIP_COUNT]], 7
+; APPLE-NEXT:    [[TMP6:%.*]] = icmp ult i64 [[TMP5]], 7
+; APPLE-NEXT:    br i1 [[TMP6]], label %[[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA:.*]], label %[[FOR_BODY_PREHEADER_NEW:.*]]
+; APPLE:       [[FOR_BODY_PREHEADER_NEW]]:
+; APPLE-NEXT:    [[UNROLL_ITER:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[XTRAITER]]
+; APPLE-NEXT:    br label %[[FOR_BODY:.*]]
+; APPLE:       [[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA_LOOPEXIT:.*]]:
+; APPLE-NEXT:    [[INDVARS_IV_UNR_PH:%.*]] = phi i64 [ [[INDVARS_IV_NEXT_7:%.*]], %[[FOR_BODY]] ]
+; APPLE-NEXT:    br label %[[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA]]
+; APPLE:       [[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA]]:
+; APPLE-NEXT:    [[INDVARS_IV_UNR:%.*]] = phi i64 [ 0, %[[FOR_BODY_PREHEADER]] ], [ [[INDVARS_IV_UNR_PH]], %[[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA_LOOPEXIT]] ]
+; APPLE-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; APPLE-NEXT:    br i1 [[LCMP_MOD]], label %[[FOR_BODY_EPIL_PREHEADER:.*]], label %[[FOR_COND_CLEANUP_LOOPEXIT:.*]]
+; APPLE:       [[FOR_BODY_EPIL_PREHEADER]]:
+; APPLE-NEXT:    br label %[[FOR_BODY_EPIL:.*]]
+; APPLE:       [[FOR_BODY_EPIL]]:
+; APPLE-NEXT:    [[INDVARS_IV_EPIL:%.*]] = phi i64 [ [[INDVARS_IV_UNR]], %[[FOR_BODY_EPIL_PREHEADER]] ], [ [[INDVARS_IV_NEXT_EPIL:%.*]], %[[FOR_BODY_EPIL]] ]
+; APPLE-NEXT:    [[EPIL_ITER:%.*]] = phi i64 [ 0, %[[FOR_BODY_EPIL_PREHEADER]] ], [ [[EPIL_ITER_NEXT:%.*]], %[[FOR_BODY_EPIL]] ]
+; APPLE-NEXT:    [[TMP3:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_EPIL]]
+; APPLE-NEXT:    [[ARRAYIDX_EPIL:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP3]]
+; APPLE-NEXT:    [[TMP4:%.*]] = load <4 x float>, ptr [[ARRAYIDX_EPIL]], align 16
+; APPLE-NEXT:    [[ARRAYIDX2_EPIL:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_EPIL]]
+; APPLE-NEXT:    store <4 x float> [[TMP4]], ptr [[ARRAYIDX2_EPIL]], align 16
+; APPLE-NEXT:    [[INDVARS_IV_NEXT_EPIL]] = add nuw nsw i64 [[INDVARS_IV_EPIL]], 1
+; APPLE-NEXT:    [[EXITCOND_NOT_EPIL:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT_EPIL]], [[WIDE_TRIP_COUNT]]
+; APPLE-NEXT:    [[EPIL_ITER_NEXT]] = add i64 [[EPIL_ITER]], 1
+; APPLE-NEXT:    [[EPIL_ITER_CMP:%.*]] = icmp ne i64 [[EPIL_ITER_NEXT]], [[XTRAITER]]
+; APPLE-NEXT:    br i1 [[EPIL_ITER_CMP]], label %[[FOR_BODY_EPIL]], label %[[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA:.*]], !llvm.loop [[LOOP0:![0-9]+]]
+; APPLE:       [[FOR_COND_CLEANUP_LOOPEXIT_EPILOG_LCSSA]]:
+; APPLE-NEXT:    br label %[[FOR_COND_CLEANUP_LOOPEXIT]]
+; APPLE:       [[FOR_COND_CLEANUP_LOOPEXIT]]:
+; APPLE-NEXT:    br label %[[FOR_COND_CLEANUP]]
+; APPLE:       [[FOR_COND_CLEANUP]]:
+; APPLE-NEXT:    ret void
+; APPLE:       [[FOR_BODY]]:
+; APPLE-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ 0, %[[FOR_BODY_PREHEADER_NEW]] ], [ [[INDVARS_IV_NEXT_7]], %[[FOR_BODY]] ]
+; APPLE-NEXT:    [[NITER:%.*]] = phi i64 [ 0, %[[FOR_BODY_PREHEADER_NEW]] ], [ [[NITER_NEXT_7:%.*]], %[[FOR_BODY]] ]
+; APPLE-NEXT:    [[TMP1:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV]]
+; APPLE-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP1]]
+; APPLE-NEXT:    [[TMP2:%.*]] = load <4 x float>, ptr [[ARRAYIDX]], align 16
+; APPLE-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV]]
+; APPLE-NEXT:    store <4 x float> [[TMP2]], ptr [[ARRAYIDX2]], align 16
+; APPLE-NEXT:    [[INDVARS_IV_NEXT:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; APPLE-NEXT:    [[TMP7:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT]]
+; APPLE-NEXT:    [[ARRAYIDX_1:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP7]]
+; APPLE-NEXT:    [[TMP8:%.*]] = load <4 x float>, ptr [[ARRAYIDX_1]], align 16
+; APPLE-NEXT:    [[ARRAYIDX2_1:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT]]
+; APPLE-NEXT:    store <4 x float> [[TMP8]], ptr [[ARRAYIDX2_1]], align 16
+; APPLE-NEXT:    [[INDVARS_IV_NEXT_1:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
+; APPLE-NEXT:    [[TMP9:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_1]]
+; APPLE-NEXT:    [[ARRAYIDX_2:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP9]]
+; APPLE-NEXT:    [[TMP10:%.*]] = load <4 x float>, ptr [[ARRAYIDX_2]], align 16
+; APPLE-NEXT:    [[ARRAYIDX2_2:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_1]]
+; APPLE-NEXT:    store <4 x float> [[TMP10]], ptr [[ARRAYIDX2_2]], align 16
+; APPLE-NEXT:    [[INDVARS_IV_NEXT_2:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
+; APPLE-NEXT:    [[TMP11:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_2]]
+; APPLE-NEXT:    [[ARRAYIDX_3:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP11]]
+; APPLE-NEXT:    [[TMP12:%.*]] = load <4 x float>, ptr [[ARRAYIDX_3]], align 16
+; APPLE-NEXT:    [[ARRAYIDX2_3:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_2]]
+; APPLE-NEXT:    store <4 x float> [[TMP12]], ptr [[ARRAYIDX2_3]], align 16
+; APPLE-NEXT:    [[INDVARS_IV_NEXT_3:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 4
+; APPLE-NEXT:    [[TMP13:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_3]]
+; APPLE-NEXT:    [[ARRAYIDX_4:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP13]]
+; APPLE-NEXT:    [[TMP14:%.*]] = load <4 x float>, ptr [[ARRAYIDX_4]], align 16
+; APPLE-NEXT:    [[ARRAYIDX2_4:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_3]]
+; APPLE-NEXT:    store <4 x float> [[TMP14]], ptr [[ARRAYIDX2_4]], align 16
+; APPLE-NEXT:    [[INDVARS_IV_NEXT_4:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 5
+; APPLE-NEXT:    [[TMP15:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_4]]
+; APPLE-NEXT:    [[ARRAYIDX_5:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP15]]
+; APPLE-NEXT:    [[TMP16:%.*]] = load <4 x float>, ptr [[ARRAYIDX_5]], align 16
+; APPLE-NEXT:    [[ARRAYIDX2_5:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_4]]
+; APPLE-NEXT:    store <4 x float> [[TMP16]], ptr [[ARRAYIDX2_5]], align 16
+; APPLE-NEXT:    [[INDVARS_IV_NEXT_5:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 6
+; APPLE-NEXT:    [[TMP17:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_5]]
+; APPLE-NEXT:    [[ARRAYIDX_6:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP17]]
+; APPLE-NEXT:    [[TMP18:%.*]] = load <4 x float>, ptr [[ARRAYIDX_6]], align 16
+; APPLE-NEXT:    [[ARRAYIDX2_6:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_5]]
+; APPLE-NEXT:    store <4 x float> [[TMP18]], ptr [[ARRAYIDX2_6]], align 16
+; APPLE-NEXT:    [[INDVARS_IV_NEXT_6:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 7
+; APPLE-NEXT:    [[TMP19:%.*]] = sub nsw i64 [[TMP0]], [[INDVARS_IV_NEXT_6]]
+; APPLE-NEXT:    [[ARRAYIDX_7:%.*]] = getelementptr inbounds <4 x float>, ptr [[SRC]], i64 [[TMP19]]
+; APPLE-NEXT:    [[TMP20:%.*]] = load <4 x float>, ptr [[ARRAYIDX_7]], align 16
+; APPLE-NEXT:    [[ARRAYIDX2_7:%.*]] = getelementptr inbounds nuw <4 x float>, ptr [[DST]], i64 [[INDVARS_IV_NEXT_6]]
+; APPLE-NEXT:    store <4 x float> [[TMP20]], ptr [[ARRAYIDX2_7]], align 16
+; APPLE-NEXT:    [[INDVARS_IV_NEXT_7]] = add nuw nsw i64 [[INDVARS_IV]], 8
+; APPLE-NEXT:    [[NITER_NEXT_7]] = add i64 [[NITER]], 8
+; APPLE-NEXT:    [[NITER_NCMP_7:%.*]] = icmp eq i64 [[NITER_NEXT_7]], [[UNROLL_ITER]]
+; APPLE-NEXT:    br i1 [[NITER_NCMP_7]], label %[[FOR_COND_CLEANUP_LOOPEXIT_UNR_LCSSA_LOOPEXIT]], label %[[FOR_BODY]]
+;
+entry:
+  %shr = ashr i32 %len, 2
+  %cmp7 = icmp sgt i32 %shr, 0
+  br i1 %cmp7, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader:                               ; preds = %entry
+  %0 = zext nneg i32 %shr to i64
+  %wide.trip.count = zext nneg i32 %shr to i64
+  br label %for.body
+
+for.cond.cleanup:                                 ; preds = %for.body, %entry
+  ret void
+
+for.body:                                         ; preds = %for.body.preheader, %for.body
+  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
+  %1 = sub nsw i64 %0, %indvars.iv
+  %arrayidx = getelementptr inbounds <4 x float>, ptr %src, i64 %1
+  %2 = load <4 x float>, ptr %arrayidx, align 16
+  %arrayidx2 = getelementptr inbounds nuw <4 x float>, ptr %dst, i64 %indvars.iv
+  store <4 x float> %2, ptr %arrayidx2, align 16
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
+;.
+; APPLE: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]]}
+; APPLE: [[META1]] = !{!"llvm.loop.unroll.disable"}
+;.
+

@jcohen-apple jcohen-apple requested review from aemerson and fhahn July 8, 2025 06:23
@fhahn fhahn requested review from davemgreen and sjoerdmeijer July 8, 2025 10:51
@fhahn fhahn changed the title from "Unrolling of loops with vector instructions" to "[AArch64] Unrolling of loops with vector instructions." Jul 8, 2025
Contributor

@fhahn fhahn left a comment


Thanks for the patch. As-is, it will impact all AArch64 CPUs; I added a few more reviewers.

If it turns out not to be beneficial for other CPUs, we can limit it to specific CPUs.
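
For illustration, limiting the new behaviour to specific CPUs could look roughly like the sketch below. It assumes the existing AArch64Subtarget::getProcFamily() query; the processor families listed are placeholders, not measured recommendations.

// Sketch only: gate the vector-loop unrolling path on selected CPUs.
// getProcFamily() is an existing AArch64Subtarget query; the families
// listed here are placeholders rather than a tuning recommendation.
#include "AArch64Subtarget.h"

static bool shouldUnrollVectorLoops(const AArch64Subtarget *ST) {
  switch (ST->getProcFamily()) {
  case AArch64Subtarget::AppleA14:
  case AArch64Subtarget::AppleA15:
  case AArch64Subtarget::AppleA16:
    return true;   // CPUs assumed (or measured) to benefit.
  default:
    return false;  // Keep the previous conservative behaviour elsewhere.
  }
}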

@fhahn
Contributor

fhahn commented Jul 8, 2025

It also looks like there are a few test failures in the pre-commit checks, which need checking:

2025-07-08T07:01:02.7480047Z Failed Tests (3):
2025-07-08T07:01:02.7480307Z   LLVM :: Transforms/LoopUnroll/full-unroll-avoid-partial.ll
2025-07-08T07:01:02.7480677Z   LLVM :: Transforms/LoopUnroll/gh-issue77118-broken-lcssa-form.ll
2025-07-08T07:01:02.7481083Z   LLVM :: Transforms/LoopUnroll/guard-cost-for-unrolling.ll

@davemgreen
Collaborator

I think this sounds OK. The reason we didn't allow loops with vector instructions to unroll in the past is that they are usually either produced by the loop vectorizer, which has already had the chance to "interleave" them, or they come from the user through intrinsics, and users often unroll manually to fit into a specific number of registers. (That might not always be the case, though.)

Provided we don't over-unroll them, it sounds like it should be OK.
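
On the over-unrolling concern: the target hook already communicates limits through TargetTransformInfo::UnrollingPreferences, so a cap can be expressed there. A hedged sketch with illustrative, untuned numbers:

// Illustrative only: cap how far loops containing vector instructions
// may be unrolled. The fields are part of UnrollingPreferences; the
// constants are made-up placeholders, not tuned values.
#include "llvm/Analysis/TargetTransformInfo.h"

static void capVectorLoopUnrolling(
    llvm::TargetTransformInfo::UnrollingPreferences &UP) {
  UP.MaxCount = 4;                  // Never unroll by more than 4x.
  UP.DefaultUnrollRuntimeCount = 4; // Default factor for runtime unrolling.
  UP.PartialThreshold = 80;         // Modest size budget for partial unrolling.
}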

Contributor

@fhahn fhahn left a comment


LGTM, but unused GENERIC check lines still need to be dropped.

@ayasin-a
Contributor Author

LGTM, but unused GENERIC check lines still need to be dropped.

Fixed.

@fhahn fhahn merged commit 671072e into llvm:main Jul 14, 2025
5 checks passed
@github-actions

@ayasin-a Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jul 14, 2025
…47420)

This patch permits loops with vector instructions to be unrolled.

Today there is an early exit in `getUnrollingPreferences()` of AArch64
targets if a vector instruction is observed in any of the loop blocks.
This patch fixes that so common loops like this one get a chance to be
unrolled:

    void saxpy (float * dst, const float * src, const float a, const int len) {
        float32x4_t * vdst = (float32x4_t *)dst;
        float32x4_t * vsrc = (float32x4_t *)src;
        float32x4_t vk = vdupq_n_f32(a);
        for (int i = 0; i < (len >> 2); i++)
        {
            vdst[i] = vaddq_f32(vdst[i], vmulq_f32(vsrc[i], vk));
        }
    }

Auto-vectorized loops are still not unrolled, unless they were not
interleaved when vectorized.

The provided test case shows the enhancement on top of runtime/partial
unrolling, depending on the CPU.

PR: llvm/llvm-project#147420
@UsmanNadeem
Contributor

It looks like this patch now also prevents unrolling of the original scalar loop when the alias/iteration checks fail. It seems that all the variants of the loop share the "llvm.loop.isvectorized" metadata, so none of the variants will be unrolled.

See example: https://godbolt.org/z/nn1fh67ns

@ayasin-a
Contributor Author

It looks like this patch now also prevents the unroll of the original scalar loop when the alias/iter checks fail. Seems like all the variants of the loop share the "llvm.loop.isvectorized" metadata so none of the variants will be unrolled.

See example: https://godbolt.org/z/nn1fh67ns

Thanks for reporting this, @UsmanNadeem. I examined your example on macOS. Here is what happens:

  1. On the baseline compiler, before my change, the scalar version of your loop is also not unrolled, due to its constant trip count.
  2. This is maintained after my change.
  3. When I switch your test loop to a variable trip count, the baseline compiler still does not unroll it, due to the same auto-vectorized metadata check inside getAppleRuntimeUnrollPreferences().
  4. The same behavior as in 3 is maintained in the new compiler. It is true that we hoisted that check into the generic AArch64 code, but unrolling would be skipped at a later stage on my system anyway.

If you are interested in a particular mtriple/mcpu, I may be able to have a look.

Also, it would be nice to expand on why the scalar version of the loop needs to be highly optimized once the primary loop is auto-vectorized.

john-brawn-arm added a commit that referenced this pull request Jul 31, 2025
#147420 changed the unrolling preferences to permit unrolling of
non-auto-vectorized loops by checking for the isvectorized attribute.
However, when a loop is vectorized this attribute is put on both the
vector loop and the scalar epilogue, so this change prevented the scalar
epilogue from being unrolled.

Restore the previous behaviour of unrolling the scalar epilogue by
checking both for the isvectorized attribute and vector instructions in
the loop.
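
A rough sketch of the combined condition described above, using the same helpers that appear in the patch (findStringMetadataForLoop and the per-instruction type check); it is not the exact committed code:

// Sketch: skip unrolling only when the loop both carries the vectorizer's
// metadata and actually contains vector instructions. The scalar epilogue
// has the metadata but no vector ops, so it stays eligible for unrolling.
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Transforms/Utils/LoopUtils.h"
using namespace llvm;

static bool shouldSkipUnrolling(const Loop *L) {
  if (!findStringMetadataForLoop(L, "llvm.loop.isvectorized"))
    return false;
  for (BasicBlock *BB : L->getBlocks())
    for (Instruction &I : *BB)
      if (I.getType()->isVectorTy())
        return true;  // Vectorized body: leave it to the vectorizer's choices.
  return false;       // Scalar epilogue: let the unroller consider it.
}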
artagnon added a commit to artagnon/llvm-project that referenced this pull request Jul 31, 2025
Adjust the unrolling preferences to unroll hand-vectorized code, as well
as the scalar remainder of a vectorized loop. Inspired by a similar
effort in AArch64: see llvm#147420 and llvm#151164.
artagnon added a commit that referenced this pull request Jul 31, 2025
Adjust the unrolling preferences to unroll hand-vectorized code, as well
as the scalar remainder of a vectorized loop. Inspired by a similar
effort in AArch64: see #147420 and #151164.