@alexey-bataev

This patch adds initial support for non-power-of-2 store-load forwarding
distances on targets that (potentially!) support them.

Created using spr 1.3.5
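
For readers skimming the description, the backward search that the patch adds to `couldPreventStoreLoadForward` can be approximated by the standalone sketch below. It is not the patch's code: `largestForwardingSafeWidth` and its parameters are hypothetical names, all quantities are assumed to be in bytes, `MinIters` stands in for `NumItersForStoreLoadThroughMemory`, and the extra `VF >= TypeByteSize` guard is an addition here to keep the unsigned subtraction from wrapping.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the patch. All quantities are in bytes.
// Walks candidate vector widths downward from UpperBound, one element at a
// time, and returns the first (largest) width that either divides the
// dependence distance evenly or leaves at least MinIters vector iterations
// between the store and the dependent load. Returns 0 if no width above
// LowerBound qualifies.
uint64_t largestForwardingSafeWidth(uint64_t Distance, uint64_t TypeByteSize,
                                    uint64_t UpperBound, uint64_t LowerBound,
                                    uint64_t MinIters) {
  assert(TypeByteSize > 0 && "element size must be non-zero");
  for (uint64_t VF = UpperBound; VF > LowerBound && VF >= TypeByteSize;
       VF -= TypeByteSize) {
    if (Distance % VF == 0 || Distance / VF >= MinIters)
      return VF;
  }
  return 0;
}
```

With a 12-byte distance (three 4-byte elements), a 64-byte upper bound, and a 4-byte lower bound, the search returns 12, a width that a power-of-2-only scan would never try.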
@llvmbot added the vectorizers, llvm:analysis, and llvm:transforms labels Apr 29, 2025
@alexey-bataev requested a review from fhahn April 29, 2025 20:45
@llvmbot

llvmbot commented Apr 29, 2025

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: Alexey Bataev (alexey-bataev)

Changes

This patch adds initial support for non-power-of-2 store-load forwarding
distances on targets that (potentially!) support them.


Patch is 20.93 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137873.diff

4 Files Affected:

  • (modified) llvm/include/llvm/Analysis/LoopAccessAnalysis.h (+32-11)
  • (modified) llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h (+1-1)
  • (modified) llvm/lib/Analysis/LoopAccessAnalysis.cpp (+49-8)
  • (modified) llvm/test/Analysis/LoopAccessAnalysis/safe-with-dep-distance-non-power-of-2.ll (+136-68)
diff --git a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
index f715e0ec8dbb4..02647adea95a8 100644
--- a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
+++ b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
@@ -180,9 +180,10 @@ class MemoryDepChecker {
 
   MemoryDepChecker(PredicatedScalarEvolution &PSE, const Loop *L,
                    const DenseMap<Value *, const SCEV *> &SymbolicStrides,
-                   unsigned MaxTargetVectorWidthInBits)
+                   unsigned MaxTargetVectorWidthInBits, bool AllowNonPow2Deps)
       : PSE(PSE), InnermostLoop(L), SymbolicStrides(SymbolicStrides),
-        MaxTargetVectorWidthInBits(MaxTargetVectorWidthInBits) {}
+        MaxTargetVectorWidthInBits(MaxTargetVectorWidthInBits),
+        AllowNonPow2Deps(AllowNonPow2Deps) {}
 
   /// Register the location (instructions are given increasing numbers)
   /// of a write access.
@@ -218,17 +219,29 @@ class MemoryDepChecker {
 
   /// Return true if there are no store-load forwarding dependencies.
   bool isSafeForAnyStoreLoadForwardDistances() const {
-    return MaxStoreLoadForwardSafeDistanceInBits ==
-           std::numeric_limits<uint64_t>::max();
+    return MaxPowerOf2StoreLoadForwardSafeDistanceInBits ==
+               std::numeric_limits<uint64_t>::max() &&
+           MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits ==
+               std::numeric_limits<uint64_t>::max();
   }
 
-  /// Return safe power-of-2 number of elements, which do not prevent store-load
-  /// forwarding, multiplied by the size of the elements in bits.
-  uint64_t getStoreLoadForwardSafeDistanceInBits() const {
+  /// Return safe number of elements, which do not prevent store-load
+  /// forwarding, multiplied by the size of the elements in bits (power-of-2).
+  uint64_t getPowerOf2StoreLoadForwardSafeDistanceInBits() const {
     assert(!isSafeForAnyStoreLoadForwardDistances() &&
            "Expected the distance, that prevent store-load forwarding, to be "
            "set.");
-    return MaxStoreLoadForwardSafeDistanceInBits;
+    return MaxPowerOf2StoreLoadForwardSafeDistanceInBits;
+  }
+
+  /// Return safe number of elements, which do not prevent store-load
+  /// forwarding, multiplied by the size of the elements in bits
+  /// (non-power-of-2).
+  uint64_t getNonPowerOf2StoreLoadForwardSafeDistanceInBits() const {
+    assert(!isSafeForAnyStoreLoadForwardDistances() &&
+           "Expected the distance, that prevent store-load forwarding, to be "
+           "set.");
+    return MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits;
   }
 
   /// In same cases when the dependency check fails we can still
@@ -319,9 +332,14 @@ class MemoryDepChecker {
   /// restrictive.
   uint64_t MaxSafeVectorWidthInBits = -1U;
 
-  /// Maximum power-of-2 number of elements, which do not prevent store-load
-  /// forwarding, multiplied by the size of the elements in bits.
-  uint64_t MaxStoreLoadForwardSafeDistanceInBits =
+  /// Maximum number of elements, which do not prevent store-load forwarding,
+  /// multiplied by the size of the elements in bits (power-of-2).
+  uint64_t MaxPowerOf2StoreLoadForwardSafeDistanceInBits =
+      std::numeric_limits<uint64_t>::max();
+
+  /// Maximum number of elements, which do not prevent store-load forwarding,
+  /// multiplied by the size of the elements in bits (non-power-of-2).
+  uint64_t MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits =
       std::numeric_limits<uint64_t>::max();
 
   /// If we see a non-constant dependence distance we can still try to
@@ -348,6 +366,9 @@ class MemoryDepChecker {
   /// backwards-vectorizable or unknown (triggering a runtime check).
   unsigned MaxTargetVectorWidthInBits = 0;
 
+  /// True if current target supports non-power-of-2 dependence distances.
+  bool AllowNonPow2Deps = false;
+
   /// Mapping of SCEV expressions to their expanded pointer bounds (pair of
   /// start and end pointer expressions).
   DenseMap<std::pair<const SCEV *, Type *>,
diff --git a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
index d654ac3ec9273..65d9938c8a0cd 100644
--- a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
+++ b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
@@ -415,7 +415,7 @@ class LoopVectorizationLegality {
   /// Return safe power-of-2 number of elements, which do not prevent store-load
   /// forwarding and safe to operate simultaneously.
   uint64_t getMaxStoreLoadForwardSafeDistanceInBits() const {
-    return LAI->getDepChecker().getStoreLoadForwardSafeDistanceInBits();
+    return LAI->getDepChecker().getPowerOf2StoreLoadForwardSafeDistanceInBits();
   }
 
   /// Returns true if vector representation of the instruction \p I
diff --git a/llvm/lib/Analysis/LoopAccessAnalysis.cpp b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
index c65bb8be8b996..30fd50bd15303 100644
--- a/llvm/lib/Analysis/LoopAccessAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -1757,7 +1757,8 @@ bool MemoryDepChecker::couldPreventStoreLoadForward(uint64_t Distance,
   // Maximum vector factor.
   uint64_t MaxVFWithoutSLForwardIssuesPowerOf2 =
       std::min(VectorizerParams::MaxVectorWidth * TypeByteSize,
-               MaxStoreLoadForwardSafeDistanceInBits);
+               MaxPowerOf2StoreLoadForwardSafeDistanceInBits);
+  uint64_t MaxVFWithoutSLForwardIssuesNonPowerOf2 = 0;
 
   // Compute the smallest VF at which the store and load would be misaligned.
   for (uint64_t VF = 2 * TypeByteSize;
@@ -1769,24 +1770,61 @@ bool MemoryDepChecker::couldPreventStoreLoadForward(uint64_t Distance,
       break;
     }
   }
+  // RISCV VLA supports non-power-2 vector factor. So, we iterate in a
+  // backward order to find largest VF, which allows aligned stores-loads or
+  // the number of iterations between conflicting memory addresses is not less
+  // than 8 (NumItersForStoreLoadThroughMemory).
+  if (AllowNonPow2Deps) {
+    MaxVFWithoutSLForwardIssuesNonPowerOf2 =
+        std::min(8 * VectorizerParams::MaxVectorWidth / TypeByteSize,
+                 MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits);
+
+    for (uint64_t VF = MaxVFWithoutSLForwardIssuesNonPowerOf2;
+         VF > MaxVFWithoutSLForwardIssuesPowerOf2; VF -= TypeByteSize) {
+      if (Distance % VF == 0 ||
+          Distance / VF >= NumItersForStoreLoadThroughMemory) {
+        uint64_t GCD =
+            isSafeForAnyStoreLoadForwardDistances()
+                ? VF
+                : std::gcd(MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits,
+                           VF);
+        MaxVFWithoutSLForwardIssuesNonPowerOf2 = GCD;
+        break;
+      }
+    }
+  }
 
-  if (MaxVFWithoutSLForwardIssuesPowerOf2 < 2 * TypeByteSize) {
+  if (MaxVFWithoutSLForwardIssuesPowerOf2 < 2 * TypeByteSize &&
+      MaxVFWithoutSLForwardIssuesNonPowerOf2 < 2 * TypeByteSize) {
     LLVM_DEBUG(
         dbgs() << "LAA: Distance " << Distance
                << " that could cause a store-load forwarding conflict\n");
     return true;
   }
 
+  // Handle non-power-2 store-load forwarding distance, power-of-2 distance can
+  // be calculated.
+  if (AllowNonPow2Deps && CommonStride &&
+      MaxVFWithoutSLForwardIssuesNonPowerOf2 <
+          MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits &&
+      MaxVFWithoutSLForwardIssuesNonPowerOf2 !=
+          8 * VectorizerParams::MaxVectorWidth / TypeByteSize) {
+    uint64_t MaxVF = MaxVFWithoutSLForwardIssuesNonPowerOf2 / CommonStride;
+    uint64_t MaxVFInBits = MaxVF * TypeByteSize * 8;
+    MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits =
+        std::min(MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits, MaxVFInBits);
+  }
+
   if (CommonStride &&
       MaxVFWithoutSLForwardIssuesPowerOf2 <
-          MaxStoreLoadForwardSafeDistanceInBits &&
+          MaxPowerOf2StoreLoadForwardSafeDistanceInBits &&
       MaxVFWithoutSLForwardIssuesPowerOf2 !=
           VectorizerParams::MaxVectorWidth * TypeByteSize) {
     uint64_t MaxVF =
         bit_floor(MaxVFWithoutSLForwardIssuesPowerOf2 / CommonStride);
     uint64_t MaxVFInBits = MaxVF * TypeByteSize * 8;
-    MaxStoreLoadForwardSafeDistanceInBits =
-        std::min(MaxStoreLoadForwardSafeDistanceInBits, MaxVFInBits);
+    MaxPowerOf2StoreLoadForwardSafeDistanceInBits =
+        std::min(MaxPowerOf2StoreLoadForwardSafeDistanceInBits, MaxVFInBits);
   }
   return false;
 }
@@ -2985,8 +3023,9 @@ LoopAccessInfo::LoopAccessInfo(Loop *L, ScalarEvolution *SE,
     MaxTargetVectorWidthInBits =
         TTI->getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector) * 2;
 
-  DepChecker = std::make_unique<MemoryDepChecker>(*PSE, L, SymbolicStrides,
-                                                  MaxTargetVectorWidthInBits);
+  DepChecker = std::make_unique<MemoryDepChecker>(
+      *PSE, L, SymbolicStrides, MaxTargetVectorWidthInBits,
+      TTI && TTI->hasActiveVectorLength(0, nullptr, Align()));
   PtrRtChecking = std::make_unique<RuntimePointerChecking>(*DepChecker, SE);
   if (canAnalyzeLoop())
     CanVecMem = analyzeLoop(AA, LI, TLI, DT);
@@ -3000,7 +3039,9 @@ void LoopAccessInfo::print(raw_ostream &OS, unsigned Depth) const {
       OS << " with a maximum safe vector width of "
          << DC.getMaxSafeVectorWidthInBits() << " bits";
     if (!DC.isSafeForAnyStoreLoadForwardDistances()) {
-      uint64_t SLDist = DC.getStoreLoadForwardSafeDistanceInBits();
+      uint64_t SLDist = DC.getNonPowerOf2StoreLoadForwardSafeDistanceInBits();
+      if (SLDist == std::numeric_limits<uint64_t>::max())
+        SLDist = DC.getPowerOf2StoreLoadForwardSafeDistanceInBits();
       OS << ", with a maximum safe store-load forward width of " << SLDist
          << " bits";
     }
diff --git a/llvm/test/Analysis/LoopAccessAnalysis/safe-with-dep-distance-non-power-of-2.ll b/llvm/test/Analysis/LoopAccessAnalysis/safe-with-dep-distance-non-power-of-2.ll
index 79dcfd2c4c08d..15fb79807b965 100644
--- a/llvm/test/Analysis/LoopAccessAnalysis/safe-with-dep-distance-non-power-of-2.ll
+++ b/llvm/test/Analysis/LoopAccessAnalysis/safe-with-dep-distance-non-power-of-2.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 5
-; RUN: opt -passes='print<access-info>' -disable-output -mtriple=riscv64 -mattr=+v < %s 2>&1 | FileCheck %s
-; RUN: opt -passes='print<access-info>' -disable-output -mtriple=x86_64 < %s 2>&1 | FileCheck %s
+; RUN: opt -passes='print<access-info>' -disable-output -mtriple=riscv64 -mattr=+v < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,RISCV64
+; RUN: opt -passes='print<access-info>' -disable-output -mtriple=x86_64 < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,X86_64
 
 ; REQUIRES: riscv-registered-target, x86-registered-target
 
@@ -41,21 +41,37 @@ exit:
 ; Dependence distance is less than trip count, thus we must prove that
 ; chosen VF guaranteed to be less than dependence distance.
 define void @test_may_clobber1(ptr %p) {
-; CHECK-LABEL: 'test_may_clobber1'
-; CHECK-NEXT:    loop:
-; CHECK-NEXT:      Memory dependences are safe with a maximum safe vector width of 6400 bits, with a maximum safe store-load forward width of 256 bits
-; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        BackwardVectorizable:
-; CHECK-NEXT:            %v = load i64, ptr %a1, align 32 ->
-; CHECK-NEXT:            store i64 %v, ptr %a2, align 32
-; CHECK-EMPTY:
-; CHECK-NEXT:      Run-time memory checks:
-; CHECK-NEXT:      Grouped accesses:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Non vectorizable stores to invariant address were not found in loop.
-; CHECK-NEXT:      SCEV assumptions:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Expressions re-written:
+; RISCV64-LABEL: 'test_may_clobber1'
+; RISCV64-NEXT:    loop:
+; RISCV64-NEXT:      Memory dependences are safe with a maximum safe vector width of 6400 bits, with a maximum safe store-load forward width of 320 bits
+; RISCV64-NEXT:      Dependences:
+; RISCV64-NEXT:        BackwardVectorizable:
+; RISCV64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; RISCV64-NEXT:            store i64 %v, ptr %a2, align 32
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Run-time memory checks:
+; RISCV64-NEXT:      Grouped accesses:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; RISCV64-NEXT:      SCEV assumptions:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Expressions re-written:
+;
+; X86_64-LABEL: 'test_may_clobber1'
+; X86_64-NEXT:    loop:
+; X86_64-NEXT:      Memory dependences are safe with a maximum safe vector width of 6400 bits, with a maximum safe store-load forward width of 256 bits
+; X86_64-NEXT:      Dependences:
+; X86_64-NEXT:        BackwardVectorizable:
+; X86_64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; X86_64-NEXT:            store i64 %v, ptr %a2, align 32
+; X86_64-EMPTY:
+; X86_64-NEXT:      Run-time memory checks:
+; X86_64-NEXT:      Grouped accesses:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; X86_64-NEXT:      SCEV assumptions:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Expressions re-written:
 ;
 entry:
   br label %loop
@@ -76,22 +92,38 @@ exit:
 }
 
 define void @test_may_clobber2(ptr %p) {
-; CHECK-LABEL: 'test_may_clobber2'
-; CHECK-NEXT:    loop:
-; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Backward loop carried data dependence that prevents store-to-load forwarding.
-; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        BackwardVectorizableButPreventsForwarding:
-; CHECK-NEXT:            %v = load i64, ptr %a1, align 32 ->
-; CHECK-NEXT:            store i64 %v, ptr %a2, align 32
-; CHECK-EMPTY:
-; CHECK-NEXT:      Run-time memory checks:
-; CHECK-NEXT:      Grouped accesses:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Non vectorizable stores to invariant address were not found in loop.
-; CHECK-NEXT:      SCEV assumptions:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Expressions re-written:
+; RISCV64-LABEL: 'test_may_clobber2'
+; RISCV64-NEXT:    loop:
+; RISCV64-NEXT:      Memory dependences are safe with a maximum safe vector width of 576 bits, with a maximum safe store-load forward width of 192 bits
+; RISCV64-NEXT:      Dependences:
+; RISCV64-NEXT:        BackwardVectorizable:
+; RISCV64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; RISCV64-NEXT:            store i64 %v, ptr %a2, align 32
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Run-time memory checks:
+; RISCV64-NEXT:      Grouped accesses:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; RISCV64-NEXT:      SCEV assumptions:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Expressions re-written:
+;
+; X86_64-LABEL: 'test_may_clobber2'
+; X86_64-NEXT:    loop:
+; X86_64-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
+; X86_64-NEXT:  Backward loop carried data dependence that prevents store-to-load forwarding.
+; X86_64-NEXT:      Dependences:
+; X86_64-NEXT:        BackwardVectorizableButPreventsForwarding:
+; X86_64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; X86_64-NEXT:            store i64 %v, ptr %a2, align 32
+; X86_64-EMPTY:
+; X86_64-NEXT:      Run-time memory checks:
+; X86_64-NEXT:      Grouped accesses:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; X86_64-NEXT:      SCEV assumptions:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Expressions re-written:
 ;
 entry:
   br label %loop
@@ -112,21 +144,37 @@ exit:
 }
 
 define void @test_may_clobber3(ptr %p) {
-; CHECK-LABEL: 'test_may_clobber3'
-; CHECK-NEXT:    loop:
-; CHECK-NEXT:      Memory dependences are safe with a maximum safe vector width of 640 bits, with a maximum safe store-load forward width of 128 bits
-; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        BackwardVectorizable:
-; CHECK-NEXT:            %v = load i64, ptr %a1, align 32 ->
-; CHECK-NEXT:            store i64 %v, ptr %a2, align 32
-; CHECK-EMPTY:
-; CHECK-NEXT:      Run-time memory checks:
-; CHECK-NEXT:      Grouped accesses:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Non vectorizable stores to invariant address were not found in loop.
-; CHECK-NEXT:      SCEV assumptions:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Expressions re-written:
+; RISCV64-LABEL: 'test_may_clobber3'
+; RISCV64-NEXT:    loop:
+; RISCV64-NEXT:      Memory dependences are safe with a maximum safe vector width of 640 bits, with a maximum safe store-load forward width of 320 bits
+; RISCV64-NEXT:      Dependences:
+; RISCV64-NEXT:        BackwardVectorizable:
+; RISCV64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; RISCV64-NEXT:            store i64 %v, ptr %a2, align 32
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Run-time memory checks:
+; RISCV64-NEXT:      Grouped accesses:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; RISCV64-NEXT:      SCEV assumptions:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Expressions re-written:
+;
+; X86_64-LABEL: 'test_may_clobber3'
+; X86_64-NEXT:    loop:
+; X86_64-NEXT:      Memory dependences are safe with a maximum safe vector width of 640 bits, with a maximum safe store-load forward width of 128 bits
+; X86_64-NEXT:      Dependences:
+; X86_64-NEXT:        BackwardVectorizable:
+; X86_64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; X86_64-NEXT:            store i64 %v, ptr %a2, align 32
+; X86_64-EMPTY:
+; X86_64-NEXT:      Run-time memory checks:
+; X86_64-NEXT:      Grouped accesses:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; X86_64-NEXT:      SCEV assumptions:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Expressions re-written:
 ;
 entry:
   br label %loop
@@ -215,26 +263,46 @@ exit:
 }
 
 define void @non_power_2_storeloadforward(ptr %A) {
-; CHECK-LABEL: 'non_power_2_storeloadforward'
-; CHECK-NEXT:    loop:
-; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Backward loop carried data dependence that prevents store-to-load forwarding.
-; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        Forward:
-; CHECK-NEXT:            %3 = load i32, ptr %gep.iv.4, align 4 ->
-; CHECK-NEXT:            store i32 %add3, ptr %gep.iv, align 4
-; CHECK-EMPTY:
-; CHECK-NEXT:        BackwardVectorizableButPreventsForwarding:
-; CHECK-NEXT:            %1 = load i32, ptr %gep.iv.sub.3, align 4 ->
-; CHECK-NEXT:            store i32 %add3, ptr %gep.iv, align 4
-; CHECK-EMPTY:
-; CHECK-NEXT:      Run-time memory checks:
-; CHECK-NEXT:      Grouped accesses:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Non vectorizable stores to invariant address were not found in loop.
-; CHECK-NEXT:      SCEV assumptions:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Expressions re-written:
+; RISCV64-LABEL: 'non_power_2_storeloadforward'
+; RISCV64-NEXT:    loop:
+; RISCV64-NEXT:      Memory dependences are safe with a maximum safe vector width of 96 bits, with a maximum safe store-load forward width of 96 bits
+; RISCV64-NEXT:      Dependences:
+; RISCV64-NEXT:        Forward:
+; RISCV64-NEXT:            %3 = load i32, ptr %gep.iv.4, align 4 ->
+; RISCV64-NEXT:            store i32 %add3, ptr %gep.iv, align 4
+; RISCV64-EMPTY:
+; RISCV64-NEXT:        BackwardVectorizable:
+; RISCV64-NEXT:            %1 = load i32, ptr %gep.iv.sub.3, align 4 ->
+; RISCV64-NEXT:            store i32 %add3, ptr %gep.iv, align 4
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Run-time memory checks:
+; RISCV64-NEXT:      Grouped accesses:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; RISCV64-NEXT:      SCEV assumptions:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Expressions re-written:
+;
+; X86_64-LABEL: 'non_power_2_storeloadforward'
+; X86_64-NEXT:    loop:
+; X86_64-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable)...
[truncated]
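
The two-limit bookkeeping in the diff above (a power-of-2 limit and a non-power-of-2 limit, each defaulting to the `uint64_t` maximum as an "unset" sentinel) and the fallback used in `LoopAccessInfo::print` can be summarised with a minimal sketch. The `ForwardDistances` struct and its member names are hypothetical and exist only for illustration; the patch keeps these values as members of `MemoryDepChecker`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the two limits tracked by MemoryDepChecker.
struct ForwardDistances {
  // Sentinel meaning "no restriction recorded", as in the patch.
  static constexpr uint64_t Unset = std::numeric_limits<uint64_t>::max();

  uint64_t Pow2Bits = Unset;    // power-of-2 safe distance, in bits
  uint64_t NonPow2Bits = Unset; // non-power-of-2 safe distance, in bits

  // True only when neither limit has been tightened.
  bool safeForAny() const { return Pow2Bits == Unset && NonPow2Bits == Unset; }

  // Prefer the non-power-of-2 limit when it is set, otherwise fall back to
  // the power-of-2 limit, mirroring the order used when printing.
  uint64_t reportedBits() const {
    assert(!safeForAny() && "no forwarding restriction recorded");
    return NonPow2Bits != Unset ? NonPow2Bits : Pow2Bits;
  }
};
```

For example, setting only `Pow2Bits = 256` reports 256, and additionally setting `NonPow2Bits = 320` reports 320. These inputs are borrowed from the `test_may_clobber1` expectations purely as sample values.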

@llvmbot

llvmbot commented Apr 29, 2025

@llvm/pr-subscribers-vectorizers

-; CHECK-NEXT:      Non vectorizable stores to invariant address were not found in loop.
-; CHECK-NEXT:      SCEV assumptions:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Expressions re-written:
+; RISCV64-LABEL: 'test_may_clobber1'
+; RISCV64-NEXT:    loop:
+; RISCV64-NEXT:      Memory dependences are safe with a maximum safe vector width of 6400 bits, with a maximum safe store-load forward width of 320 bits
+; RISCV64-NEXT:      Dependences:
+; RISCV64-NEXT:        BackwardVectorizable:
+; RISCV64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; RISCV64-NEXT:            store i64 %v, ptr %a2, align 32
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Run-time memory checks:
+; RISCV64-NEXT:      Grouped accesses:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; RISCV64-NEXT:      SCEV assumptions:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Expressions re-written:
+;
+; X86_64-LABEL: 'test_may_clobber1'
+; X86_64-NEXT:    loop:
+; X86_64-NEXT:      Memory dependences are safe with a maximum safe vector width of 6400 bits, with a maximum safe store-load forward width of 256 bits
+; X86_64-NEXT:      Dependences:
+; X86_64-NEXT:        BackwardVectorizable:
+; X86_64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; X86_64-NEXT:            store i64 %v, ptr %a2, align 32
+; X86_64-EMPTY:
+; X86_64-NEXT:      Run-time memory checks:
+; X86_64-NEXT:      Grouped accesses:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; X86_64-NEXT:      SCEV assumptions:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Expressions re-written:
 ;
 entry:
   br label %loop
@@ -76,22 +92,38 @@ exit:
 }
 
 define void @test_may_clobber2(ptr %p) {
-; CHECK-LABEL: 'test_may_clobber2'
-; CHECK-NEXT:    loop:
-; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Backward loop carried data dependence that prevents store-to-load forwarding.
-; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        BackwardVectorizableButPreventsForwarding:
-; CHECK-NEXT:            %v = load i64, ptr %a1, align 32 ->
-; CHECK-NEXT:            store i64 %v, ptr %a2, align 32
-; CHECK-EMPTY:
-; CHECK-NEXT:      Run-time memory checks:
-; CHECK-NEXT:      Grouped accesses:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Non vectorizable stores to invariant address were not found in loop.
-; CHECK-NEXT:      SCEV assumptions:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Expressions re-written:
+; RISCV64-LABEL: 'test_may_clobber2'
+; RISCV64-NEXT:    loop:
+; RISCV64-NEXT:      Memory dependences are safe with a maximum safe vector width of 576 bits, with a maximum safe store-load forward width of 192 bits
+; RISCV64-NEXT:      Dependences:
+; RISCV64-NEXT:        BackwardVectorizable:
+; RISCV64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; RISCV64-NEXT:            store i64 %v, ptr %a2, align 32
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Run-time memory checks:
+; RISCV64-NEXT:      Grouped accesses:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; RISCV64-NEXT:      SCEV assumptions:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Expressions re-written:
+;
+; X86_64-LABEL: 'test_may_clobber2'
+; X86_64-NEXT:    loop:
+; X86_64-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
+; X86_64-NEXT:  Backward loop carried data dependence that prevents store-to-load forwarding.
+; X86_64-NEXT:      Dependences:
+; X86_64-NEXT:        BackwardVectorizableButPreventsForwarding:
+; X86_64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; X86_64-NEXT:            store i64 %v, ptr %a2, align 32
+; X86_64-EMPTY:
+; X86_64-NEXT:      Run-time memory checks:
+; X86_64-NEXT:      Grouped accesses:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; X86_64-NEXT:      SCEV assumptions:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Expressions re-written:
 ;
 entry:
   br label %loop
@@ -112,21 +144,37 @@ exit:
 }
 
 define void @test_may_clobber3(ptr %p) {
-; CHECK-LABEL: 'test_may_clobber3'
-; CHECK-NEXT:    loop:
-; CHECK-NEXT:      Memory dependences are safe with a maximum safe vector width of 640 bits, with a maximum safe store-load forward width of 128 bits
-; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        BackwardVectorizable:
-; CHECK-NEXT:            %v = load i64, ptr %a1, align 32 ->
-; CHECK-NEXT:            store i64 %v, ptr %a2, align 32
-; CHECK-EMPTY:
-; CHECK-NEXT:      Run-time memory checks:
-; CHECK-NEXT:      Grouped accesses:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Non vectorizable stores to invariant address were not found in loop.
-; CHECK-NEXT:      SCEV assumptions:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Expressions re-written:
+; RISCV64-LABEL: 'test_may_clobber3'
+; RISCV64-NEXT:    loop:
+; RISCV64-NEXT:      Memory dependences are safe with a maximum safe vector width of 640 bits, with a maximum safe store-load forward width of 320 bits
+; RISCV64-NEXT:      Dependences:
+; RISCV64-NEXT:        BackwardVectorizable:
+; RISCV64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; RISCV64-NEXT:            store i64 %v, ptr %a2, align 32
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Run-time memory checks:
+; RISCV64-NEXT:      Grouped accesses:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; RISCV64-NEXT:      SCEV assumptions:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Expressions re-written:
+;
+; X86_64-LABEL: 'test_may_clobber3'
+; X86_64-NEXT:    loop:
+; X86_64-NEXT:      Memory dependences are safe with a maximum safe vector width of 640 bits, with a maximum safe store-load forward width of 128 bits
+; X86_64-NEXT:      Dependences:
+; X86_64-NEXT:        BackwardVectorizable:
+; X86_64-NEXT:            %v = load i64, ptr %a1, align 32 ->
+; X86_64-NEXT:            store i64 %v, ptr %a2, align 32
+; X86_64-EMPTY:
+; X86_64-NEXT:      Run-time memory checks:
+; X86_64-NEXT:      Grouped accesses:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; X86_64-NEXT:      SCEV assumptions:
+; X86_64-EMPTY:
+; X86_64-NEXT:      Expressions re-written:
 ;
 entry:
   br label %loop
@@ -215,26 +263,46 @@ exit:
 }
 
 define void @non_power_2_storeloadforward(ptr %A) {
-; CHECK-LABEL: 'non_power_2_storeloadforward'
-; CHECK-NEXT:    loop:
-; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Backward loop carried data dependence that prevents store-to-load forwarding.
-; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        Forward:
-; CHECK-NEXT:            %3 = load i32, ptr %gep.iv.4, align 4 ->
-; CHECK-NEXT:            store i32 %add3, ptr %gep.iv, align 4
-; CHECK-EMPTY:
-; CHECK-NEXT:        BackwardVectorizableButPreventsForwarding:
-; CHECK-NEXT:            %1 = load i32, ptr %gep.iv.sub.3, align 4 ->
-; CHECK-NEXT:            store i32 %add3, ptr %gep.iv, align 4
-; CHECK-EMPTY:
-; CHECK-NEXT:      Run-time memory checks:
-; CHECK-NEXT:      Grouped accesses:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Non vectorizable stores to invariant address were not found in loop.
-; CHECK-NEXT:      SCEV assumptions:
-; CHECK-EMPTY:
-; CHECK-NEXT:      Expressions re-written:
+; RISCV64-LABEL: 'non_power_2_storeloadforward'
+; RISCV64-NEXT:    loop:
+; RISCV64-NEXT:      Memory dependences are safe with a maximum safe vector width of 96 bits, with a maximum safe store-load forward width of 96 bits
+; RISCV64-NEXT:      Dependences:
+; RISCV64-NEXT:        Forward:
+; RISCV64-NEXT:            %3 = load i32, ptr %gep.iv.4, align 4 ->
+; RISCV64-NEXT:            store i32 %add3, ptr %gep.iv, align 4
+; RISCV64-EMPTY:
+; RISCV64-NEXT:        BackwardVectorizable:
+; RISCV64-NEXT:            %1 = load i32, ptr %gep.iv.sub.3, align 4 ->
+; RISCV64-NEXT:            store i32 %add3, ptr %gep.iv, align 4
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Run-time memory checks:
+; RISCV64-NEXT:      Grouped accesses:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Non vectorizable stores to invariant address were not found in loop.
+; RISCV64-NEXT:      SCEV assumptions:
+; RISCV64-EMPTY:
+; RISCV64-NEXT:      Expressions re-written:
+;
+; X86_64-LABEL: 'non_power_2_storeloadforward'
+; X86_64-NEXT:    loop:
+; X86_64-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable)...
[truncated]

@alexey-bataev alexey-bataev requested a review from ayalz April 29, 2025 20:45
@alexey-bataev
Member Author

Ping!

@artagnon artagnon changed the title [LV][LAA]Add initial support for non-power-of-2 store-load forwarding distance [LAA] Add initial support for non-power-of-2 store-load forwarding distance May 13, 2025
Contributor

@artagnon artagnon left a comment

Test updates look good.

MaxTargetVectorWidthInBits);
DepChecker = std::make_unique<MemoryDepChecker>(
*PSE, L, SymbolicStrides, MaxTargetVectorWidthInBits,
TTI && TTI->hasActiveVectorLength(0, nullptr, Align()));
Contributor

Why does the prototype of hasActiveVectorLength accept arguments that are ignored by RISC-V? Is it overridden by any other target that uses the arguments?

Member Author

PowerPC introduced this and supports it for loads/stores. RISC-V supports it for all instructions, so the arguments do not matter there.

Contributor

I think this needs a rebase

Comment on lines 369 to 370
/// True if current target supports non-power-of-2 dependence distances.
bool AllowNonPow2Deps = false;
Contributor

Shouldn't the comment say "if target supports vector predicated intrinsics"?

Member Author

@alexey-bataev alexey-bataev May 13, 2025

Not sure this is correct. The fact that it supports predicated intrinsics does not mean it supports non-power-of-2 dependence distances.

Contributor

I am not sure the name/comment is accurate. A dependence could have any distance and still be supported, e.g. a forward dependence could have a distance of 3, which is totally fine.

Contributor

How about AllowNonPow2StoreLoadForwardDistance?

With the comment updated to clarify that this only applies to computing the store-load forward distance.

Comment on lines +335 to 343
/// Maximum number of elements, which do not prevent store-load forwarding,
/// multiplied by the size of the elements in bits (power-of-2).
uint64_t MaxPowerOf2StoreLoadForwardSafeDistanceInBits =
std::numeric_limits<uint64_t>::max();

/// Maximum number of elements, which do not prevent store-load forwarding,
/// multiplied by the size of the elements in bits (non-power-of-2).
uint64_t MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits =
std::numeric_limits<uint64_t>::max();
Contributor

Seems redundant to have both, when only one is ever going to be used.

Member Author

We don't know before vectorization which one is going to be used, so we need to keep both.

Comment on lines +3042 to +3044
uint64_t SLDist = DC.getNonPowerOf2StoreLoadForwardSafeDistanceInBits();
if (SLDist == std::numeric_limits<uint64_t>::max())
SLDist = DC.getPowerOf2StoreLoadForwardSafeDistanceInBits();
Contributor

This logic would be eliminated by not having two different fields.

8 * VectorizerParams::MaxVectorWidth / TypeByteSize) {
uint64_t MaxVF = MaxVFWithoutSLForwardIssuesNonPowerOf2 / CommonStride;
uint64_t MaxVFInBits = MaxVF * TypeByteSize * 8;
MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits =
Contributor

Do we need to compute this separately? Would it instead be possible to always compute the non-power-of-2 version and then have users convert it to the closest power-of-2 if that's what they need?

Member Author

Unfortunately, it won't work. I tried this, but the results can differ because of the CommonStride value.
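
As a rough illustration of the conversion quoted in this thread, the sketch below turns a maximum safe width in bytes into a store-load forward distance in bits, once as is and once rounded down to a power of 2. It is a standalone model, not the patch's code: `toForwardDistanceBits` and `bitFloor` are hypothetical names, and treating `CommonStride` as a stride in bytes is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Largest power of 2 that is <= X (0 for X == 0).
// Hypothetical stand-in for the bit_floor call in the diff.
uint64_t bitFloor(uint64_t X) {
  if (X == 0)
    return 0;
  uint64_t P = 1;
  while (P <= X / 2)
    P *= 2;
  return P;
}

// MaxSafeBytes: widest safe vector access in bytes (hypothetical input).
// CommonStrideBytes: stride shared by both accesses, assumed to be in bytes.
// RoundToPow2: true models the power-of-2 variant, false the other one.
uint64_t toForwardDistanceBits(uint64_t MaxSafeBytes,
                               uint64_t CommonStrideBytes,
                               uint64_t TypeByteSize, bool RoundToPow2) {
  assert(CommonStrideBytes != 0 && "stride must be non-zero");
  uint64_t MaxVF = MaxSafeBytes / CommonStrideBytes;
  if (RoundToPow2)
    MaxVF = bitFloor(MaxVF);
  return MaxVF * TypeByteSize * 8;
}
```

Under these assumptions, with unit-stride i64 accesses (stride 8, element size 8), a 40-byte limit maps to 320 bits without rounding and a 32-byte limit maps to 256 bits with rounding. Those happen to be the two widths printed for test_may_clobber1 on RISCV64 and X86_64, although the byte limits used here are inferred rather than taken from the patch.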

@alexey-bataev
Member Author

Ping!

Created using spr 1.3.5
@alexey-bataev
Member Author

Ping!

break;
}
}
// RISCV VLA supports non-power-2 vector factor. So, we iterate in a
Contributor

This shouldn't mention RISCV; it is allowed if the target has active vector length. It would be good to frame it as such.

Member Author

Fixed

std::min(8 * VectorizerParams::MaxVectorWidth / TypeByteSize,
MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits);

for (uint64_t VF = MaxVFWithoutSLForwardIssuesNonPowerOf2;
Contributor

Why do we count backwards here but forwards for the power-of-2 case?

Member Author

The power-of-2 case tries to find the minimal supported vector factor. The non-power-of-2 case tries to find the largest (but still legal) dependence distance, so it goes backwards.
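
To make the two search directions concrete, here is a minimal standalone model of them. It is a sketch under stated assumptions, not the code from the patch: the function names are hypothetical, `MaxVectorWidth = 64` is an assumed value for `VectorizerParams::MaxVectorWidth`, and the downward loop's step of one element and its lower bound are guesses.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value of VectorizerParams::MaxVectorWidth; not taken from this PR.
constexpr uint64_t MaxVectorWidth = 64;

// True if a vector access of VFBytes may hit a store-load forwarding conflict
// with a store that is DistBytes away.
bool mayConflict(uint64_t DistBytes, uint64_t VFBytes, uint64_t TypeBytes) {
  const uint64_t NumItersThroughMemory = 8 * TypeBytes;
  return DistBytes % VFBytes != 0 &&
         DistBytes / VFBytes < NumItersThroughMemory;
}

// Power-of-2 case: walk upwards, stop at the first conflicting width and keep
// the width just below it. Returns 0 if even two elements conflict.
uint64_t maxPow2SafeBytes(uint64_t DistBytes, uint64_t TypeBytes) {
  assert(TypeBytes != 0);
  uint64_t Max = MaxVectorWidth * TypeBytes;
  for (uint64_t VF = 2 * TypeBytes; VF <= Max; VF *= 2)
    if (mayConflict(DistBytes, VF, TypeBytes)) {
      Max = VF / 2;
      break;
    }
  return Max < 2 * TypeBytes ? 0 : Max;
}

// Non-power-of-2 case: walk downwards one element at a time and return the
// first (largest) width that does not conflict, or 0 if none is found.
uint64_t maxNonPow2SafeBytes(uint64_t DistBytes, uint64_t TypeBytes) {
  assert(TypeBytes != 0);
  for (uint64_t VF = 8 * MaxVectorWidth / TypeBytes; VF > TypeBytes;
       VF -= TypeBytes)
    if (!mayConflict(DistBytes, VF, TypeBytes))
      return VF;
  return 0;
}
```

Assuming the i64 tests have dependence distances of 100, 9 and 10 elements (800, 72 and 80 bytes, as their maximum safe vector widths suggest), this model yields 40, 24 and 40 bytes for the downward search, i.e. 320, 192 and 320 bits, and 32, 0 and 16 bytes for the upward search, i.e. 256 bits, no safe width, and 128 bits.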

Contributor

Ok, I see. But then it behaves differently from MaxVFWithoutSLForwardIssuesPowerOf2. MaxVFWithoutSLForwardIssuesPowerOf2 limits the max VF we can use, and we can also use any VF between 1 and MaxVF.

Is MaxVFWithoutSLForwardIssuesNonPowerOf2 a single non-power-of-2 VF we can use, while other VFs between 1 and MaxVFWithoutSLForwardIssuesNonPowerOf2 may not be used?

Am I understanding correctly that, for example, with max pow2 VF = 2 and MaxNonPowOf2 VF = 9, LV can choose either 2 or 16 (with limiting VF to 9)?

If that's the case, I think it would be good to update the comment for MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits to make this difference from the power-of-2 variant clear, as they would behave quite differently.

Member Author

> Ok, I see. But then it behaves differently from MaxVFWithoutSLForwardIssuesPowerOf2. MaxVFWithoutSLForwardIssuesPowerOf2 limits the max VF we can use, and we can also use any VF between 1 and MaxVF.

Not quite so. Any power-of-2 VF, but not any VF.

> Is MaxVFWithoutSLForwardIssuesNonPowerOf2 a single non-power-of-2 VF we can use, while other VFs between 1 and MaxVFWithoutSLForwardIssuesNonPowerOf2 may not be used?

Not necessarily. We can use any whole divisor of MaxVFWithoutSLForwardIssuesNonPowerOf2. Say, if MaxVFWithoutSLForwardIssuesNonPowerOf2 is 9, then we can use 3 and 9. If it is 6, we can use 2, 3 and 6. All these are safe.

> Am I understanding correctly that, for example, with max pow2 VF = 2 and MaxNonPowOf2 VF = 9, LV can choose either 2 or 16 (with limiting VF to 9)?

The vector factor can be 2, 4, 8 or 16. But with non-power-of-2 we need an extra check (or a special instruction) so that the number of processed elements is limited to 9 or 3 elements only.

> If that's the case, I think it would be good to update the comment for MaxNonPowerOf2StoreLoadForwardSafeDistanceInBits to make this difference from the power-of-2 variant clear, as they would behave quite differently.

Suggestions? Any preferences here?
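
The "whole divisor" rule described in this reply can be sketched as a small helper. This is only an illustration of the rule as stated in the thread; `usableVFs` is a hypothetical name and nothing like it is claimed to exist in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns every VF greater than 1 that evenly divides MaxVF, in ascending
// order. MaxVF plays the role of MaxVFWithoutSLForwardIssuesNonPowerOf2,
// expressed in elements.
std::vector<uint64_t> usableVFs(uint64_t MaxVF) {
  assert(MaxVF != 0 && "MaxVF must be positive");
  std::vector<uint64_t> VFs;
  for (uint64_t VF = 2; VF <= MaxVF; ++VF)
    if (MaxVF % VF == 0)
      VFs.push_back(VF);
  return VFs;
}
```

For a limit of 9 the helper returns 3 and 9, and for a limit of 6 it returns 2, 3 and 6, matching the examples given in the reply.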

Contributor

> Suggestions? Any preferences here?

Thanks for the extra info. I think it would be helpful to update the comment to include a generalized variant of the extra info you shared regarding what VFs can be picked (highlighted below).

> Not necessarily. We can use any whole divisor of MaxVFWithoutSLForwardIssuesNonPowerOf2. Say, if MaxVFWithoutSLForwardIssuesNonPowerOf2 is 9, then we can use 3 and 9. If it is 6, we can use 2, 3 and 6. All these are safe.

I guess most RISCV HW with EVL support will support store-load forwarding with non-power-of-2 distances?

; CHECK-NEXT: Expressions re-written:
; RISCV64-LABEL: 'test_may_clobber2'
; RISCV64-NEXT: loop:
; RISCV64-NEXT: Memory dependences are safe with a maximum safe vector width of 576 bits, with a maximum safe store-load forward width of 192 bits
Contributor

I'm not sure this matches the behavior with power-of-2.

If the max safe non-power-of-2 distance is 192, then shouldn't the max safe power-of-2 distance be 128?

Member Author

That's the case where a non-power-of-2 distance works but a power-of-2 one does not. The distance here is 192 bits, which corresponds to vector factor 3. Vector factor 2 (128 bits) won't work here, because there will be a store-load forwarding conflict on iteration 4 (the distance between store and load is 9). So a 128-bit power-of-2 distance does not work, but a 192-bit non-power-of-2 distance does.
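
The arithmetic behind this answer can be checked with a short sketch. It assumes the distance of 9 is counted in i64 elements, which is consistent with the 576-bit maximum safe vector width in the test. The helper names are hypothetical, and the check is reduced to the divisibility part only.

```cpp
#include <cassert>
#include <cstdint>

// True if the dependence distance (in elements) is a whole multiple of VF,
// so each vector load lines up with exactly one earlier vector store.
bool distanceIsMultipleOfVF(uint64_t DistElems, uint64_t VF) {
  assert(VF != 0 && "VF must be positive");
  return DistElems % VF == 0;
}

// Width in bits of a vector holding VF elements of ElemBits each.
uint64_t widthInBits(uint64_t VF, uint64_t ElemBits) { return VF * ElemBits; }
```

With a distance of 9 elements, VF 2 (128 bits) leaves a remainder of 1 and therefore misaligns the load against the earlier store, while VF 3 (192 bits) divides the distance evenly.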

Contributor

Ok thanks

@alexey-bataev
Member Author

> I am not sure the name/comment is accurate. A dependence could have any distance and still be supported, e.g. a forward dependence could have a distance of 3, which is totally fine.

What is your suggestion here? Any better name/comment?

Created using spr 1.3.5
@alexey-bataev
Member Author

Ping!

@alexey-bataev
Member Author

> I guess most RISCV HW with EVL support will support store-load forwarding with non-power-of-2 distances?

Not most, all. Otherwise, it is not RISC-V.

Contributor

@fhahn fhahn left a comment

> > I guess most RISCV HW with EVL support will support store-load forwarding with non-power-of-2 distances?
>
> Not most, all. Otherwise, it is not RISC-V.

I don't follow. Store-load forwarding is usually a uArch optimization technique, not a feature explicitly defined in the ISA?

@alexey-bataev
Member Author

> > > I guess most RISCV HW with EVL support will support store-load forwarding with non-power-of-2 distances?
> >
> > Not most, all. Otherwise, it is not RISC-V.
>
> I don't follow. Store-load forwarding is usually a uArch optimization technique, not a feature explicitly defined in the ISA?

I meant non-power-of-2 vector length support. For store-load forwarding, yes, at least for the HW I'm aware of. Vendors can improve this later for their particular HW.

Created using spr 1.3.5
@alexey-bataev
Member Author

Ping!

Labels

llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms vectorizers

5 participants