We have been tracking the performance of EVL tail folding in the loop vectorizer on RISC-V for a while, and after much hard work from various contributors we believe it is now generally profitable to enable by default.
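For background: with EVL tail folding the vectorizer emits a single vector loop and no scalar epilogue; each iteration asks the target how many lanes to handle via @llvm.experimental.get.vector.length and passes that explicit vector length (EVL) to VP intrinsics. A minimal hand-written sketch of the shape this takes, for a simple `a[i] += 1` loop (illustrative only; the names, types, and trip-count handling are not taken from any test in this PR):

```llvm
; Hypothetical EVL tail-folded loop over %n i32 elements. Each iteration
; processes at most VLMAX lanes and at most the remaining element count,
; so the tail is absorbed into the main loop and no scalar remainder
; loop is needed.
define void @add_one(ptr %a, i64 %n) {
entry:
  br label %vector.body

vector.body:
  %i = phi i64 [ 0, %entry ], [ %i.next, %vector.body ]
  %avl = sub i64 %n, %i                                  ; elements still to process
  %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 4, i1 true)
  %gep = getelementptr inbounds i32, ptr %a, i64 %i
  %v = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr align 4 %gep, <vscale x 4 x i1> splat (i1 true), i32 %evl)
  %add = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(<vscale x 4 x i32> %v, <vscale x 4 x i32> splat (i32 1), <vscale x 4 x i1> splat (i1 true), i32 %evl)
  call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> %add, ptr align 4 %gep, <vscale x 4 x i1> splat (i1 true), i32 %evl)
  %evl.ext = zext i32 %evl to i64
  %i.next = add nuw i64 %i, %evl.ext                     ; advance by the lanes actually processed
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %vector.body

exit:
  ret void
}
```

On RISC-V the get.vector.length call maps onto vsetvli, which is what lets the tail be handled inside the main loop cheaply.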
With tail folding we see a 21% improvement on 525.x264_r from SPEC CPU 2017 on the BPI-F3 (-march=rva22u64_v -O3 -flto), as well as a 30% geomean code-size reduction across SPEC and TSVC, with no significant regressions detected: https://lnt.lukelau.me/db_default/v4/nts/805?compare_to=804
Now that we are early in the LLVM 22.x development cycle, it seems like a good time to flip the default so any issues can be caught. More EVL-related work is tracked in llvm#123069 and should continue to improve performance.
For the test diffs in this PR, I've avoided changing RUN lines where possible so that the tests accurately reflect the default configuration.
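Concretely, a test whose RUN line only specifies the target, like the hypothetical one below, now exercises EVL tail folding with no extra flags; only tests that explicitly pin a tail-folding style (e.g. via -force-tail-folding-style=data-with-evl, as spelled in older EVL-specific RUN lines) keep forcing it:

```
; RUN: opt -passes=loop-vectorize -mtriple=riscv64 -mattr=+v -S %s | FileCheck %s
```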
Most test diffs are what we would expect from turning on tail folding, but I've gone through and taken notes on the interesting changes:
- blend-any-of-reduction-cost.ll: This test was added to fix an assert from a mismatch between the legacy and VPlan cost models. That costing path is still exercised, but the VPlan in the second test is no longer considered profitable.
- fminimumnum.ll: Most tests fall back to a scalar epilogue because an EVL legality check fails. I don't know why the umax changed from 15 to 16, but it shouldn't make a difference when the minimum trip count is 4096 (which seems way too high to begin with?).
- interleaved-accesses.ll: These tests fall back to gathers/scatters for now, but that should be fixed soon by llvm#151665. A reduced example of the affected access pattern is sketched after this list.
- partial-reduce.ll: Given that partial reductions bail out with EVL tail folding, I've disabled tail folding in these tests for now and left a TODO. Tracked by llvm#151438.
- short-trip-count.ll: These loops are no longer vectorized because their trip counts are below TTI.getMinTripCountTailFoldingThreshold(), which was set to 5 in llvm#132176. Previously we continued to vectorize even these low trip counts when they were a known power of 2, because existing logic in the loop vectorizer overrode the scalar epilogue lowering to CM_ScalarEpilogueNotAllowedLowTripLoop whenever tail folding was off. I believe this now works as intended, even though the loop is no longer vectorized; a reduced example follows below.
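For the short-trip-count case, a reduced, hypothetical loop like the one below (trip count 4, under the threshold of 5) is now left scalar, where previously the power-of-2 trip count let it be vectorized:

```llvm
; Known trip count of 4, below the tail-folding threshold of 5, so the
; vectorizer now leaves this loop alone. (Hypothetical reduction, not
; the actual test contents.)
define void @trip_count_4(ptr %dst) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %gep = getelementptr inbounds i8, ptr %dst, i64 %iv
  store i8 1, ptr %gep, align 1
  %iv.next = add nuw nsw i64 %iv, 1
  %done = icmp eq i64 %iv.next, 4
  br i1 %done, label %exit, label %loop

exit:
  ret void
}
```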
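And for interleaved-accesses.ll, the affected pattern is a group of accesses to adjacent members at a common stride, e.g. the hypothetical factor-2 load group below; with EVL tail folding these currently vectorize as gathers/scatters rather than segmented (interleaved) accesses until llvm#151665 lands:

```llvm
; Factor-2 interleave group: each iteration loads src[2*i] and src[2*i+1].
; (Hypothetical reduction, not the actual test contents.)
define i32 @sum_pairs(ptr %src, i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %acc = phi i32 [ 0, %entry ], [ %acc.next, %loop ]
  %idx = shl nuw i64 %iv, 1                       ; 2*iv: member 0 of the group
  %gep0 = getelementptr inbounds i32, ptr %src, i64 %idx
  %a = load i32, ptr %gep0, align 4
  %idx1 = add nuw i64 %idx, 1                     ; 2*iv+1: member 1 of the group
  %gep1 = getelementptr inbounds i32, ptr %src, i64 %idx1
  %b = load i32, ptr %gep1, align 4
  %sum = add i32 %a, %b
  %acc.next = add i32 %acc, %sum
  %iv.next = add nuw nsw i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret i32 %acc.next
}
```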
For reference, here are two representative hunks from the test diffs. In the first, from a test with ZVFBFMIN check lines, the plain wide loads and the trailing store become VP intrinsics that take the explicit vector length [[TMP11]]:

```diff
-; ZVFBFMIN-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x bfloat>, ptr [[TMP7]], align 2
-; ZVFBFMIN-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 4 x bfloat>, ptr [[TMP8]], align 2
-; ZVFBFMIN-NEXT: [[WIDE_LOAD2:%.*]] = load <vscale x 4 x float>, ptr [[TMP9]], align 4
+; ZVFBFMIN-NEXT: [[WIDE_LOAD:%.*]] = call <vscale x 4 x bfloat> @llvm.vp.load.nxv4bf16.p0(ptr align 2 [[TMP7]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP11]])
+; ZVFBFMIN-NEXT: [[WIDE_LOAD1:%.*]] = call <vscale x 4 x bfloat> @llvm.vp.load.nxv4bf16.p0(ptr align 2 [[TMP8]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP11]])
+; ZVFBFMIN-NEXT: [[WIDE_LOAD2:%.*]] = call <vscale x 4 x float> @llvm.vp.load.nxv4f32.p0(ptr align 4 [[TMP9]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP11]])
 ; ZVFBFMIN-NEXT: [[TMP13:%.*]] = fpext <vscale x 4 x bfloat> [[WIDE_LOAD]] to <vscale x 4 x float>
 ; ZVFBFMIN-NEXT: [[TMP14:%.*]] = fpext <vscale x 4 x bfloat> [[WIDE_LOAD1]] to <vscale x 4 x float>
 ; ZVFBFMIN-NEXT: [[TMP15:%.*]] = call <vscale x 4 x float> @llvm.fmuladd.nxv4f32(<vscale x 4 x float> [[TMP13]], <vscale x 4 x float> [[TMP14]], <vscale x 4 x float> [[WIDE_LOAD2]])
-; ZVFBFMIN-NEXT: store <vscale x 4 x float> [[TMP15]], ptr [[TMP9]], align 4
```
In the second, from blend-any-of-reduction-cost.ll, the vector body disappears entirely now that the VPlan is no longer considered profitable:

```diff
-; CHECK-NEXT: [[TMP5:%.*]] = mul nuw i64 [[TMP4]], 2
-; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 2 x i1> poison, i1 [[C_1]], i64 0
-; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 2 x i1> [[BROADCAST_SPLATINSERT1]], <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
-; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i1> poison, i1 [[C_0]], i64 0
-; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i1> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP6:%.*]] = xor <vscale x 2 x i1> [[BROADCAST_SPLAT]], splat (i1 true)
-; CHECK-NEXT: [[TMP7:%.*]] = xor <vscale x 2 x i1> [[BROADCAST_SPLAT2]], splat (i1 true)
-; CHECK-NEXT: [[TMP8:%.*]] = select <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i1> [[TMP7]], <vscale x 2 x i1> zeroinitializer
-; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 2 x ptr> poison, ptr [[SRC]], i64 0
-; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 2 x ptr> [[BROADCAST_SPLATINSERT3]], <vscale x 2 x ptr> poison, <vscale x 2 x i32> zeroinitializer
 ...
-; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 2 x i1> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[PREDPHI:%.*]], %[[VECTOR_BODY]] ]
-; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 2 x ptr> @llvm.masked.gather.nxv2p0.nxv2p0(<vscale x 2 x ptr> [[BROADCAST_SPLAT4]], i32 8, <vscale x 2 x i1> [[TMP8]], <vscale x 2 x ptr> poison)
-; CHECK-NEXT: [[TMP9:%.*]] = icmp eq <vscale x 2 x ptr> [[WIDE_MASKED_GATHER]], zeroinitializer
-; CHECK-NEXT: [[TMP10:%.*]] = or <vscale x 2 x i1> [[VEC_PHI]], [[TMP9]]
-; CHECK-NEXT: [[PREDPHI]] = select <vscale x 2 x i1> [[TMP8]], <vscale x 2 x i1> [[TMP10]], <vscale x 2 x i1> [[VEC_PHI]]
```