
[LoopVectorize][AArch64] - Missed Loop Vectorization for Neon target due to high cost of scalarization of sdiv, for mcpu=oryon-1. #170821

@pawan-nirpal-031

Description

Summary

For the IR test case below, the loop vectorizer blocks vectorization of this loop due to the high cost of scalarizing the sdiv operations. Manually enabling vectorization with #pragma improves performance considerably: VF = 8 gives about a 17% improvement over the baseline Neon score, and VF = {2,4} are also profitable relative to the non-vectorized loop. Simply interleaving with IF = {4,8} is about as profitable as vectorizing with VF = 8.
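For reference, a hand-reconstructed C approximation of the reduced IR below (names are illustrative and the outer repeat-8 loop from the IR is omitted); the pragma shown is the standard clang way to force vectorization at a given width:

```cpp
// Hand-reconstructed C approximation of the reduced IR test case below
// (illustrative; the outer repeat-8 loop from the IR is omitted).
// The pragma forces clang's loop vectorizer to use VF = 8.
void quux(const int *a, const int *b, short *c) {
#pragma clang loop vectorize(enable) vectorize_width(8)
  for (long i = 0; i < 64; ++i) {
    int x = a[i], y = b[i];
    int h = x >> 1;               // %ashr
    int r;
    if (y < 0) {
      int s = h - y;              // %sub
      r = (s < x) ? 0 : s / x;    // %sdiv, guarded
      r = -r;                     // %sub13
    } else {
      int t = y + h;              // %add
      r = (t < x) ? 0 : t / x;    // %sdiv17, guarded
    }
    c[i] = (short)r;              // truncate and store as i16
  }
}
```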

Relevant logs:

Cost of 28 for VF 8: profitable to scalarize   %div65 = sdiv i32 %add61, %76
Cost of 28 for VF 8: profitable to scalarize   %div = sdiv i32 %add, %76

In a subsequent experiment I blocked scalarization and forced widening via some cost hacks. According to the VPlan cost model, widening sdiv at VF = 8 is worse than scalarizing it, but bypassing the widening cost and forcing widening is even more profitable, up to 30% over the baseline Neon score. So I studied why widening is currently considered unprofitable by the VPlan cost model; the analysis follows.

Even when widening is preferred by the loop vectorizer, the computed widening cost of sdiv comes out too high, higher than scalarizing the sdivs. I believe this is due to the cost model unfairly inflating the scalarization overhead of the extract/insert element operations.
Here VecInstCost is 2 for an extract, and likewise for an insert. DemandedElts.popcount() is 8, the same as the VF, so the extract overhead is 8 x 1 x 2 = 16 (where 1 is the cost weight for either insert or extract), and the insert overhead is likewise 16. These sum to a scalarization overhead of 32, to which VF x (scalar sdiv cost) = 8 x 1 = 8 is added. Total = 40.
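The arithmetic above can be sketched as follows (the constants are the observed values for VF = 8 on this subtarget; the function name is illustrative, not the exact LLVM API):

```cpp
// Sketch of the legacy scalarization-cost arithmetic described above.
inline int legacyScalarizationCost() {
  const int VF = 8;              // DemandedElts.popcount() == VF
  const int VecInstCost = 2;     // ST->getVectorInsertExtractBaseCost()
  const int ScalarSdivCost = 1;
  // getScalarizationOverhead is effectively evaluated once for extracts
  // and once for inserts:
  int ExtractOverhead = VF * /*Extract=*/1 * VecInstCost; // 16
  int InsertOverhead  = VF * /*Insert=*/1  * VecInstCost; // 16
  int Overhead = ExtractOverhead + InsertOverhead;        // 32
  return Overhead + VF * ScalarSdivCost;                  // 40
}
```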

This cost of 40 is then returned to AArch64TTIImpl::getArithmeticInstrCost for sdiv, which again counts insert/extract overhead: getVectorInstrCost returns 2 for each insert and extract, and the cost goes up to 44.

// AArch64TTIImpl::getArithmeticInstrCost:
// On AArch64, without SVE, vector divisions are expanded
// into scalar divisions of each pair of elements.
Cost += getVectorInstrCost(Instruction::ExtractElement, Ty, CostKind,
                           -1, nullptr, nullptr);
Cost += getVectorInstrCost(Instruction::InsertElement, Ty, CostKind, -1,
                           nullptr, nullptr);

// getScalarizationOverhead:
unsigned VecInstCost =
    CostKind == TTI::TCK_CodeSize ? 1 : ST->getVectorInsertExtractBaseCost();
return DemandedElts.popcount() * (Insert + Extract) * VecInstCost;

Finally, this cost of 44 is doubled: as I understand it, the cost above was for a single sdiv operand, and doubling accounts for both operands. That makes the final widening cost 88 for VF = 8.
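Putting the whole legacy chain together (a sketch with the step values observed above; the function name is illustrative):

```cpp
// Full legacy chain as described: 40 -> 44 -> 88.
inline int legacyWideningCost() {
  int Cost = 40; // scalarization overhead (32) + 8 scalar sdivs
  // getArithmeticInstrCost adds one more extract + insert (2 each):
  Cost += 2 + 2; // -> 44
  // ...and then doubles the cost for the second sdiv operand:
  Cost += Cost;  // -> 88
  return Cost;
}
```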

// TODO: if one of the arguments is scalar, then it's not necessary to
// double the cost of handling the vector elements.
Cost += Cost;

There seem to be multiple issues with this. First, it is too complicated and relies on legacy cost-model computations. I tried to simplify it by computing the scalarization cost of sdiv in TTI as simply 2 * ExtractCost + 1 * InsertCost + ScalarSdivCost * VF, following the more recent getVectorInstrCost costing approach.

Second, costs are double-counted: in the total of 88, the insert cost and the VF x (scalar sdiv cost) term are each counted twice, and on top of that an additional 2 + 2 (insert + extract overhead) is added again even though it was already included in the cost of 40. Simplifying the scalarization cost model of sdiv on the VPlan path, I recomputed the cost as 2 * ExtractCost + 1 * InsertCost + ScalarDivCost * VF, which with getVectorInstrCost's values for ExtractCost and InsertCost comes to 2 * 2 + 1 * 2 + 1 * 8 = 14, making widening profitable.
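The proposed simplified formula works out as follows (constants as above; the function name is illustrative):

```cpp
// Proposed simplified scalarization cost for a widened sdiv:
// 2 * ExtractCost + 1 * InsertCost + ScalarDivCost * VF.
inline int simplifiedSdivCost() {
  const int VF = 8;
  const int ExtractCost = 2, InsertCost = 2; // per getVectorInstrCost
  const int ScalarDivCost = 1;
  return 2 * ExtractCost + 1 * InsertCost + ScalarDivCost * VF; // 14
}
```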

The reason widening is even more profitable than scalarizing the sdivs is that it avoids branching in this particular loop. Scalarization produces a lot of branches, which the widened final assembly avoids; even though the sdivs end up scalar, the benefit comes from the rest of the loop being vectorized, particularly the loads/stores and arithmetic ops like add/sub.

In a nutshell: the loop vectorizer does not take the widening cost into account when deciding whether to widen or scalarize, so it decides to scalarize, and then finds no vector factor profitable. Manually enabling vectorization with #pragma proves that vectorization is beneficial, but the vectorizer still does not consider that widening may be even more beneficial; and once nudged to consider widening, the cost model over-estimates the widening cost and rejects widening as well.

Reduced test case.

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "aarch64-unknown-linux-gnu"

; Function Attrs: nofree norecurse nosync nounwind memory(argmem: readwrite) uwtable
define dso_local void @quux(ptr noundef readonly captures(none) %arg, ptr noundef readonly captures(none) %arg1, ptr noundef writeonly captures(none) %arg2) local_unnamed_addr #0 {
bb:
  br label %bb3

bb3:                                              ; preds = %bb23, %bb
  %phi = phi i32 [ 0, %bb ], [ %add24, %bb23 ]
  br label %bb4

bb4:                                              ; preds = %bb18, %bb3
  %phi5 = phi i64 [ 0, %bb3 ], [ %add21, %bb18 ]
  %getelementptr = getelementptr inbounds nuw i32, ptr %arg, i64 %phi5
  %load = load i32, ptr %getelementptr, align 4, !tbaa !5
  %getelementptr6 = getelementptr inbounds nuw i32, ptr %arg1, i64 %phi5
  %load7 = load i32, ptr %getelementptr6, align 4, !tbaa !5
  %icmp = icmp slt i32 %load7, 0
  %ashr = ashr i32 %load, 1
  br i1 %icmp, label %bb8, label %bb14

bb8:                                              ; preds = %bb4
  %sub = sub nsw i32 %ashr, %load7
  %icmp9 = icmp slt i32 %sub, %load
  br i1 %icmp9, label %bb11, label %bb10

bb10:                                             ; preds = %bb8
  %sdiv = sdiv i32 %sub, %load
  br label %bb11

bb11:                                             ; preds = %bb10, %bb8
  %phi12 = phi i32 [ %sdiv, %bb10 ], [ 0, %bb8 ]
  %sub13 = sub nsw i32 0, %phi12
  br label %bb18

bb14:                                             ; preds = %bb4
  %add = add nsw i32 %load7, %ashr
  %icmp15 = icmp slt i32 %add, %load
  br i1 %icmp15, label %bb18, label %bb16

bb16:                                             ; preds = %bb14
  %sdiv17 = sdiv i32 %add, %load
  br label %bb18

bb18:                                             ; preds = %bb16, %bb14, %bb11
  %phi19 = phi i32 [ %sub13, %bb11 ], [ %sdiv17, %bb16 ], [ 0, %bb14 ]
  %trunc = trunc i32 %phi19 to i16
  %getelementptr20 = getelementptr inbounds nuw i16, ptr %arg2, i64 %phi5
  store i16 %trunc, ptr %getelementptr20, align 2, !tbaa !9
  %add21 = add nuw nsw i64 %phi5, 1
  %icmp22 = icmp eq i64 %add21, 64
  br i1 %icmp22, label %bb23, label %bb4, !llvm.loop !11

bb23:                                             ; preds = %bb18
  %add24 = add nuw nsw i32 %phi, 1
  %icmp25 = icmp eq i32 %add24, 8
  br i1 %icmp25, label %bb26, label %bb3, !llvm.loop !13

bb26:                                             ; preds = %bb23
  ret void
}

attributes #0 = { nofree norecurse nosync nounwind memory(argmem: readwrite) uwtable "frame-pointer"="non-leaf" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="oryon-1" "target-features"="+aes,+bf16,+bti,+ccidx,+complxnum,+crc,+dit,+dotprod,+flagm,+fp-armv8,+i8mm,+jsconv,+lse,+neon,+outline-atomics,+pauth,+perfmon,+predres,+rand,+ras,+rcpc,+rdm,+sb,+sha2,+sha3,+sm4,+spe,+ssbs,+v8.1a,+v8.2a,+v8.3a,+v8.4a,+v8.5a,+v8.6a,+v8a,-fmv" "unsafe-fp-math"="true" }

!llvm.module.flags = !{!0, !1, !2, !3, !4}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 8, !"PIC Level", i32 2}
!2 = !{i32 7, !"PIE Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 2}
!4 = !{i32 7, !"frame-pointer", i32 1}
!5 = !{!6, !6, i64 0}
!6 = !{!"int", !7, i64 0}
!7 = !{!"omnipotent char", !8, i64 0}
!8 = !{!"Simple C/C++ TBAA"}
!9 = !{!10, !10, i64 0}
!10 = !{!"short", !7, i64 0}
!11 = distinct !{!11, !12}
!12 = !{!"llvm.loop.unroll.disable"}
!13 = distinct !{!13, !12}

If there are any questions or anything here needs more clarity, please reply with your questions; looking forward to discussing this further.
