[VPlan] Don't use the legacy cost model for loop conditions #156864
@@ -1151,6 +1151,32 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
    return Ctx.TTI.getIndexedVectorInstrCostFromEnd(Instruction::ExtractElement,
                                                    VecTy, Ctx.CostKind, 0);
  }
  case VPInstruction::BranchOnCount: {
Review thread:
- I think BranchOnCond needs adding too.
- BranchOnCond doesn't cause a compare instruction to be generated, it uses the condition generated by another instruction.
    // If TC <= VF then this is just a branch.
Review thread:
- I'm not sure what this means. Are you saying that we create a vplan for a given VF despite knowing that we will never enter the vector loop? I guess this can happen if TC is exactly equal to VF or we're using tail-folding, but not using the mask for control flow.
- As mentioned in the comment below this, this transformation is happening in simplifyBranchConditionForVFAndUF and means the vector loop is executed exactly once. TC < VF can happen with tail folding, e.g. low_trip_count_fold_tail_scalarized_store in llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll.
    // FIXME: Removing the branch happens in simplifyBranchConditionForVFAndUF
Review thread:
- So you're saying the cost of this branch is based on a prediction about what simplifyBranchConditionForVFAndUF will do? For example, if you pulled this code out of simplifyBranchConditionForVFAndUF [...]. You'd also now be able to remove the [...].
- I'll look into doing that.
- I don't think that's the way to go; we shouldn't duplicate that kind of reasoning here. The TODO should say that the branch should be simplified before we compute the costs. As a workaround for now, catching some cases here should be fine, as this should only mean we may miss some new optimizations, but not make things worse.
- Well, I do think we should be dealing with trip counts that are multiples of vscale here too, since we now support them and we know that [...].
- After looking into this, the problem here is that we don't have access to the ScalarEvolution object here in VPInstruction, so putting code from simplifyBranchConditionForVFAndUF into a function and calling it won't work, as that makes use of ScalarEvolution. Simplifying the branch before we compute the cost seems like a good solution to this.
- OK, no problem. I wasn't sure how easy it would be, but thanks for looking into it! I can follow up with a later PR to get access to the SCEV here.
    // where it checks TC <= VF * UF, but we don't know UF yet. This means in
    // some cases we get a cost that's too high due to counting a cmp that
    // later gets removed.
    // FIXME: The compare could also be removed if TC = M * vscale,
    // VF = N * vscale, and M <= N. Detecting that would require having the
    // trip count as a SCEV though.
    Value *TC = getParent()->getPlan()->getTripCount()->getUnderlyingValue();
    ConstantInt *TCConst = dyn_cast_if_present<ConstantInt>(TC);
    if (TCConst && TCConst->getValue().ule(VF.getKnownMinValue()))
Review thread:
- Can you add a TODO for the case where TC = vscale x M and VF = vscale x N as well? In such cases we should also be able to prove that TC <= VF because it just requires asking if M <= N.
- Will do.
      return 0;
    // Otherwise BranchOnCount generates ICmpEQ followed by a branch.
    Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));
    return Ctx.TTI.getCmpSelInstrCost(Instruction::ICmp, ValTy,
                                      CmpInst::makeCmpResultType(ValTy),
                                      CmpInst::ICMP_EQ, Ctx.CostKind);
Review thread:
- I think you need to add the cost of the branch as well.
- I've deliberately avoided touching branch costs to keep the scope of this work as small as possible.
  }
  case Instruction::FCmp:
Review thread:
- The change below is about more than just the loop conditions. See [...]. I think you either need to: [...] Sorry about this!
- getCostForRecipeWithOpcode already correctly handles vector compares, so I've changed this to use that function.
  case Instruction::ICmp: {
    Type *ValTy = Ctx.Types.inferScalarType(getOperand(0));
    return Ctx.TTI.getCmpSelInstrCost(getOpcode(), ValTy,
                                      CmpInst::makeCmpResultType(ValTy),
                                      getPredicate(), Ctx.CostKind);
  }
  case VPInstruction::ExtractPenultimateElement:
    if (VF == ElementCount::getScalable(1))
      return InstructionCost::getInvalid();
@@ -534,25 +534,47 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
; DEFAULT-LABEL: define void @multiple_exit_conditions(
; DEFAULT-SAME: ptr [[SRC:%.*]], ptr noalias [[DST:%.*]]) #[[ATTR2:[0-9]+]] {
; DEFAULT-NEXT: [[ENTRY:.*:]]
; DEFAULT-NEXT: br label %[[VECTOR_PH:.*]]
; DEFAULT-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP4:%.*]] = shl nuw i64 [[TMP0]], 3
; DEFAULT-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 257, [[TMP4]]
; DEFAULT-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; DEFAULT: [[VECTOR_PH]]:
; DEFAULT-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 2048
; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
; DEFAULT: [[VECTOR_BODY]]:
; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; DEFAULT-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP3:%.*]] = mul nuw i64 [[TMP2]], 8
; DEFAULT-NEXT: [[N_MOD_VF:%.*]] = urem i64 257, [[TMP3]]
; DEFAULT-NEXT: [[INDEX:%.*]] = sub i64 257, [[N_MOD_VF]]
; DEFAULT-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
; DEFAULT-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX]]
; DEFAULT-NEXT: [[TMP6:%.*]] = mul i64 [[INDEX]], 2
; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
; DEFAULT: [[VECTOR_BODY]]:
; DEFAULT-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; DEFAULT-NEXT: [[OFFSET_IDX1:%.*]] = mul i64 [[INDEX1]], 8
; DEFAULT-NEXT: [[NEXT_GEP1:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX1]]
; DEFAULT-NEXT: [[TMP1:%.*]] = load i16, ptr [[SRC]], align 2
; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i16> poison, i16 [[TMP1]], i64 0
; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i16> [[BROADCAST_SPLATINSERT]], <8 x i16> poison, <8 x i32> zeroinitializer
; DEFAULT-NEXT: [[TMP2:%.*]] = or <8 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
; DEFAULT-NEXT: [[TMP3:%.*]] = uitofp <8 x i16> [[TMP2]] to <8 x double>
; DEFAULT-NEXT: store <8 x double> [[TMP3]], ptr [[NEXT_GEP]], align 8
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; DEFAULT-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
; DEFAULT-NEXT: br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP23:![0-9]+]]
; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i16> poison, i16 [[TMP1]], i64 0
; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer
; DEFAULT-NEXT: [[TMP11:%.*]] = or <vscale x 2 x i16> [[BROADCAST_SPLAT]], splat (i16 1)
; DEFAULT-NEXT: [[TMP15:%.*]] = uitofp <vscale x 2 x i16> [[TMP11]] to <vscale x 2 x double>
; DEFAULT-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP17:%.*]] = shl nuw i64 [[TMP16]], 1
; DEFAULT-NEXT: [[TMP18:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP17]]
; DEFAULT-NEXT: [[TMP19:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP20:%.*]] = shl nuw i64 [[TMP19]], 2
; DEFAULT-NEXT: [[TMP21:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP20]]
; DEFAULT-NEXT: [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
; DEFAULT-NEXT: [[TMP23:%.*]] = mul nuw i64 [[TMP22]], 6
; DEFAULT-NEXT: [[TMP24:%.*]] = getelementptr double, ptr [[NEXT_GEP1]], i64 [[TMP23]]
; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[NEXT_GEP1]], align 8
; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP18]], align 8
; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP21]], align 8
; DEFAULT-NEXT: store <vscale x 2 x double> [[TMP15]], ptr [[TMP24]], align 8
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX1]], [[TMP3]]
; DEFAULT-NEXT: [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[INDEX]]
; DEFAULT-NEXT: br i1 [[TMP25]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP23:![0-9]+]]
; DEFAULT: [[MIDDLE_BLOCK]]:
; DEFAULT-NEXT: br label %[[SCALAR_PH:.*]]
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 257, [[INDEX]]
; DEFAULT-NEXT: br i1 [[CMP_N]], [[EXIT:label %.*]], label %[[SCALAR_PH]]
; DEFAULT: [[SCALAR_PH]]:
;
; PRED-LABEL: define void @multiple_exit_conditions(
@@ -666,10 +688,10 @@ define void @low_trip_count_fold_tail_scalarized_store(ptr %dst) {
; COMMON-NEXT: store i8 7, ptr [[TMP7]], align 1
; COMMON-NEXT: br label %[[EXIT1]]
; COMMON: [[EXIT1]]:
; COMMON-NEXT: br label %[[SCALAR_PH1:.*]]
; COMMON: [[SCALAR_PH1]]:
; COMMON-NEXT: br label %[[SCALAR_PH:.*]]
; COMMON: [[SCALAR_PH]]:
; COMMON-NEXT: br [[EXIT:label %.*]]
; COMMON: [[SCALAR_PH:.*:]]
; COMMON: [[SCALAR_PH1:.*:]]
;
entry:
br label %loop
@@ -1302,7 +1324,7 @@ define void @pred_udiv_select_cost(ptr %A, ptr %B, ptr %C, i64 %n, i8 %y) #1 {
; PRED-NEXT: br label %[[VECTOR_MEMCHECK:.*]]
; PRED: [[VECTOR_MEMCHECK]]:
; PRED-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
; PRED-NEXT: [[TMP2:%.*]] = mul nuw i64 [[TMP1]], 16
; PRED-NEXT: [[TMP2:%.*]] = mul nuw i64 [[TMP1]], 4
; PRED-NEXT: [[TMP3:%.*]] = sub i64 [[C1]], [[A2]]
; PRED-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP3]], [[TMP2]]
; PRED-NEXT: [[TMP4:%.*]] = sub i64 [[C1]], [[B3]]
@@ -1311,42 +1333,42 @@ define void @pred_udiv_select_cost(ptr %A, ptr %B, ptr %C, i64 %n, i8 %y) #1 {
; PRED-NEXT: br i1 [[CONFLICT_RDX]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; PRED: [[VECTOR_PH]]:
; PRED-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
; PRED-NEXT: [[TMP6:%.*]] = mul nuw i64 [[TMP5]], 16
; PRED-NEXT: [[TMP6:%.*]] = mul nuw i64 [[TMP5]], 4
; PRED-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
; PRED-NEXT: [[TMP8:%.*]] = shl nuw i64 [[TMP7]], 4
; PRED-NEXT: [[TMP8:%.*]] = shl nuw i64 [[TMP7]], 2
; PRED-NEXT: [[TMP9:%.*]] = sub i64 [[TMP0]], [[TMP8]]
; PRED-NEXT: [[TMP10:%.*]] = icmp ugt i64 [[TMP0]], [[TMP8]]
; PRED-NEXT: [[TMP11:%.*]] = select i1 [[TMP10]], i64 [[TMP9]], i64 0
; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[TMP0]])
; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 16 x i8> poison, i8 [[Y]], i64 0
; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 16 x i8> [[BROADCAST_SPLATINSERT]], <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP0]])
Review comment:
- Looks like the reason this test has changed behaviour is that we were previously adding on the cost of the original scalar exit condition (an icmp) when in reality the vplan doesn't have one. Nice! I think it's the same conclusion you came to with the induction-costs-sve.ll test below.
; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i8> poison, i8 [[Y]], i64 0
; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i8> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer
; PRED-NEXT: br label %[[VECTOR_BODY:.*]]
; PRED: [[VECTOR_BODY]]:
; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
; PRED-NEXT: [[TMP12:%.*]] = getelementptr i8, ptr [[A]], i64 [[INDEX]]
; PRED-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
; PRED-NEXT: [[TMP13:%.*]] = uitofp <vscale x 16 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 16 x float>
; PRED-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP12]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
; PRED-NEXT: [[TMP13:%.*]] = uitofp <vscale x 4 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 4 x float>
; PRED-NEXT: [[TMP14:%.*]] = getelementptr i8, ptr [[B]], i64 [[INDEX]]
; PRED-NEXT: [[WIDE_MASKED_LOAD5:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP14]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
; PRED-NEXT: [[TMP15:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_MASKED_LOAD5]], zeroinitializer
; PRED-NEXT: [[TMP16:%.*]] = select <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i1> [[TMP15]], <vscale x 16 x i1> zeroinitializer
; PRED-NEXT: [[TMP17:%.*]] = xor <vscale x 16 x i8> [[WIDE_MASKED_LOAD]], splat (i8 1)
; PRED-NEXT: [[TMP18:%.*]] = select <vscale x 16 x i1> [[TMP16]], <vscale x 16 x i8> [[BROADCAST_SPLAT]], <vscale x 16 x i8> splat (i8 1)
; PRED-NEXT: [[TMP19:%.*]] = udiv <vscale x 16 x i8> [[TMP17]], [[TMP18]]
; PRED-NEXT: [[TMP20:%.*]] = icmp ugt <vscale x 16 x i8> [[TMP19]], splat (i8 1)
; PRED-NEXT: [[TMP21:%.*]] = select <vscale x 16 x i1> [[TMP20]], <vscale x 16 x i32> zeroinitializer, <vscale x 16 x i32> splat (i32 255)
; PRED-NEXT: [[PREDPHI:%.*]] = select <vscale x 16 x i1> [[TMP15]], <vscale x 16 x i32> [[TMP21]], <vscale x 16 x i32> zeroinitializer
; PRED-NEXT: [[TMP22:%.*]] = zext <vscale x 16 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 16 x i32>
; PRED-NEXT: [[TMP23:%.*]] = sub <vscale x 16 x i32> [[PREDPHI]], [[TMP22]]
; PRED-NEXT: [[TMP24:%.*]] = sitofp <vscale x 16 x i32> [[TMP23]] to <vscale x 16 x float>
; PRED-NEXT: [[TMP25:%.*]] = call <vscale x 16 x float> @llvm.fmuladd.nxv16f32(<vscale x 16 x float> [[TMP24]], <vscale x 16 x float> splat (float 3.000000e+00), <vscale x 16 x float> [[TMP13]])
; PRED-NEXT: [[TMP26:%.*]] = fptoui <vscale x 16 x float> [[TMP25]] to <vscale x 16 x i8>
; PRED-NEXT: [[WIDE_MASKED_LOAD5:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP14]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
; PRED-NEXT: [[TMP15:%.*]] = icmp ne <vscale x 4 x i8> [[WIDE_MASKED_LOAD5]], zeroinitializer
; PRED-NEXT: [[TMP16:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i1> [[TMP15]], <vscale x 4 x i1> zeroinitializer
; PRED-NEXT: [[TMP17:%.*]] = xor <vscale x 4 x i8> [[WIDE_MASKED_LOAD]], splat (i8 1)
; PRED-NEXT: [[TMP18:%.*]] = select <vscale x 4 x i1> [[TMP16]], <vscale x 4 x i8> [[BROADCAST_SPLAT]], <vscale x 4 x i8> splat (i8 1)
; PRED-NEXT: [[TMP19:%.*]] = udiv <vscale x 4 x i8> [[TMP17]], [[TMP18]]
; PRED-NEXT: [[TMP20:%.*]] = icmp ugt <vscale x 4 x i8> [[TMP19]], splat (i8 1)
; PRED-NEXT: [[TMP21:%.*]] = select <vscale x 4 x i1> [[TMP20]], <vscale x 4 x i32> zeroinitializer, <vscale x 4 x i32> splat (i32 255)
; PRED-NEXT: [[PREDPHI:%.*]] = select <vscale x 4 x i1> [[TMP15]], <vscale x 4 x i32> [[TMP21]], <vscale x 4 x i32> zeroinitializer
; PRED-NEXT: [[TMP22:%.*]] = zext <vscale x 4 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 4 x i32>
; PRED-NEXT: [[TMP23:%.*]] = sub <vscale x 4 x i32> [[PREDPHI]], [[TMP22]]
; PRED-NEXT: [[TMP24:%.*]] = sitofp <vscale x 4 x i32> [[TMP23]] to <vscale x 4 x float>
; PRED-NEXT: [[TMP25:%.*]] = call <vscale x 4 x float> @llvm.fmuladd.nxv4f32(<vscale x 4 x float> [[TMP24]], <vscale x 4 x float> splat (float 3.000000e+00), <vscale x 4 x float> [[TMP13]])
; PRED-NEXT: [[TMP26:%.*]] = fptoui <vscale x 4 x float> [[TMP25]] to <vscale x 4 x i8>
; PRED-NEXT: [[TMP27:%.*]] = getelementptr i8, ptr [[C]], i64 [[INDEX]]
; PRED-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP26]], ptr [[TMP27]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
; PRED-NEXT: call void @llvm.masked.store.nxv4i8.p0(<vscale x 4 x i8> [[TMP26]], ptr [[TMP27]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX]], i64 [[TMP11]])
; PRED-NEXT: [[TMP28:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX]], i64 [[TMP11]])
; PRED-NEXT: [[TMP28:%.*]] = extractelement <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; PRED-NEXT: [[TMP29:%.*]] = xor i1 [[TMP28]], true
; PRED-NEXT: br i1 [[TMP29]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
; PRED: [[MIDDLE_BLOCK]]:
Review thread:
- vp_depth_first_deep will also leave the region and visit its successors, so we will also count the compare in the middle block, and almost always overcount the compares in the VPlan. It probably needs to check whether we have left the region. I think it will also always disable the check if we have loops controlled by an active lane mask?
- I've changed this to ignore blocks outside of the vector loop region.
- I'm not sure what you're asking here. When we have a loop that's using an active lane mask, the vplan will have something like [...] (example taken from llvm/test/Transforms/LoopVectorize/AArch64/sve-wide-lane-mask.ll), while the legacy cost model will have [...]. planContainsDifferentCompares would count one compare in the legacy cost model, no compares in the vplan, and return true.
- Yep, what I was wondering was whether we could exclude plans with ActiveLaneMask-terminated exiting blocks from the carve-out, to preserve the original check for those?
- Sorry, it's still not clear what you're asking. Do you mean that planContainsDifferentCompares should return false for plans that contain ActiveLaneMask-terminated exiting blocks, so that the assert in computeBestVF that calls planContainsDifferentCompares will do the BestFactor.Width == LegacyVF.Width check? If so then possibly we could, though I don't think it would make a difference, as I haven't found an example where planContainsAdditionalSimplifications doesn't also return true (which it will, because the cmp in the legacy cost model doesn't correspond to anything in the vplan).
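To make the region-scoping point concrete, here is a minimal sketch of counting compares only inside the vector loop region, so that a compare in the middle block is not included. It assumes the existing VPlan traversal helper vp_depth_first_deep; the helper name is hypothetical, and it is deliberately simplified (it only looks at VPInstruction compares, whereas a complete check would also need to consider widened compare recipes).

```cpp
// Intended to live alongside the VPlan sources in llvm/lib/Transforms/Vectorize.
#include "VPlan.h"
#include "VPlanCFG.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Hypothetical helper: count ICmp/FCmp VPInstructions inside the vector loop
// region only. Starting the deep depth-first walk at the region's entry keeps
// the traversal within the region, so the middle block is never visited.
static unsigned countInLoopCompares(VPlan &Plan) {
  unsigned NumCmps = 0;
  VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
  for (VPBlockBase *Block : vp_depth_first_deep(LoopRegion->getEntry()))
    if (auto *VPBB = dyn_cast<VPBasicBlock>(Block))
      for (VPRecipeBase &R : *VPBB)
        if (auto *VPI = dyn_cast<VPInstruction>(&R))
          if (VPI->getOpcode() == Instruction::ICmp ||
              VPI->getOpcode() == Instruction::FCmp)
            ++NumCmps;
  return NumCmps;
}
```

Walking from the region entry rather than the plan entry is what keeps blocks outside the vector loop region, such as the middle block, out of the count.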