32 changes: 32 additions & 0 deletions llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -26097,6 +26097,17 @@ static SDValue performSetCCPunpkCombine(SDNode *N, SelectionDAG &DAG) {
  return SDValue();
}

static bool isSignExtInReg(const SDValue &V) {
  // Match the expansion of sign_extend_inreg from an i1 element type:
  // vashr(vshl(x, BitWidth - 1), BitWidth - 1).
  if (V.getOpcode() != AArch64ISD::VASHR ||
Contributor

This feels quite late in the pipeline if we're relying upon AArch64 ISD nodes.

When lowering ctz_v16i1 I see this in the debug output:

Type-legalized selection DAG: %bb.0 'ctz_v16i1:'
SelectionDAG has 19 nodes:
  t0: ch,glue = EntryToken
          t12: nxv16i1 = AArch64ISD::PTRUE TargetConstant:i32<9>
              t2: v16i8,ch = CopyFromReg t0, Register:v16i8 %0
            t23: v16i8 = sign_extend_inreg t2, ValueType:ch:v16i1
          t15: nxv16i8 = insert_subvector undef:nxv16i8, t23, Constant:i64<0>
          t17: nxv16i8 = splat_vector Constant:i32<0>
        t19: nxv16i1 = AArch64ISD::SETCC_MERGE_ZERO t12, t15, t17, setne:ch
      t20: i64 = AArch64ISD::CTTZ_ELTS t19
    t21: i32 = truncate t20
  t8: ch,glue = CopyToReg t0, Register:i32 $w0, t21
  t9: ch = AArch64ISD::RET_GLUE t8, Register:i32 $w0, t8:1

and there is a run of DAGCombiner immediately afterwards, which suggests that you can do this optimisation earlier and look for the SIGN_EXTEND_INREG node instead. In theory you should be able to make the codegen even better, essentially by doing:

  //    setcc_merge_zero(
  //       pred, insert_subvector(undef, signext_inreg(vNi1 x), 0), != splat(0))
  // => setcc_merge_zero(
  //       pred, insert_subvector(undef, x, 0), != splat(0))

That way, I think, you can also get rid of the remaining shl instruction, which is unnecessary.
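
Roughly what I have in mind, as a hypothetical sketch only (untested, reusing the Cond/LHS/RHS/Pred locals from performSetccMergeZeroCombine, and assuming the promoted i1 elements are well-defined):

// Hypothetical sketch: match the generic sign_extend_inreg node before
// it is expanded into the VSHL/VASHR pair.
if (Cond == ISD::SETNE && isZerosVector(RHS.getNode()) &&
    LHS.getOpcode() == ISD::INSERT_SUBVECTOR && LHS.getOperand(0).isUndef() &&
    isNullConstant(LHS.getOperand(2)) &&
    LHS.getOperand(1).getOpcode() == ISD::SIGN_EXTEND_INREG) {
  SDValue Ext = LHS.getOperand(1);
  if (cast<VTSDNode>(Ext.getOperand(1))->getVT().getVectorElementType() ==
      MVT::i1) {
    // Drop the extend: comparing against zero gives the same result for x
    // as for signext_inreg(x), provided the elements are well-defined.
    SDValue NewLHS =
        DAG.getNode(ISD::INSERT_SUBVECTOR, SDLoc(N), LHS.getValueType(),
                    LHS.getOperand(0), Ext.getOperand(0), LHS.getOperand(2));
    return DAG.getNode(AArch64ISD::SETCC_MERGE_ZERO, SDLoc(N),
                       N->getValueType(0), Pred, NewLHS, RHS,
                       N->getOperand(3));
  }
}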

Contributor Author

So my original approach was to match the SIGN_EXTEND_INREG itself, rather than the expansion, as you suggest. The issue here seems to be that for some/most of the cases this is not sufficient to trigger the transformation. In the @llvm.experimental.cttz.elts tests this works as you show above, but for example in the @llvm.masked.load cases we expand the sign_extend_inreg node BEFORE we expand the masked_load node, so the pattern we're trying to match doesn't exist:

Vector/type-legalized selection DAG: %bb.0 'do_masked_load:'
SelectionDAG has 15 nodes:
  t0: ch,glue = EntryToken
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1
        t21: v16i8 = AArch64ISD::VSHL t4, Constant:i32<7>
      t22: v16i8 = AArch64ISD::VASHR t21, Constant:i32<7>
      t7: v16i8 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
    t9: v16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t22, t7
  t11: ch,glue = CopyToReg t0, Register:v16i8 $q0, t9
  t12: ch = AArch64ISD::RET_GLUE t11, Register:v16i8 $q0, t11:1

It seemed like matching the expansion of the SIGN_EXTEND_INREG made the most sense to catch all the cases, but maybe there's an alternative approach I'm not seeing that would still allow us to match the SIGN_EXTEND_INREG?

Contributor

OK I see, I'll have a think about it some more then. It just feels like the sort of thing we ought to be fixing earlier on using generic ISD nodes. In reality the predicate could come from two different sources:

  1. If this is a tail-folded loop then the predicate will be a PHI and so it will be a similar lowering problem to passing as a register argument to a function.
  2. In the loop we have an fcmp or icmp, which is used as the input for the masked load.

Ideally we'd be able to handle both.

Contributor

I'm hoping that we can encourage all paths into a canonical form that requires a single DAG combine. Your PR may effectively be doing that, just at the very last DAG combiner pass.

Contributor

OK I think I understand this a bit more now ... If I lower these two functions:

define <16 x i8> @masked_load_v16i8(ptr %src, <16 x i1> %mask) {
  %load = call <16 x i8> @llvm.masked.load.v16i8(ptr %src, i32 8, <16 x i1> %mask, <16 x i8> zeroinitializer)
  ret <16 x i8> %load
}

define <16 x i8> @masked_load_v16i8_2(ptr %src, <16 x i8> %mask) {
  %icmp = icmp ugt <16 x i8> %mask, splat (i8 3)
  %load = call <16 x i8> @llvm.masked.load.v16i8(ptr %src, i32 8, <16 x i1> %icmp, <16 x i8> zeroinitializer)
  ret <16 x i8> %load
}

we actually end up with decent codegen for masked_load_v16i8_2:

masked_load_v16i8:
	shl	v0.16b, v0.16b, #7
	ptrue	p0.b, vl16
	cmlt	v0.16b, v0.16b, #0
	cmpne	p0.b, p0/z, z0.b, #0
	ld1b	{ z0.b }, p0/z, [x0]
	ret

masked_load_v16i8_2:
	movi	v1.16b, #3
	ptrue	p0.b, vl16
	cmphi	p0.b, p0/z, z0.b, z1.b
	ld1b	{ z0.b }, p0/z, [x0]
	ret

so the problem is purely limited to the case where the predicate is an unknown live-in for the block. I see what you mean about the ordering of lowering for masked_load_v16i8, i.e. we first see

Type-legalized selection DAG: %bb.0 'masked_load_v16i8:'
SelectionDAG has 14 nodes:
  t0: ch,glue = EntryToken
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1
      t16: v16i8 = sign_extend_inreg t4, ValueType:ch:v16i1
      t7: v16i8 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, $
    t9: v16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t16, t7
  t11: ch,glue = CopyToReg t0, Register:v16i8 $q0, t9
  t12: ch = AArch64ISD::RET_GLUE t11, Register:v16i8 $q0, t11:1

...

Vector-legalized selection DAG: %bb.0 'masked_load_v16i8:'
SelectionDAG has 15 nodes:
  t0: ch,glue = EntryToken
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1
        t21: v16i8 = AArch64ISD::VSHL t4, Constant:i32<7>
      t22: v16i8 = AArch64ISD::VASHR t21, Constant:i32<7>
      t7: v16i8 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, $
    t9: v16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t22, t7
  t11: ch,glue = CopyToReg t0, Register:v16i8 $q0, t9
  t12: ch = AArch64ISD::RET_GLUE t11, Register:v16i8 $q0, t11:1

then

Legalized selection DAG: %bb.0 'masked_load_v16i8:'
SelectionDAG has 23 nodes:
  t0: ch,glue = EntryToken
        t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t24: nxv16i1 = AArch64ISD::PTRUE TargetConstant:i32<9>
                t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1
              t21: v16i8 = AArch64ISD::VSHL t4, Constant:i32<7>
            t22: v16i8 = AArch64ISD::VASHR t21, Constant:i32<7>
          t27: nxv16i8 = insert_subvector undef:nxv16i8, t22, Constant:i64<0>
        t30: nxv16i1 = AArch64ISD::SETCC_MERGE_ZERO t24, t27, t28, setne:ch
      t31: nxv16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t30, t28
    t32: v16i8 = extract_subvector t31, Constant:i64<0>
  t11: ch,glue = CopyToReg t0, Register:v16i8 $q0, t32
  t28: nxv16i8 = splat_vector Constant:i32<0>
  t12: ch = AArch64ISD::RET_GLUE t11, Register:v16i8 $q0, t11:1

It feels like a shame we're expanding the sign_extend_inreg so early on. I wonder if a cleaner solution is to fold `t16: v16i8 = sign_extend_inreg t4, ValueType:ch:v16i1` and `t9: v16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t16, t7` into this:

`t9: v16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t4, t7`

That would remove the extends completely and hopefully lead to better codegen too, since it will also remove the VSHL. Can we do this in the DAG combine phase that runs after `Type-legalized selection DAG: %bb.0 'masked_load_v16i8:'`? What do you think?
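
Purely as a hypothetical sketch (untested, and assuming the unextended i1 mask elements are well-defined), the fold could look something like:

// Hypothetical: masked_load(..., signext_inreg(x, vNi1), ...)
//            -> masked_load(..., x, ...)
static SDValue performMaskedLoadMaskCombine(SDNode *N, SelectionDAG &DAG) {
  auto *MLD = cast<MaskedLoadSDNode>(N);
  SDValue Mask = MLD->getMask();
  if (Mask.getOpcode() != ISD::SIGN_EXTEND_INREG)
    return SDValue();
  EVT ExtVT = cast<VTSDNode>(Mask.getOperand(1))->getVT();
  if (ExtVT.getVectorElementType() != MVT::i1)
    return SDValue();
  // Rebuild the masked load with the unextended mask.
  return DAG.getMaskedLoad(
      MLD->getValueType(0), SDLoc(N), MLD->getChain(), MLD->getBasePtr(),
      MLD->getOffset(), Mask.getOperand(0), MLD->getPassThru(),
      MLD->getMemoryVT(), MLD->getMemOperand(), MLD->getAddressingMode(),
      MLD->getExtensionType(), MLD->isExpandingLoad());
}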

Contributor Author

Ah yes that seems like a neat solution!

This will leave us not catching the CTTZ_ELTS case, but that can be handled separately - I think we'd still need something along the lines of

  //    setcc_merge_zero(
  //       pred, insert_subvector(undef, signext_inreg(vNi1 x), 0), != splat(0))
  // => setcc_merge_zero(
  //       pred, insert_subvector(undef, x, 0), != splat(0))

for that one. Unless we did something when we expand the cttz.elts intrinsic, perhaps, although I'm not sure how feasible that is.

Contributor

Please ignore my previous comments as I'd completely forgotten that for types such as v4i1 the upper bits can be undefined. The reason for the sign_extend_inreg is to ensure the promoted type v4i8 is well-defined. In this case you can't remove the sign_extend_inreg, so I think the approach you have here is probably the most viable. However, one minor suggestion - instead of simply removing the VASHR and leaving the VSHL, would it be better to replace both nodes with a simple vector AND of 0x1? We only care about:

  1. Ensuring the elements are well-defined (both VSHL and AND achieve this goal), and
  2. The elements are non-zero.

I think on a few cores the vector AND has a higher throughput than SHL.
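
Concretely, in the combine below that would mean building something like this instead of reusing the VSHL (hypothetical sketch; L0/L1/L2/LHS are the locals from this PR's combine):

SDLoc DL(N);
EVT VT = L1.getValueType();
// L1 is the matched VASHR and L1.getOperand(0) the VSHL, so this is the
// original vNi8 value.
SDValue X = L1.getOperand(0).getOperand(0);
// AND with a splat of 1: every element becomes a well-defined 0 or 1, and
// is non-zero exactly where the i1 lane is set.
SDValue And = DAG.getNode(ISD::AND, DL, VT, X, DAG.getConstant(1, DL, VT));
SDValue NewLHS =
    DAG.getNode(ISD::INSERT_SUBVECTOR, DL, LHS.getValueType(), L0, And, L2);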

Contributor Author

Oh yes, this is a nice approach, and I can see from the SWOG that the throughput is indeed better for AND on e.g. Neoverse V2.

My one thought is: do we care about paying the cost of the additional MOVI #1 for the higher-throughput AND? I think that on newer cores, e.g. Neoverse V3, the throughput of SHL is equal to that of AND, so maybe there we would favor just the SHL? Obviously there are more Neoverse V2-class cores in the wild at present, so I can see it would make sense to favor the AND - I'm just flagging this up!

      V.getOperand(0).getOpcode() != AArch64ISD::VSHL)
    return false;

  // Both shift amounts must be BitWidth - 1, i.e. the shifts move the low
  // (i1) bit up to the sign bit and back down again.
  unsigned BitWidth = V->getValueType(0).getScalarSizeInBits();
  unsigned ShiftAmtR = V.getConstantOperandVal(1);
  unsigned ShiftAmtL = V.getOperand(0).getConstantOperandVal(1);
  return (ShiftAmtR == ShiftAmtL && ShiftAmtR == (BitWidth - 1));
}

static SDValue
performSetccMergeZeroCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI) {
  assert(N->getOpcode() == AArch64ISD::SETCC_MERGE_ZERO &&
@@ -26137,6 +26148,27 @@ performSetccMergeZeroCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI) {
                       LHS->getOperand(0), Pred);
  }

  // setcc_merge_zero(
  //    pred, insert_subvector(undef, signext_inreg(vNi1), 0), != splat(0))
  // => setcc_merge_zero(
  //    pred, insert_subvector(undef, shl(vNi1), 0), != splat(0))
  if (Cond == ISD::SETNE && isZerosVector(RHS.getNode()) &&
      LHS->getOpcode() == ISD::INSERT_SUBVECTOR && LHS.hasOneUse()) {
    SDValue L0 = LHS->getOperand(0);
    SDValue L1 = LHS->getOperand(1);
    SDValue L2 = LHS->getOperand(2);

    if (L0.getOpcode() == ISD::UNDEF && isNullConstant(L2) &&
        isSignExtInReg(L1)) {
      SDLoc DL(N);
      // Keep only the VSHL: it alone leaves the elements well-defined and
      // non-zero exactly where the i1 lanes are set, so the VASHR can go.
      SDValue Shl = L1.getOperand(0);
      SDValue NewLHS = DAG.getNode(ISD::INSERT_SUBVECTOR, DL,
                                   LHS.getValueType(), L0, Shl, L2);
      return DAG.getNode(AArch64ISD::SETCC_MERGE_ZERO, DL, N->getValueType(0),
                         Pred, NewLHS, RHS, N->getOperand(3));
    }
  }

  return SDValue();
}
