[PowerPC] fold i128 equality/inequality compares of two loads into a vectorized compare using vcmpequb.p when Altivec is available #158657
Conversation
@llvm/pr-subscribers-backend-powerpc

Author: zhijian lin (diggerlin)

Changes

For example, when we compile this:
if (Subtarget.hasVSX()) {
  if (LHS.getOpcode() == ISD::LOAD && RHS.getOpcode() == ISD::LOAD &&
      LHS.hasOneUse() && RHS.hasOneUse() &&
      LHS.getValueType() == MVT::i128 && RHS.getValueType() == MVT::i128) {
why the type restriction?
In the expand-memcmp pass,
%bcmp = tail call i32 @bcmp(ptr noundef nonnull dereferenceable(16) %a, ptr noundef nonnull dereferenceable(16) %b, i64 16)
%cmp = icmp eq i32 %bcmp, 0
%conv = zext i1 %cmp to i32
ret i32 %conv
is changed to
%0 = load i128, ptr %a, align 1
%1 = load i128, ptr %b, align 1
%2 = icmp ne i128 %0, %1
%3 = zext i1 %2 to i32
%cmp = icmp eq i32 %3, 0
%conv = zext i1 %cmp to i32
ret i32 %conv
but in the original code, the load i128, ptr %a, align 1 is lowered to
t27: i64,ch = load<(load (s64) from %ir.a, align 1)> t0, t2, undef:i64
t32: i64,ch = load<(load (s64) from %ir.b, align 1)> t0, t4, undef:i64
This is not efficient: two ld instructions in 64-bit mode or four lwz instructions in 32-bit mode. We want the i128 loads to be lowered to vector loads, hence the type restriction.
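For context, a minimal sketch (pieced together from the snippets quoted later in this review; variable names follow the patch) of how the i128 loads are re-typed as v16i8 so each lowers to a single vector load:

// Reinterpret the two one-use i128 loads as v16i8 vector loads, reusing the
// original chain, base pointer, and memory operand.
SDValue LHSVec = DAG.getLoad(MVT::v16i8, DL, LA->getChain(), LA->getBasePtr(),
                             LA->getMemOperand());
SDValue RHSVec = DAG.getLoad(MVT::v16i8, DL, LB->getChain(), LB->getBasePtr(),
                             LB->getMemOperand());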
auto *LA = dyn_cast<LoadSDNode>(LHS);
auto *LB = dyn_cast<LoadSDNode>(RHS);
if (!LA || !LB)
  return SDValue();
Shouldn't all conditions that are not met for this optimization result in the default behaviour of this function on line 15618 below?
Good catch, I will fix it, thanks. I thought that ISD::SETCC only returns i1, and that PPCTargetLowering::DAGCombineTruncBoolExt only deals with i32 and i64, so I returned SDValue() here directly. But ISD::SETCC may return i32/i64 too.
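A minimal sketch of the fix discussed here, assuming combineSetCC's existing structure: fall back to the default combine instead of bailing out with an empty SDValue.

auto *LA = dyn_cast<LoadSDNode>(LHS);
auto *LB = dyn_cast<LoadSDNode>(RHS);
if (!LA || !LB)
  // Not the pattern we handle; let the default setcc combine run.
  return DAGCombineTruncBoolExt(N, DCI);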
SDValue Add = DAG.getNode(ISD::ADD, DL, OpVT, LHS, RHS.getOperand(1));
return DAG.getSetCC(DL, VT, Add, DAG.getConstant(0, DL, OpVT), CC);
}

if (Subtarget.hasVSX()) {
Please add documentation describing the type of optimization being done in this block.
SDValue Ops[] = {IntrID, CRSel, LHSVec, RHSVec};
SDValue PredResult =
    DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32, Ops);
Ops[] is not needed, just inline it into the call?
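A sketch of the suggested inlining (SelectionDAG::getNode takes an ArrayRef of operands, so a braced list works):

SDValue PredResult = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32,
                                 {IntrID, CRSel, LHSVec, RHSVec});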
Gentle ping.
The PR description should probably say that the case optimized is not just known length but EQ/NE only. Also that it extends the existing handling from 64 bits up to 128 bits.
entry:
  %call = tail call signext i32 @memcmp(ptr %buffer1, ptr %buffer2, i64 65)
  %call = tail call signext i32 @memcmp(ptr %buffer1, ptr %buffer2, i64 165)
Wouldn't 129 be the equivalent?
Yes, I think I can change it to 129.
// Optimization: Fold i128 equality/inequality compares of two loads into a
// vectorized compare using vcmpequb.p when VSX is available.
Maybe say something about this catching inline-expanded memcmp.
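For example, the comment could mention the memcmp origin (a sketch incorporating this suggestion; wording is illustrative):

// Fold i128 equality/inequality compares of two one-use loads, as produced
// when ExpandMemCmp inlines memcmp(a, b, 16), into a single vcmpequb.
// vector compare when Altivec is available.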
// as v16i8 vectors and use the Altivec/VSX vcmpequb.p instruction to
// perform a full 128-bit equality check in a single vector compare.

if (Subtarget.hasVSX()) {
vcmpequb is an Altivec vector instruction, not VSX.
// as written to ensure correct program behavior, so we return an empty
// SDValue to indicate no action.
if (LA->isVolatile() || LB->isVolatile())
  return DAGCombineTruncBoolExt(N, DCI);
This return pattern is going to make it hard to further modify the combineSetCC function. I think this code should be outlined to a separate function.
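A sketch of the suggested outlining; the helper name and signature are illustrative, not from the patch:

// Try to fold (setcc (load i128), (load i128), eq/ne) into a vcmpequb.
// predicate compare. Returns an empty SDValue if the pattern does not match.
static SDValue combineI128SetCCToVCmpEQUB(SDNode *N, SelectionDAG &DAG,
                                          const PPCSubtarget &Subtarget);

// In combineSetCC:
if (SDValue V = combineI128SetCCToVCmpEQUB(N, DCI.DAG, Subtarget))
  return V;
return DAGCombineTruncBoolExt(N, DCI);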
                     LB->getBasePtr(), LB->getMemOperand());

SDValue IntrID =
    DAG.getTargetConstant(Intrinsic::ppc_altivec_vcmpequb_p, DL,
Can just use getConstant.
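A sketch of the suggestion, with the same operands as in the patch:

SDValue IntrID =
    DAG.getConstant(Intrinsic::ppc_altivec_vcmpequb_p, DL, MVT::i32);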
; CHECK-NEXT: mfocrf 3, 2
; CHECK-NEXT: rlwinm 3, 3, 25, 31, 31
; CHECK-NEXT: cntlzw 3, 3
; CHECK-NEXT: srwi 3, 3, 5
Extra instruction? I think isolating and flipping the bit can just be rlwinm/xori.
In the patch, we just make
#include <memory.h>
int cmp16(const void *a, const void *b)
{
return memcmp(a, b, 16) == 0;
}
equivalent to
#include <altivec.h>
bool cmpeq16_2(const void *a, const void *b)
{
const vector unsigned char va = vec_xl(0, (unsigned char *)a);
const vector unsigned char vb = vec_xl(0, (unsigned char *)b);
return vec_all_eq(va, vb);
}
That is, the patch transforms the DAG
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t3: i128,ch = load<(load (s128) from %ir.a, align 1)> t0, t2, undef:i64
t4: i64,ch = CopyFromReg t0, Register:i64 %1
t5: i128,ch = load<(load (s128) from %ir.b, align 1)> t0, t4, undef:i64
t6: i1 = setcc t3, t5, setne:ch
---->
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t3: v16i8,ch = load<(load (s128) from %ir.a, align 1)> t0, t2, undef:i64
t4: i64,ch = CopyFromReg t0, Register:i64 %1
t5: v16i8,ch = load<(load (s128) from %ir.b, align 1)> t0, t4, undef:i64
t6: i32 = llvm.ppc.altivec.vcmpequb.p TargetConstant:i32<10505>, Constant:i32<2>, t3, t5
t7: i1 = setcc t6, Constant:i32<0>, seteq:ch
I think we can have another patch to convert
t6: i32 = llvm.ppc.altivec.vcmpequb.p TargetConstant:i32<10505>, Constant:i32<2>, t3, t5
t7: i1 = setcc t6, Constant:i32<0>, seteq:ch
into your instruction sequence.
; CHECK-NEXT: xor 4, 4, 5
; CHECK-NEXT: or 3, 3, 4
; CHECK-NEXT: cntlzd 3, 3
; CHECK-NEXT: rldicl 3, 3, 58, 63
Can we not change this sequence? It seems like a side effect and I'm not sure it's faster or slower.
Since we changed llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp to support 16-byte loads, the IR generated by MergeICmpsLegacyPass (https://github.com/XRPLF/llvm-project/blob/main/llvm/lib/Transforms/Scalar/MergeICmps.cpp#L840) is now lowered to

t5: i128,ch = load<(load (s128) from %ir.X, align 1)> t0, t2, undef:i64
t8: i1 = setcc Constant:i128<237684487579686500932345921536>, t5, setne:ch

I think we can have another patch to optimize this further and make the asm load a vector from memory and compare the two vectors.
✅ With the latest revision this PR passed the C/C++ code formatter.
The patch adds a 16-byte load size to PPCTTIImpl::enableMemCmpExpansion and folds i128 equality/inequality compares of two loads into a vectorized compare using vcmpequb.p when Altivec is available.
Rationale:
A scalar i128 SETCC (eq/ne) normally lowers to multiple scalar ops. On Altivec-capable subtargets, we can instead reinterpret the i128 loads as v16i8 vectors and use the Altivec vcmpequb.p instruction to perform a full 128-bit equality check in a single vector compare.
Example Result:
This transformation replaces memcmp(a, b, 16) with two vector loads and one vector compare instruction.
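Putting the pieces quoted in this review together, a self-contained sketch of the combine (assuming it runs from PPCTargetLowering's setcc combine in PPCISelLowering.cpp with its usual includes; the helper name is illustrative, and chain fix-up for the original loads is elided):

static SDValue foldI128SetCCToVCmpEQUB(SDNode *N, SelectionDAG &DAG) {
  SDLoc DL(N);
  SDValue LHS = N->getOperand(0);
  SDValue RHS = N->getOperand(1);
  ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(2))->get();

  // Only plain i128 equality/inequality of two one-use, non-volatile loads.
  if (CC != ISD::SETEQ && CC != ISD::SETNE)
    return SDValue();
  auto *LA = dyn_cast<LoadSDNode>(LHS);
  auto *LB = dyn_cast<LoadSDNode>(RHS);
  if (!LA || !LB || LA->isVolatile() || LB->isVolatile() ||
      LHS.getValueType() != MVT::i128 || RHS.getValueType() != MVT::i128 ||
      !LHS.hasOneUse() || !RHS.hasOneUse())
    return SDValue();

  // Re-type both loads as v16i8 so each becomes one vector load. (A real
  // implementation must also replace the chain uses of the original loads.)
  SDValue LHSVec = DAG.getLoad(MVT::v16i8, DL, LA->getChain(),
                               LA->getBasePtr(), LA->getMemOperand());
  SDValue RHSVec = DAG.getLoad(MVT::v16i8, DL, LB->getChain(),
                               LB->getBasePtr(), LB->getMemOperand());

  // vcmpequb. with CR6 selector 2 yields 1 iff all 16 bytes compare equal.
  SDValue IntrID =
      DAG.getConstant(Intrinsic::ppc_altivec_vcmpequb_p, DL, MVT::i32);
  SDValue CRSel = DAG.getConstant(2, DL, MVT::i32);
  SDValue Pred = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32,
                             {IntrID, CRSel, LHSVec, RHSVec});

  // Pred != 0 means "all bytes equal", so the condition flips: the original
  // SETNE becomes (setcc Pred, 0, seteq), matching the DAG dump above.
  return DAG.getSetCC(DL, N->getValueType(0), Pred,
                      DAG.getConstant(0, DL, MVT::i32),
                      CC == ISD::SETEQ ? ISD::SETNE : ISD::SETEQ);
}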