[AArch64] Combine signext_inreg of setcc(... != splat(0)) #157665
Merged
This feels quite late in the pipeline if we're relying upon AArch64 ISD nodes.
When lowering ctz_v16i1 I see this in the debug output:
and there is a run of DAGCombiner immediately afterwards, which suggests that you can do this optimisation earlier and look for the SIGN_EXTEND_INREG node instead. In theory you should be able to make the codegen even better, essentially by doing:
That way you can get rid of the remaining shl instruction I think, which is also unnecessary.
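For illustration, a combine on the generic node could look roughly like the sketch below. The function name and exact guards are placeholders rather than anything in this patch; the key assumption is that AArch64's vector booleans are already 0 / all-ones, so sign-extending a compare result in-register adds nothing.

```cpp
// Illustrative sketch only (not the code in this PR): fold
//   sign_extend_inreg (setcc X, splat(0), setne), vXi1
// back to the setcc itself when the target's vector booleans are
// ZeroOrNegativeOne, i.e. every lane of the compare is already 0 or -1.
#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/CodeGen/TargetLowering.h"
using namespace llvm;

static SDValue combineSignExtInRegOfSetCC(SDNode *N,
                                          const TargetLowering &TLI) {
  assert(N->getOpcode() == ISD::SIGN_EXTEND_INREG && "expected sext_inreg");
  SDValue Src = N->getOperand(0);
  EVT VT = N->getValueType(0);

  // Only interested in vector predicates produced by a compare.
  if (!VT.isVector() || Src.getOpcode() != ISD::SETCC)
    return SDValue();

  // Match "... != splat(0)", as in the PR title.
  ISD::CondCode CC = cast<CondCodeSDNode>(Src.getOperand(2))->get();
  if (CC != ISD::SETNE ||
      !ISD::isConstantSplatVectorAllZeros(Src.getOperand(1).getNode()))
    return SDValue();

  // If vector booleans are 0 / all-ones, the lanes of the setcc are already
  // sign-extended and the sign_extend_inreg is redundant.
  if (TLI.getBooleanContents(VT) !=
      TargetLowering::ZeroOrNegativeOneBooleanContent)
    return SDValue();

  return Src;
}
```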
So my original approach was to match the SIGN_EXTEND_INREG itself, rather than the expansion, as you suggest. The issue here seems to be that for some/most of the cases this is not sufficient to trigger the transformation. In the @llvm.experimental.cttz.elts tests this works as you show above, but for example in the @llvm.masked.load cases we expand the sign_extend_inreg node BEFORE we expand the masked_load node, so the pattern we're trying to match doesn't exist.
It seemed like matching the expansion of the SIGN_EXTEND_INREG made the most sense to catch all the cases, but maybe there's an alternative approach I'm not seeing that would still allow us to match the SIGN_EXTEND_INREG?
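For reference, the expansion being matched has the shape vashr(vshl(X, EltBits-1), EltBits-1). A minimal sketch of recognising that shape follows; the name and structure are illustrative, not the exact code in this patch, and it assumes the AArch64ISD definitions from AArch64ISelLowering.h.

```cpp
// Sketch only: recognise the expansion of a sign_extend_inreg from vXi1,
// i.e. (AArch64ISD::VASHR (AArch64ISD::VSHL X, EltBits-1), EltBits-1),
// and return the pre-extension value X (only bit 0 of each lane defined).
static SDValue matchSextInRegFromI1Expansion(SDNode *N) {
  if (N->getOpcode() != AArch64ISD::VASHR)
    return SDValue();

  SDValue Shl = N->getOperand(0);
  if (Shl.getOpcode() != AArch64ISD::VSHL)
    return SDValue();

  // Both shift immediates must be EltBits-1, which is what expanding a
  // sign_extend_inreg from an i1 element type produces.
  uint64_t EltBits = N->getValueType(0).getScalarSizeInBits();
  auto *AshrAmt = dyn_cast<ConstantSDNode>(N->getOperand(1));
  auto *ShlAmt = dyn_cast<ConstantSDNode>(Shl.getOperand(1));
  if (!AshrAmt || !ShlAmt || AshrAmt->getZExtValue() != EltBits - 1 ||
      ShlAmt->getZExtValue() != EltBits - 1)
    return SDValue();

  return Shl.getOperand(0);
}
```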
OK I see, I'll have a think about it some more then. It just feels like the sort of thing we ought to be fixing earlier on using generic ISD nodes. In reality the predicate could come from two different sources:
Ideally we'd be able to handle both.
I'm hoping that we can encourage all paths into a canonical form that requires a single DAG combine. Your PR may effectively be doing that, just at the very last DAG combiner pass.
OK I think I understand this a bit more now ... If I lower these two functions:
we actually end up with decent codegen for masked_load_v16i8_2, so the problem is purely limited to the case where the predicate is an unknown live-in for the block. I see what you mean about the ordering of lowering for masked_load_v16i8, i.e. we first see the sign_extend_inreg being expanded and only then the masked_load being lowered.
It feels like a shame we're expanding the sign_extend_inreg so early on. I wonder if a cleaner solution is to fold
t16: v16i8 = sign_extend_inreg t4, ValueType:ch:v16i1
and
t9: v16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t16, t7
into this:
That would remove the extends completely and hopefully lead to better codegen too, since it will also remove the VSHL. Can we do this in the DAG combine phase after Type-legalized selection DAG: %bb.0 'masked_load_v16i8:'? What do you think?
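A rough sketch of the kind of masked_load fold being described, purely for illustration: the helper name is hypothetical, the exact replacement isn't quoted above, and, as pointed out further down, dropping the sign_extend_inreg like this is not actually safe once the upper bits of the promoted predicate can be undefined.

```cpp
// Sketch only: if the mask of a masked_load is a sign_extend_inreg from
// vXi1, rebuild the load with the pre-extension value as the mask.
// NOTE: a real combine must also replace the chain result, and this is
// unsafe as-is when the upper bits of the promoted predicate are undefined.
static SDValue foldSextInRegIntoMaskedLoadMask(SDNode *N, SelectionDAG &DAG) {
  auto *MLD = cast<MaskedLoadSDNode>(N);
  SDValue Mask = MLD->getMask();
  if (Mask.getOpcode() != ISD::SIGN_EXTEND_INREG ||
      cast<VTSDNode>(Mask.getOperand(1))->getVT().getScalarType() != MVT::i1)
    return SDValue();

  SDLoc DL(N);
  return DAG.getMaskedLoad(MLD->getValueType(0), DL, MLD->getChain(),
                           MLD->getBasePtr(), MLD->getOffset(),
                           Mask.getOperand(0), MLD->getPassThru(),
                           MLD->getMemoryVT(), MLD->getMemOperand(),
                           MLD->getAddressingMode(), MLD->getExtensionType(),
                           MLD->isExpandingLoad());
}
```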
Ah yes that seems like a neat solution!
This will leave us not catching the CTTZ_ELTS case, but that can be handled separately - I think we'd still need something along these lines for that one. Unless we did something when we expand the cttz.elts intrinsic perhaps, although I'm not sure how feasible that is.
Please ignore my previous comments as I'd completely forgotten that for types such as v4i1 the upper bits can be undefined. The reason for the sign_extend_inreg is to ensure the promoted type v4i8 is well-defined. In this case you can't remove the sign_extend_inreg, so I think the approach you have here is probably the most viable. However, one minor suggestion - instead of simply removing the VASHR and leaving the VSHL, would it be better to replace both nodes with a simple vector AND of 0x1? We only care about whether each lane is zero or non-zero.
I think on a few cores the vector AND has a higher throughput than SHL.
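A sketch of that suggestion, again illustrative rather than the actual patch, reusing the hypothetical matcher sketched earlier:

```cpp
// Sketch only: replace (VASHR (VSHL X, EltBits-1), EltBits-1) with
// (and X, splat(1)). Lanes become 0 or 1 instead of 0 or -1, which is
// enough when the consumer only distinguishes zero from non-zero lanes,
// and vector AND has higher throughput than SHL on some cores.
static SDValue replaceI1SextWithAnd(SDNode *N, SelectionDAG &DAG) {
  SDValue X = matchSextInRegFromI1Expansion(N); // hypothetical helper above
  if (!X)
    return SDValue();

  SDLoc DL(N);
  EVT VT = N->getValueType(0);
  return DAG.getNode(ISD::AND, DL, VT, X, DAG.getConstant(1, DL, VT));
}
```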
Oh yes this is a nice approach and I can see that the throughput is better for AND on e.g. Neoverse V2 from the SWOG.
My one thought is - do we care about paying the cost of the additional MOVI #1 for the higher-throughput AND? I think that on newer cores, e.g. Neoverse V3, the throughput of SHL is equal to AND, so maybe there we would favor just the SHL? Obviously there are more Neoverse V2-class cores in the wild at present, so I can see it would make sense to favor this; I'm just flagging it up!