-
Notifications
You must be signed in to change notification settings - Fork 15.5k
[NVPTX] Change the alloca address space in NVPTXLowerAlloca #154814
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
thetheodor
wants to merge
19
commits into
llvm:main
Choose a base branch
from
thetheodor:nvptx_alloca_local_address_space
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 10 commits
Commits
Show all changes
19 commits
Select commit
Hold shift + click to select a range
095073e
[NVPTX] Change the alloca address space in NVPTXLowerAlloca
thetheodor 7eb8b83
format
thetheodor 1c642db
Function -> Module Pass
thetheodor dd55413
assert -> fatal_error
thetheodor 94ee8c6
Remove SP
thetheodor b37221c
Fix pointer size query
thetheodor 4a15fd4
[clang] modify backend datalayout check
thetheodor edcfce7
Address comments
thetheodor b263b5b
Fix Debug Info updating
thetheodor 14df7dc
Remove NVPTXISD::DYNAMIC_STACKALLOC
thetheodor 344de62
Update NVPTXPeephole to eliminate cvta.locals
thetheodor 0a5b4d2
Address comments
thetheodor 24997be
Handle Data Layout better
thetheodor 2072fa9
Update OpenMP tests
thetheodor e51c2d5
Revert "Update OpenMP tests"
thetheodor 959bda4
Revert "Handle Data Layout better"
thetheodor adce270
Do not change the DL
thetheodor d8f1be9
Fix failing test
thetheodor cc46e1c
Merge branch 'main' into nvptx_alloca_local_address_space
thetheodor File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several tests were failing because of this. clang checks if the module's layout is the same as the original.
I modified this to use
isCompatibleDataLayoutbut I'm not sure if this a good solutionThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on the failures? What exactly the DL strings that failed to match? Where did they come from? I think in general there should be no mismatch between clang and LLVM regarding what they consider to be the right data layout for the module. If they disagree, we need to figure out why and fix that.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example:
it's failing because the
-A5part (added byNVPTXLowerAlloca) was not part of the initial data layoutThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, the issue was that the datalayout in
clang/lib/Basic/Targets/NVPTX.cppwas not updatedThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The side effect of this is that multiple (very long) OpenMP test had to be updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hypothetically, I'd expect to see cases where existing code will bail out when it sees non-default AS, and the code that should bail out, but does not, because it does not bother to check whether the AS is non-default. Some of those will be caught by assertions, but some will happen to work and will remain silent. That's one of the reasons I'm somewhat reluctant about introducing AS on allocas early on -- the potential problem surface is pretty much all LLVM passes.
Unless there's a clear benefit for NVPTX back-end users that's worth the risk, and the the extra ASCs to go with the allocas through all the passes, I'd rather settler for a bit of extra complexity localized to the NVPTX back-end.
Stated intent of "produce simpler IR for allocas" alone just does not quite reach the bar of being worthwhile, IMO.
Allocas that remain in the PTX as always bad news for performance, so an extra address conversion instruction usually lost in the noise, and does not matter at all. I do not see much of a practical benefit even in the best case for this patch from the performance standpoint.
So far it looks like a wash to me. We may end up with potentially simpler/cleaner lowering for the allocase (with minimal/no benefit to the actual performance), but pay for it with an increased risk of unintended side effects (moderate? touches everything that touches allocas in all the passes), an incremental bump to the IR size to be processed by all the passes (granted, the impact of a few extra ASCs on compile time is probably not measurable in practice, but it's non-zero), and the fair amount of churn for the existing tests (not a showstopper, just a lot of mechanical changes).
Is it really worth doing/beneficial? What's the best case outcome we can expect from the patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt there will be many places where existing LLVM passes are mishandling allocas in a specific AS. The fact that AMDGPU is already using a specific AS for allocas makes me think that the support for this feature is already reasonably good. There might be some edge cases somewhere which need to be fixed, but overall think these hypothetical bugs should not be a major factor in considering which approach to choose.
I'd still lean towards switching to local allocas. This seems to me like it provides a more accurate representation of the machine in the LLVM IR. While the IR might be bigger when initially emitted from clang or unoptimized, once
InferAddressSpaceis run, it will be smaller with specific allocas since we'll no longer need to wrap every generic alloca in a cast to it's true address space. We should probably consider movingInferAddressSpaceearlier to eliminate the size issue and this would have additional benefits such as improving subsequent alias-analysis compile-time.In general, this seems like this change allows us to eliminate a lot of hacks and work-arounds from the backend, in some cases improving the quality of emitted IR (I agree these quality improvements seem very marginal but I think it's still a win and there may be cases where they do make a difference). There are definitely some switching costs, and I'm not sure how best to handle the transition, but the final destination seems preferable even if it's not a game-changer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Agreed.
If we explicitly specify AS for local variables, then we should be doing that explicitly for other variables as well. AS knowledge does not benefit generic LLVM passes and is required for the NVPTX purposes only. I'm still not convinced that the proposed change buys us anything useful, other than DL being nominally closer to what the back-end needs.
In that case, InferAddressSpace should also be able to convert existing allocas to the correct AS without changing the default AS. We can infer AS for the newly materialized allocas with a late pass of InferAddressSpace.
So, it looks like the usefulness of the patch boils down to what to do about the allocas materialized throughout the compilation pipeline. We have the following scenarios:
My issues with (b) are mainly the invasiveness of DL change, and the fact that if the bring the idea of setting DL in a way that reflects the back-end AS where the data would live, then the same argument should apply to other data, not just local variables. It would benefit AA, but it does not look like something we should be forcing on the users. I think it's something that belongs under the compiler hood, IMO. No need to force users to do something compiler is quite capable of doing itself.
Perhaps we can have the cake and eat it here.
The fact that Data layout allows us to specify the default alloca AS does not mean that we have to do it that way. In practice, we'll still need to allow user IR to use default AS for allocas, and we will need to run
InferAddressSpaceat some point. I'm fine doing it early. It leaves us with the question of the AS for the new allocas. I wonder whether we could have a parallel hint for the default AS. If the DL specifies it, use DL-specified one. If DL says nothing, check with the target. We'll still need to run InferAddressSpace once more before lowering, but we should be able to reap most of the AA benefits with no churn for the user.This would also using (b) for experiments via explicit DL (and we can change the DL later if it proves to be the best way to handle it all), but it also avoids disrupting the existing users, and it gives us flexibility for how we handle allocas under the hood while we're sorting it out.
Does this make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but LLVM already has support for allocas, whereas (afaik) for other types (e.g., shared variables) there is currently no support. For the former we simply need to change the DL and do a very simple autoupgrade of the input IR, whereas for globals/shared/etc the front-ends would have to emit different code (or we'd have to do a much more complicated autoupgrade?). But is this a good reason not to go ahead with setting the proper address space for allocas?
Maybe I'm misunderstanding something. But shouldn't autoupgrade take care of this? Users are welcome to specify the alloca address space, but they won't have to.
Would there be any benefit in not autoupgrading allocas to the local address space? Especially if we run InferAddressSpace early, wouldn't we more or less end up in the same spot?
One option would be to do this in
IRBuilder::CreateAllocaand check what the target wants if the alloca AS is unspecified (i.e., is zero). But I have some concerns with this:--nvptx-short-ptr.-A5and-A0DLs, then we'll have to identify, re-implement/fix, and test anything that relies on the DL's alloca AS.-A5, all allocas are expected to be in the local address space.-A0) we rely onInferAddressSpaces(potentially run early) to fix the existing allocas and either we rely on another lateInferAddressSpacesor, if possible, we modifyCreateAlloca(or something similar) to consult the Target when creating allocas. My question is how are these two scenarios different from the users' perspective? For example, clang will automatically set the correct DL to compile a .cu file and for existing IR we can simply autoupgrade it. Am I missing something? 🤔There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your proposal is workable, I just do not see enough concrete benefits to justify the churn of changing DL. As things stand, it mostly shuffles ASCs around space and time, making few things easier, and few things a bit more cumbersome. If there are some concrete examples of this change paving the way for something with measurable benefits, I would be delighted to revise my assessment.
That said, autoupgrade (can we autoupgrade DL? I never tried.) is probably going to mitigate the hassle for the current users that use current DL and generate AS0-based allocas, so it's not going to make things that much worse, either.
If it was a small local change, I'd just stamp it and moved on. But for a moderately large patch touching a dozen files, changing DL, requiring autoupgrade, it's just not worth it, IMO.
So, I'm still not convinced that we need this change, but if there's plausible evidence that the change will be useful, or of someone else thinks that we want or need this change, I'll be OK with that.
To make it more concrete, let's start with the stated goal:
In all the tests in the patch I see only two instances of
cvta.local.u64eliminated in llvm/test/CodeGen/NVPTX/lower-byval-args.ll, and it's hard to tell whether the patch just shifted stuff around sufficiently to allow LLVM eliminate unnecessary writes (I think this particular problem had some notable recent improvements) or if it directly contributed to the better code. If you could add some tests where the patch clearly removes more cvta instructions from PTX, that would help.