-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[AMDGPU] Enable kernarg preloading by default on gfx940 #110691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1017,15 +1017,52 @@ struct AAAMDGPUNoAGPR | |
|
|
||
| const char AAAMDGPUNoAGPR::ID = 0; | ||
|
|
||
| static unsigned getMaxNumPreloadArgs(const Function &F, const DataLayout &DL, | ||
| const GCNSubtarget &ST) { | ||
| unsigned Offset = 0; | ||
| unsigned ArgsToPreload = 0; | ||
| for (const auto &Arg : F.args()) { | ||
| if (Arg.hasByRefAttr()) | ||
| break; | ||
|
|
||
| Type *Ty = Arg.getType(); | ||
| Align ArgAlign = DL.getABITypeAlign(Ty); | ||
| auto Size = DL.getTypeAllocSize(Ty); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The allocation size doesn't directly map to number of registers. Really this should look at what ComputeValueVTs does to get the number of registers. I think it would be better to be more precise (and maybe even make the inreg a hard requirement to respect)
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the alloc size is correct with respect to the number of user SGPRs that are initially allocated for preloading, since it maps directly on to how the data looks in the kernarg segment, and the number of registers used for preloading is derived from that.
I short of agree, but we have to decide what to do when frontends like Triton just add inreg to every argument. Should we remove it in cases where we cannot preload the argument? Print a warning if we cannot preload? I'm leaning towards the first option where we remove inreg from arguments that wont actually be preloaded somewhere like AMDGPULowerKernelArguments after all the attributes are finalized, ect.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's only correct for trivially legal types. It will not be for anything exotic or aggregates. You either need to use ComputeValueVTs or only accept trivial types that map to N registers
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What we are calculating is the size in memory, i.e. the start offset to the ending explicit offset. Isn't this method correct for anything that is not byref? Do you think we should also consider the attributes that are inferred by the AMDGPUAttributor itself when marking things for preloading here? I.e. add inreg at the very end?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ideally kernel arguments would only use byref (except for these inreg arguments). Non-byref arguments are really byref, and will be lowered to byref. If you're making the values inreg, they should be treated more like ordinary calling convention register arguments. It's probably simplest to just filter out types that aren't trivially register types. That's the case for all implicit kernel arguments anyway. |
||
| Offset = alignTo(Offset, ArgAlign); | ||
| if (((Offset + Size) / 4) > ST.getMaxNumUserSGPRs()) | ||
| break; | ||
|
|
||
| Offset += Size; | ||
| ArgsToPreload++; | ||
| } | ||
|
|
||
| return ArgsToPreload; | ||
| } | ||
|
|
||
| static void addPreloadKernArgHint(Function &F, TargetMachine &TM) { | ||
| const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F); | ||
| for (unsigned I = 0; | ||
| I < F.arg_size() && | ||
| I < std::min(KernargPreloadCount.getValue(), ST.getMaxNumUserSGPRs()); | ||
| ++I) { | ||
| if (!ST.hasKernargPreload()) | ||
| return; | ||
|
|
||
| // Enable kernarg preloading by default on GFX940+. | ||
| size_t PreloadCount; | ||
| if (KernargPreloadCount.getNumOccurrences()) { | ||
| // Override default behavior if CL option is present. | ||
| PreloadCount = std::min<size_t>(KernargPreloadCount, F.arg_size()); | ||
| } else { | ||
| // Defaults with no CL option. | ||
| if (ST.defaultEnabledKernargPreload()) | ||
| PreloadCount = | ||
| getMaxNumPreloadArgs(F, F.getParent()->getDataLayout(), ST); | ||
| else | ||
| PreloadCount = 0; | ||
| } | ||
|
|
||
| for (size_t I = 0; I < PreloadCount; ++I) { | ||
| Argument &Arg = *F.getArg(I); | ||
| // Check for incompatible attributes. | ||
| if (Arg.hasByRefAttr() || Arg.hasNestAttr()) | ||
| if (Arg.getType()->isAggregateType() || Arg.hasByRefAttr() || | ||
| Arg.hasNestAttr()) | ||
| break; | ||
|
|
||
| Arg.addAttr(Attribute::InReg); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a follow up we really should be handling the byref case for the future