13 changes: 7 additions & 6 deletions clang/lib/CodeGen/BackendUtil.cpp
@@ -1444,15 +1444,16 @@ void clang::emitBackendOutput(CompilerInstance &CI, CodeGenOptions &CGOpts,

  // Verify clang's TargetInfo DataLayout against the LLVM TargetMachine's
  // DataLayout.
-  if (AsmHelper.TM) {
-    std::string DLDesc = M->getDataLayout().getStringRepresentation();
-    if (DLDesc != TDesc) {
+  if (AsmHelper.TM)
+    if (!AsmHelper.TM->isCompatibleDataLayout(M->getDataLayout()) ||
+        !AsmHelper.TM->isCompatibleDataLayout(DataLayout(TDesc))) {
+      std::string DLDesc = M->getDataLayout().getStringRepresentation();
Contributor Author:

Several tests were failing because of this: clang checks whether the module's data layout is the same as the original.

I modified this to use isCompatibleDataLayout, but I'm not sure if this is a good solution.
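For illustration, a minimal sketch of what an NVPTX-side override could look like, relying on the protected DL member and on the hook being made virtual in the TargetMachine.h change further down; the compatibility rule itself is an assumption, not what the patch does:

// Hypothetical override in NVPTXTargetMachine -- a sketch only.
// Treat a candidate layout as compatible if it is identical to the target's
// layout, or differs only by the missing "-A5" alloca address space.
bool isCompatibleDataLayout(const DataLayout &Candidate) const override {
  if (DL == Candidate)
    return true;
  return Candidate.getAllocaAddrSpace() == 0 &&
         DL == DataLayout(Candidate.getStringRepresentation() + "-A5");
}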

Member:


Can you elaborate on the failures? What exactly were the DL strings that failed to match? Where did they come from? I think in general there should be no mismatch between clang and LLVM regarding what they consider to be the right data layout for the module. If they disagree, we need to figure out why and fix that.

Contributor Author (@thetheodor, Sep 1, 2025):


For example:

/usr/bin/python3 /work/llvm-project/build/./bin/llvm-lit -vv /work/llvm-project/clang/test/CodeGenCUDA/bf16.cu                                                 
...
error: backend data layout 'e-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64-A5' does not match expected target description 'e-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64'
1 error generated when compiling for .

It's failing because the -A5 part (added by NVPTXLowerAlloca) was not part of the initial data layout.

Contributor Author:


Fixed; the issue was that the data layout in clang/lib/Basic/Targets/NVPTX.cpp was not updated.
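Concretely, the fix amounts to appending the alloca address space to clang's TargetInfo layout string. A sketch of the 64-bit variant, with the string taken from the error above (the surrounding code is elided and assumed):

// clang/lib/Basic/Targets/NVPTX.cpp (sketch): the TargetInfo layout string
// gains "-A5" so it matches the layout produced by the NVPTX TargetMachine.
resetDataLayout("e-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64-A5");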

Contributor Author:


The side effect of this is that multiple (very long) OpenMP tests had to be updated.

Member:


> Another option is to completely update everywhere to use A5. This isn't totally new ground, since this seems to be what AMDGPU does, but it is a really big change and I'm not sure what all the implications would be.

Hypothetically, I'd expect to see cases where existing code will bail out when it sees a non-default AS, and cases where code that should bail out does not, because it does not bother to check whether the AS is non-default. Some of those will be caught by assertions, but some will happen to work and will remain silent. That's one of the reasons I'm somewhat reluctant about introducing AS on allocas early on -- the potential problem surface is pretty much all LLVM passes.
Unless there's a clear benefit for NVPTX back-end users that's worth the risk, and the extra ASCs that go with the allocas through all the passes, I'd rather settle for a bit of extra complexity localized to the NVPTX back-end.

The stated intent of "produce simpler IR for allocas" alone just does not quite reach the bar of being worthwhile, IMO.

Allocas that remain in the PTX are always bad news for performance, so an extra address conversion instruction is usually lost in the noise and does not matter at all. I do not see much practical benefit from this patch even in the best case from a performance standpoint.

So far it looks like a wash to me. We may end up with potentially simpler/cleaner lowering for the allocas (with minimal/no benefit to the actual performance), but pay for it with an increased risk of unintended side effects (moderate? it touches everything that touches allocas in all the passes), an incremental bump to the IR size to be processed by all the passes (granted, the impact of a few extra ASCs on compile time is probably not measurable in practice, but it's non-zero), and a fair amount of churn for the existing tests (not a showstopper, just a lot of mechanical changes).

Is it really worth doing/beneficial? What's the best case outcome we can expect from the patch?

Member:


> Hypothetically, I'd expect to see cases where existing code will bail out when it sees a non-default AS, and cases where code that should bail out does not, because it does not bother to check whether the AS is non-default. Some of those will be caught by assertions, but some will happen to work and will remain silent.

I doubt there will be many places where existing LLVM passes are mishandling allocas in a specific AS. The fact that AMDGPU is already using a specific AS for allocas makes me think that the support for this feature is already reasonably good. There might be some edge cases somewhere which need to be fixed, but overall I think these hypothetical bugs should not be a major factor in considering which approach to choose.

> Is it really worth doing/beneficial? What's the best case outcome we can expect from the patch?

I'd still lean towards switching to local allocas. This seems to me like it provides a more accurate representation of the machine in the LLVM IR. While the IR might be bigger when initially emitted from clang or unoptimized, once InferAddressSpace is run, it will be smaller with specific allocas, since we'll no longer need to wrap every generic alloca in a cast to its true address space. We should probably consider moving InferAddressSpace earlier to eliminate the size issue, and this would have additional benefits such as improving subsequent alias-analysis compile time.

In general, this change seems to allow us to eliminate a lot of hacks and workarounds from the backend, in some cases improving the quality of the emitted IR (I agree these quality improvements seem very marginal, but I think it's still a win, and there may be cases where they do make a difference). There are definitely some switching costs, and I'm not sure how best to handle the transition, but the final destination seems preferable even if it's not a game-changer.
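To make the cast-wrapping concrete, a small hand-written IR sketch (not taken from the patch's tests) of the two schemes:

; (a) generic alloca: NVPTXLowerAlloca has to wrap the object in addrspacecasts
;     so later passes and the backend know it really lives in local memory.
%buf = alloca i32, align 4
%buf.local = addrspacecast ptr %buf to ptr addrspace(5)
%buf.generic = addrspacecast ptr addrspace(5) %buf.local to ptr
store i32 0, ptr %buf.generic, align 4

; (b) local alloca: the object starts out in addrspace(5) and the casts
;     (and the cvta.local they eventually lower to) never appear.
%buf2 = alloca i32, align 4, addrspace(5)
store i32 0, ptr addrspace(5) %buf2, align 4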

Member:


> [The fact that] AMDGPU is already using a specific AS for allocas makes me think that the support for this feature is already reasonably good. There might be some edge cases somewhere which need to be fixed, but overall I think these hypothetical bugs should not be a major factor in considering which approach to choose.

Fair enough. Agreed.

> I'd still lean towards switching to local allocas. This seems to me like it provides a more accurate representation of the machine in the LLVM IR.

If we explicitly specify AS for local variables, then we should be doing that explicitly for other variables as well. AS knowledge does not benefit generic LLVM passes and is required for the NVPTX purposes only. I'm still not convinced that the proposed change buys us anything useful, other than DL being nominally closer to what the back-end needs.

> While the IR might be bigger when initially emitted from clang or unoptimized, once InferAddressSpace is run, it will be smaller with specific allocas, since we'll no longer need to wrap every generic alloca in a cast to its true address space.

In that case, InferAddressSpace should also be able to convert existing allocas to the correct AS without changing the default AS. We can infer AS for the newly materialized allocas with a late pass of InferAddressSpace.

So, it looks like the usefulness of the patch boils down to what to do about the allocas materialized throughout the compilation pipeline. We have the following scenarios:

  • a) Current implementation, before the patch: all allocas start in AS0 and are inferred to be local by a late InferAddressSpace pass.
  • b) Proposed implementation: makes the default AS for allocas local. Allows all allocas to be materialized with the correct AS. The DL change requires all users to update the IR they generate (we'll presumably auto-upgrade IR with the pre-patch DL), and an early run of InferAddressSpace will eliminate the redundant ASCs back to the generic AS.
  • c) Half-way proposal: run InferAddressSpace early to give existing allocas the correct AS, and run another InferAddressSpace late in the pipeline to catch newly materialized generic allocas. It gives us some of the alias-analysis benefits of the approach above, but without the disruption of changing the DL. The effectiveness of this approach will be better than the status quo, but less than changing the default alloca AS to local. By new, I mean allocas that are not the result of splitting/trimming an existing alloca, as in such cases I would assume the AS to be inherited from the old alloca, which would preserve the local AS. I do not have a good idea of what a typical ratio of pre-existing allocas vs. new allocas materialized by the compiler is. If new allocas are rare, then the AA effectiveness will asymptotically approach that of (b).

My issues with (b) are mainly the invasiveness of the DL change, and the fact that if we buy the idea of setting the DL in a way that reflects the back-end AS where the data would live, then the same argument should apply to other data, not just local variables. It would benefit AA, but it does not look like something we should be forcing on the users. I think it's something that belongs under the compiler hood, IMO. No need to force users to do something the compiler is quite capable of doing itself.

Perhaps we can have our cake and eat it here.
The fact that the data layout allows us to specify the default alloca AS does not mean that we have to do it that way. In practice, we'll still need to allow user IR to use the default AS for allocas, and we will need to run InferAddressSpace at some point. I'm fine doing it early. It leaves us with the question of the AS for the new allocas. I wonder whether we could have a parallel hint for the default AS: if the DL specifies it, use the DL-specified one; if the DL says nothing, check with the target. We'll still need to run InferAddressSpace once more before lowering, but we should be able to reap most of the AA benefits with no churn for the user.
This would also allow using (b) for experiments via an explicit DL (and we can change the DL later if it proves to be the best way to handle it all), but it also avoids disrupting the existing users, and it gives us flexibility for how we handle allocas under the hood while we're sorting it out (rough sketch below).
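In code, the hint could look something like the following; this is purely a sketch, and the target-side hook named here does not exist today:

// Hypothetical helper -- illustrates the "DL first, then target" lookup.
// TargetMachine::getDefaultAllocaAddrSpace() is an assumed, not existing, hook.
unsigned getPreferredAllocaAddrSpace(const DataLayout &DL,
                                     const TargetMachine &TM) {
  if (unsigned AS = DL.getAllocaAddrSpace())
    return AS;                            // explicit "-A<n>" in the DL string
  return TM.getDefaultAllocaAddrSpace();  // target-provided hint (hypothetical)
}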

Does this make sense?

Contributor Author:


> My issues with (b) are mainly the invasiveness of the DL change, and the fact that if we buy the idea of setting the DL in a way that reflects the back-end AS where the data would live, then the same argument should apply to other data, not just local variables.

I agree, but LLVM already has support for specifying the alloca address space, whereas (AFAIK) for other kinds of data (e.g., shared variables) there is currently no such support. For the former we simply need to change the DL and do a very simple autoupgrade of the input IR, whereas for globals/shared/etc. the front-ends would have to emit different code (or we'd have to do a much more complicated autoupgrade?). But is this a good reason not to go ahead with setting the proper address space for allocas?

> No need to force users to do something the compiler is quite capable of doing itself.

Maybe I'm misunderstanding something, but shouldn't autoupgrade take care of this? Users are welcome to specify the alloca address space, but they won't have to.
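For what it's worth, a rough sketch of what that autoupgrade would amount to at the IR level (illustrative only; whether the DL itself can be auto-upgraded is the open question raised further down):

; Old IR, written against the pre-A5 layout:
;   target datalayout = "...-n16:32:64"
;   %x = alloca i32, align 4
;
; Hypothetical autoupgrade to the new layout:
;   target datalayout = "...-n16:32:64-A5"
;   %x = alloca i32, align 4, addrspace(5)
;   %x.cast = addrspacecast ptr addrspace(5) %x to ptr   ; keeps generic users working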

> The fact that the data layout allows us to specify the default alloca AS does not mean that we have to do it that way. In practice, we'll still need to allow user IR to use the default AS for allocas, and we will need to run InferAddressSpace at some point.

Would there be any benefit in not autoupgrading allocas to the local address space? Especially if we run InferAddressSpace early, wouldn't we more or less end up in the same spot?

> It leaves us with the question of the AS for the new allocas. I wonder whether we could have a parallel hint for the default AS: if the DL specifies it, use the DL-specified one; if the DL says nothing, check with the target.

One option would be to do this in IRBuilder::CreateAlloca and check what the target wants if the alloca AS is unspecified (i.e., is zero). But I have some concerns with this:

  • If we want to support both DLs (with or without specified alloca AS) then we need some way to construct two different layouts. Maybe this can be handled similarly to --nvptx-short-ptr.
  • Would this not introduce a larger surface for bugs or divergent behavior? If the NVPTX target must work with both -A5 and -A0 DLs, then we'll have to identify, re-implement/fix, and test anything that relies on the DL's alloca AS.
  • Maybe I'm misunderstanding the proposal, but the main idea is that we support two scenarios:
    • If the DL specifies -A5, all allocas are expected to be in the local address space.
    • If the DL does not specify the alloca address space (or specifies -A0) we rely on InferAddressSpaces (potentially run early) to fix the existing allocas and either we rely on another late InferAddressSpaces or, if possible, we modify CreateAlloca (or something similar) to consult the Target when creating allocas. My question is how are these two scenarios different from the users' perspective? For example, clang will automatically set the correct DL to compile a .cu file and for existing IR we can simply autoupgrade it. Am I missing something? 🤔
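A sketch of the CreateAlloca option mentioned above, reusing the same hypothetical target hook as in the earlier sketch (today IRBuilder only consults DL.getAllocaAddrSpace()):

// Sketch: pick the address space at alloca-creation time when the DL leaves
// it unspecified. getDefaultAllocaAddrSpace() is assumed, not an existing API.
AllocaInst *createAllocaForTarget(IRBuilderBase &B, Type *Ty, Value *ArraySize,
                                  const TargetMachine *TM) {
  const DataLayout &DL = B.GetInsertBlock()->getModule()->getDataLayout();
  unsigned AS = DL.getAllocaAddrSpace();
  if (AS == 0 && TM)
    AS = TM->getDefaultAllocaAddrSpace(); // hypothetical target-side hint
  return B.CreateAlloca(Ty, AS, ArraySize);
}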

Member:


Your proposal is workable, I just do not see enough concrete benefits to justify the churn of changing the DL. As things stand, it mostly shuffles ASCs around in space and time, making a few things easier and a few things a bit more cumbersome. If there are some concrete examples of this change paving the way for something with measurable benefits, I would be delighted to revise my assessment.

That said, autoupgrade (can we autoupgrade the DL? I never tried.) is probably going to mitigate the hassle for current users that use the current DL and generate AS0-based allocas, so it's not going to make things that much worse, either.

If it were a small local change, I'd just stamp it and move on. But for a moderately large patch touching a dozen files, changing the DL, and requiring autoupgrade, it's just not worth it, IMO.

So, I'm still not convinced that we need this change, but if there's plausible evidence that the change will be useful, or if someone else thinks that we want or need this change, I'll be OK with that.

To make it more concrete, let's start with the stated goal:

> This change results in fewer address-space-change (cvta) instructions in the final PTX.

In all the tests in the patch I see only two instances of cvta.local.u64 eliminated, in llvm/test/CodeGen/NVPTX/lower-byval-args.ll, and it's hard to tell whether the patch just shifted stuff around enough to allow LLVM to eliminate unnecessary writes (I think this particular problem had some notable recent improvements) or whether it directly contributed to the better code. If you could add some tests where the patch clearly removes more cvta instructions from the PTX, that would help.
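For readers following along, this is the kind of difference being asked for, sketched by hand rather than taken from the patch's tests:

Before (generic alloca) -- the frame slot is converted to a generic pointer before use:
    cvta.local.u64  %rd1, %SPL;
    st.u32          [%rd1], %r1;
After (local alloca) -- the store targets the local state space directly, no cvta:
    st.local.u32    [%SPL], %r1;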

unsigned DiagID = Diags.getCustomDiagID(
DiagnosticsEngine::Error, "backend data layout '%0' does not match "
"expected target description '%1'");
DiagnosticsEngine::Error,
"backend data layout '%0' is not compatible with "
"expected target description '%1'");
Diags.Report(DiagID) << DLDesc << TDesc;
}
}
}

// With -fembed-bitcode, save a copy of the llvm IR as data in the
2 changes: 1 addition & 1 deletion llvm/include/llvm/Target/TargetMachine.h
@@ -208,7 +208,7 @@ class LLVM_ABI TargetMachine {
/// The LLVM Module owns a DataLayout that is used for the target independent
/// optimizations and code generation. This hook provides a target specific
/// check on the validity of this DataLayout.
-  bool isCompatibleDataLayout(const DataLayout &Candidate) const {
+  virtual bool isCompatibleDataLayout(const DataLayout &Candidate) const {
return DL == Candidate;
}

8 changes: 8 additions & 0 deletions llvm/lib/IR/Verifier.cpp
@@ -120,6 +120,7 @@
#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/MathExtras.h"
#include "llvm/Support/ModRef.h"
#include "llvm/Support/NVPTXAddrSpace.h"
#include "llvm/Support/raw_ostream.h"
#include <algorithm>
#include <cassert>
@@ -4498,6 +4499,13 @@ void Verifier::visitAllocaInst(AllocaInst &AI) {
"alloca on amdgpu must be in addrspace(5)", &AI);
}

+  if (TT.isNVPTX()) {
+    Check(AI.getAddressSpace() == NVPTXAS::ADDRESS_SPACE_LOCAL ||
+              AI.getAddressSpace() == NVPTXAS::ADDRESS_SPACE_GENERIC,
+          "AllocaInst can only be in Generic or Local address space for NVPTX.",
+          &AI);
+  }

visitInstruction(AI);
}
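As a quick, hand-written illustration (not from the patch's tests), the new check accepts generic and local allocas on an NVPTX triple and rejects everything else:

; target triple = "nvptx64-nvidia-cuda"
%ok1 = alloca i32, align 4                 ; addrspace(0), generic: accepted
%ok2 = alloca i32, align 4, addrspace(5)   ; local: accepted
%bad = alloca i32, align 4, addrspace(3)   ; shared: rejected by the verifier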

2 changes: 1 addition & 1 deletion llvm/lib/Target/NVPTX/NVPTX.h
@@ -48,7 +48,7 @@ MachineFunctionPass *createNVPTXPrologEpilogPass();
MachineFunctionPass *createNVPTXReplaceImageHandlesPass();
FunctionPass *createNVPTXImageOptimizerPass();
FunctionPass *createNVPTXLowerArgsPass();
-FunctionPass *createNVPTXLowerAllocaPass();
+ModulePass *createNVPTXLowerAllocaPass();
FunctionPass *createNVPTXLowerUnreachablePass(bool TrapUnreachable,
bool NoTrapAfterNoreturn);
FunctionPass *createNVPTXTagInvariantLoadsPass();
15 changes: 7 additions & 8 deletions llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
@@ -80,6 +80,7 @@
#include "llvm/Support/Compiler.h"
#include "llvm/Support/Endian.h"
#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/NVPTXAddrSpace.h"
#include "llvm/Support/NativeFormatting.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetLoweringObjectFile.h"
@@ -1483,14 +1484,12 @@ void NVPTXAsmPrinter::setAndEmitFunctionVirtualRegisters(
int64_t NumBytes = MFI.getStackSize();
if (NumBytes) {
O << "\t.local .align " << MFI.getMaxAlign().value() << " .b8 \t"
-      << DEPOTNAME << getFunctionNumber() << "[" << NumBytes << "];\n";
-    if (static_cast<const NVPTXTargetMachine &>(MF.getTarget()).is64Bit()) {
-      O << "\t.reg .b64 \t%SP;\n"
-        << "\t.reg .b64 \t%SPL;\n";
-    } else {
-      O << "\t.reg .b32 \t%SP;\n"
-        << "\t.reg .b32 \t%SPL;\n";
-    }
+      << DEPOTNAME << getFunctionNumber() << "[" << NumBytes << "];\n"
+      << "\t.reg .b"
+      << MF.getTarget().getPointerSizeInBits(ADDRESS_SPACE_GENERIC)
+      << " \t%SP;\n"
+      << "\t.reg .b" << MF.getTarget().getPointerSizeInBits(ADDRESS_SPACE_LOCAL)
+      << " \t%SPL;\n";
}

// Go through all virtual registers to establish the mapping between the
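For orientation, the header this prints comes out roughly like the following in a configuration where generic pointers are 64-bit and local pointers are 32-bit (hand-written illustration; the depot name and sizes are examples):

.local .align 8 .b8  __local_depot0[16];
.reg .b64  %SP;
.reg .b32  %SPL;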
52 changes: 36 additions & 16 deletions llvm/lib/Target/NVPTX/NVPTXFrameLowering.cpp
@@ -48,25 +48,45 @@ void NVPTXFrameLowering::emitPrologue(MachineFunction &MF,
  //   mov %SPL, %depot;
  //   cvta.local %SP, %SPL;
  // for local address accesses in MF.
-  bool Is64Bit =
-      static_cast<const NVPTXTargetMachine &>(MF.getTarget()).is64Bit();
+  // If the generic and local address spaces are different, it emits:
+  //   mov %SPL, %depot;
+  //   cvt.u64.u32 %SP, %SPL;
+  //   cvta.local %SP, %SP;
+
+  if (MR.use_empty(NRI->getFrameLocalRegister(MF)))
+    // If %SPL is not used, do not bother emitting anything
+    return;
+
+  bool IsLocal64Bit =
+      MF.getTarget().getPointerSize(NVPTXAS::ADDRESS_SPACE_LOCAL) == 8;
+  bool IsGeneric64Bit =
+      MF.getTarget().getPointerSize(NVPTXAS::ADDRESS_SPACE_GENERIC) == 8;
+  bool NeedsCast = IsGeneric64Bit != IsLocal64Bit;
+  Register SourceReg = NRI->getFrameLocalRegister(MF);
+  if (NeedsCast)
+    SourceReg = NRI->getFrameRegister(MF);

  unsigned CvtaLocalOpcode =
-      (Is64Bit ? NVPTX::cvta_local_64 : NVPTX::cvta_local);
-  unsigned MovDepotOpcode =
-      (Is64Bit ? NVPTX::MOV_DEPOT_ADDR_64 : NVPTX::MOV_DEPOT_ADDR);
-  if (!MR.use_empty(NRI->getFrameRegister(MF))) {
-    // If %SP is not used, do not bother emitting "cvta.local %SP, %SPL".
-    MBBI = BuildMI(MBB, MBBI, dl,
-                   MF.getSubtarget().getInstrInfo()->get(CvtaLocalOpcode),
-                   NRI->getFrameRegister(MF))
-               .addReg(NRI->getFrameLocalRegister(MF));
-  }
-  if (!MR.use_empty(NRI->getFrameLocalRegister(MF))) {
-    BuildMI(MBB, MBBI, dl,
-            MF.getSubtarget().getInstrInfo()->get(MovDepotOpcode),
-            NRI->getFrameLocalRegister(MF))
-        .addImm(MF.getFunctionNumber());
-  }
+      (IsGeneric64Bit ? NVPTX::cvta_local_64 : NVPTX::cvta_local);
+
+  MBBI = BuildMI(MBB, MBBI, dl,
+                 MF.getSubtarget().getInstrInfo()->get(CvtaLocalOpcode),
+                 NRI->getFrameRegister(MF))
+             .addReg(SourceReg);
+
+  if (NeedsCast)
+    MBBI = BuildMI(MBB, MBBI, dl,
+                   MF.getSubtarget().getInstrInfo()->get(NVPTX::CVT_u64_u32),
+                   NRI->getFrameRegister(MF))
+               .addReg(NRI->getFrameLocalRegister(MF))
+               .addImm(NVPTX::PTXCvtMode::NONE);
+
+  unsigned MovDepotOpcode =
+      (IsLocal64Bit ? NVPTX::MOV_DEPOT_ADDR_64 : NVPTX::MOV_DEPOT_ADDR);
+  BuildMI(MBB, MBBI, dl,
+          MF.getSubtarget().getInstrInfo()->get(MovDepotOpcode),
+          NRI->getFrameLocalRegister(MF))
+      .addImm(MF.getFunctionNumber());
  }
}

26 changes: 1 addition & 25 deletions llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -1106,7 +1106,6 @@ const char *NVPTXTargetLowering::getTargetNodeName(unsigned Opcode) const {
MAKE_CASE(NVPTXISD::FMINNUM3)
MAKE_CASE(NVPTXISD::FMAXIMUM3)
MAKE_CASE(NVPTXISD::FMINIMUM3)
-    MAKE_CASE(NVPTXISD::DYNAMIC_STACKALLOC)
MAKE_CASE(NVPTXISD::STACKRESTORE)
MAKE_CASE(NVPTXISD::STACKSAVE)
MAKE_CASE(NVPTXISD::SETP_F16X2)
@@ -1771,10 +1770,8 @@ SDValue NVPTXTargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,

SDValue NVPTXTargetLowering::LowerDYNAMIC_STACKALLOC(SDValue Op,
SelectionDAG &DAG) const {

if (STI.getPTXVersion() < 73 || STI.getSmVersion() < 52) {
const Function &Fn = DAG.getMachineFunction().getFunction();

DAG.getContext()->diagnose(DiagnosticInfoUnsupported(
Fn,
"Support for dynamic alloca introduced in PTX ISA version 7.3 and "
@@ -1785,28 +1782,7 @@ SDValue NVPTXTargetLowering::LowerDYNAMIC_STACKALLOC(SDValue Op,
return DAG.getMergeValues(Ops, SDLoc());
}

-  SDLoc DL(Op.getNode());
-  SDValue Chain = Op.getOperand(0);
-  SDValue Size = Op.getOperand(1);
-  uint64_t Align = Op.getConstantOperandVal(2);
-
-  // The alignment on a ISD::DYNAMIC_STACKALLOC node may be 0 to indicate that
-  // the default stack alignment should be used.
-  if (Align == 0)
-    Align = DAG.getSubtarget().getFrameLowering()->getStackAlign().value();
-
-  // The size for ptx alloca instruction is 64-bit for m64 and 32-bit for m32.
-  const MVT LocalVT = getPointerTy(DAG.getDataLayout(), ADDRESS_SPACE_LOCAL);
-
-  SDValue Alloc =
-      DAG.getNode(NVPTXISD::DYNAMIC_STACKALLOC, DL, {LocalVT, MVT::Other},
-                  {Chain, DAG.getZExtOrTrunc(Size, DL, LocalVT),
-                   DAG.getTargetConstant(Align, DL, MVT::i32)});
-
-  SDValue ASC = DAG.getAddrSpaceCast(
-      DL, Op.getValueType(), Alloc, ADDRESS_SPACE_LOCAL, ADDRESS_SPACE_GENERIC);
-
-  return DAG.getMergeValues({ASC, SDValue(Alloc.getNode(), 1)}, DL);
+  return Op;
}

SDValue NVPTXTargetLowering::LowerSTACKRESTORE(SDValue Op,
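For context, a hand-written example of the kind of input that now goes through the generic ISD::DYNAMIC_STACKALLOC path (the patterns in NVPTXInstrInfo.td below then select alloca.u32/u64 directly):

; Illustrative only. With the new DL, a dynamically sized alloca is already in
; the local address space, so no custom NVPTXISD node or addrspacecast is needed.
define ptr addrspace(5) @dyn(i32 %n) {
  %buf = alloca i8, i32 %n, align 16, addrspace(5)
  ret ptr addrspace(5) %buf
}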
1 change: 0 additions & 1 deletion llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -69,7 +69,6 @@ enum NodeType : unsigned {
FMAXIMUM3,
FMINIMUM3,

-  DYNAMIC_STACKALLOC,
STACKRESTORE,
STACKSAVE,
BrxStart,
33 changes: 21 additions & 12 deletions llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -2253,22 +2253,31 @@ def trapexitinst : NVPTXInst<(outs), (ins), "trap; exit;", [(trap)]>, Requires<[
// brkpt instruction
def debugtrapinst : BasicNVPTXInst<(outs), (ins), "brkpt", [(debugtrap)]>;

-def SDTDynAllocaOp :
-  SDTypeProfile<1, 2, [SDTCisSameAs<0, 1>, SDTCisInt<1>, SDTCisVT<2, i32>]>;
+def SDTDynAllocaOp
+    : SDTypeProfile<1, 2, [SDTCisSameAs<0, 1>, SDTCisInt<1>, SDTCisVT<2, i32>]>;

-def dyn_alloca :
-  SDNode<"NVPTXISD::DYNAMIC_STACKALLOC", SDTDynAllocaOp,
-         [SDNPHasChain, SDNPSideEffect]>;
+def getAllocaAlign : SDNodeXForm<imm, [{
+  if (N->getZExtValue() != 0)
+    return CurDAG->getTargetConstant(N->getZExtValue(), SDLoc(N), N->getValueType(0));
+  return CurDAG->getTargetConstant(CurDAG->getSubtarget().getFrameLowering()->getStackAlign().value(), SDLoc(N), N->getValueType(0));
+}]>;

-foreach t = [I32RT, I64RT] in {
-  def DYNAMIC_STACKALLOC # t.Size :
-    BasicNVPTXInst<(outs t.RC:$ptr),
-                   (ins t.RC:$size, i32imm:$align),
-                   "alloca.u" # t.Size,
-                   [(set t.Ty:$ptr, (dyn_alloca t.Ty:$size, timm:$align))]>,
-    Requires<[hasPTX<73>, hasSM<52>]>;
-}
+def dyn_alloca : SDNode<"ISD::DYNAMIC_STACKALLOC",
+                        SDTDynAllocaOp, [SDNPHasChain, SDNPSideEffect]>;
+
+let Predicates = [hasPTX<73>, hasSM<52>] in {
+  foreach t = [I32RT, I64RT] in {
+    def DYNAMIC_STACKALLOC_#t.Size
+        : BasicNVPTXInst<(outs t.RC:$ptr), (ins t.RC:$size, i32imm:$align),
+                         "alloca.u"#t.Size>;
+  }
+}
+
+def : Pat<(i32(dyn_alloca i32:$size, imm:$align)),
+          (DYNAMIC_STACKALLOC_32 $size, (getAllocaAlign imm:$align))>;
+def : Pat<(i64(dyn_alloca i64:$size, imm:$align)),
+          (DYNAMIC_STACKALLOC_64 $size, (getAllocaAlign imm:$align))>;

//
// BRX
//