[NVPTX] Change the alloca address space in NVPTXLowerAlloca #154814
Conversation
This patch refactors NVPTXLowerAlloca to produce simpler IR for allocas. Previously, the implementation attached a pair of consecutive address space casts to each alloca: one from addrspace(0) (generic) to addrspace(5) (local), and another immediately back to addrspace(0). Downstream passes needed to recognize this idiom to generate efficient PTX. With this patch, NVPTXLowerAlloca directly changes the address space of each alloca to "local" and inserts a single addrspacecast from local back to generic. The InferAddressSpace pass can then remove the remaining cast. This change results in fewer address-space-change (cvta) instructions in the final PTX.
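To make the transformation concrete, here is a before/after sketch in opaque-pointer LLVM IR, paraphrased from the pass's updated header comment in the diff below (hand-written for illustration, not copied from a test in the patch):

    ; Before: generic alloca plus the round-trip cast idiom
    %A = alloca i32
    %local = addrspacecast ptr %A to ptr addrspace(5)
    %generic = addrspacecast ptr addrspace(5) %local to ptr
    store i32 0, ptr %generic        ; lowered to st.local.u32 via idiom matching

    ; After: local alloca with a single cast back to generic
    %A = alloca i32, addrspace(5)
    %generic = addrspacecast ptr addrspace(5) %A to ptr
    store i32 0, ptr %generic        ; InferAddressSpaces can remove the cast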
@llvm/pr-subscribers-clang @llvm/pr-subscribers-llvm-ir
Author: Theodoros Theodoridis (thetheodor)
Changes: This patch refactors NVPTXLowerAlloca to produce simpler IR for allocas. Previously, the implementation attached a pair of consecutive address space casts to each alloca: one from addrspace(0) (generic) to addrspace(5) (local), and another immediately back to addrspace(0). Downstream passes needed to recognize this idiom to generate efficient PTX. With this patch, NVPTXLowerAlloca directly changes the address space of each alloca to "local" and inserts a single addrspacecast from local back to generic. The InferAddressSpace pass can then remove the remaining cast. This change results in fewer address-space-change (cvta) instructions in the final PTX.
Patch is 57.89 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154814.diff
22 Files Affected:
diff --git a/llvm/include/llvm/Target/TargetMachine.h b/llvm/include/llvm/Target/TargetMachine.h
index bf4e490554723..f4cd06bf93d40 100644
--- a/llvm/include/llvm/Target/TargetMachine.h
+++ b/llvm/include/llvm/Target/TargetMachine.h
@@ -208,7 +208,7 @@ class LLVM_ABI TargetMachine {
/// The LLVM Module owns a DataLayout that is used for the target independent
/// optimizations and code generation. This hook provides a target specific
/// check on the validity of this DataLayout.
- bool isCompatibleDataLayout(const DataLayout &Candidate) const {
+ virtual bool isCompatibleDataLayout(const DataLayout &Candidate) const {
return DL == Candidate;
}
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index 9d9b51db98702..2afa2bda53b52 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -114,6 +114,7 @@
#include "llvm/Pass.h"
#include "llvm/ProfileData/InstrProf.h"
#include "llvm/Support/AMDGPUAddrSpace.h"
+#include "llvm/Support/NVPTXAddrSpace.h"
#include "llvm/Support/AtomicOrdering.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/CommandLine.h"
@@ -4498,6 +4499,13 @@ void Verifier::visitAllocaInst(AllocaInst &AI) {
"alloca on amdgpu must be in addrspace(5)", &AI);
}
+ if (TT.isNVPTX()) {
+ Check(AI.getAddressSpace() == NVPTXAS::ADDRESS_SPACE_LOCAL ||
+ AI.getAddressSpace() == NVPTXAS::ADDRESS_SPACE_GENERIC,
+ "AllocaInst can only be in Generic or Local address space for NVPTX.",
+ &AI);
+ }
+
visitInstruction(AI);
}
diff --git a/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp b/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
index 7391c2d488b57..8528cb41d9244 100644
--- a/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
@@ -1484,7 +1484,8 @@ void NVPTXAsmPrinter::setAndEmitFunctionVirtualRegisters(
if (NumBytes) {
O << "\t.local .align " << MFI.getMaxAlign().value() << " .b8 \t"
<< DEPOTNAME << getFunctionNumber() << "[" << NumBytes << "];\n";
- if (static_cast<const NVPTXTargetMachine &>(MF.getTarget()).is64Bit()) {
+ if (static_cast<const NVPTXTargetMachine &>(MF.getTarget())
+ .getPointerSize(ADDRESS_SPACE_LOCAL) == 8) {
O << "\t.reg .b64 \t%SP;\n"
<< "\t.reg .b64 \t%SPL;\n";
} else {
diff --git a/llvm/lib/Target/NVPTX/NVPTXFrameLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXFrameLowering.cpp
index 47bc15f52bb96..78f18a93b869b 100644
--- a/llvm/lib/Target/NVPTX/NVPTXFrameLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXFrameLowering.cpp
@@ -48,8 +48,8 @@ void NVPTXFrameLowering::emitPrologue(MachineFunction &MF,
// mov %SPL, %depot;
// cvta.local %SP, %SPL;
// for local address accesses in MF.
- bool Is64Bit =
- static_cast<const NVPTXTargetMachine &>(MF.getTarget()).is64Bit();
+ bool Is64Bit = static_cast<const NVPTXTargetMachine &>(MF.getTarget())
+ .getPointerSize(NVPTXAS::ADDRESS_SPACE_LOCAL) == 8;
unsigned CvtaLocalOpcode =
(Is64Bit ? NVPTX::cvta_local_64 : NVPTX::cvta_local);
unsigned MovDepotOpcode =
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index bb4bb1195f78b..f35407ed99106 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -1803,10 +1803,7 @@ SDValue NVPTXTargetLowering::LowerDYNAMIC_STACKALLOC(SDValue Op,
{Chain, DAG.getZExtOrTrunc(Size, DL, LocalVT),
DAG.getTargetConstant(Align, DL, MVT::i32)});
- SDValue ASC = DAG.getAddrSpaceCast(
- DL, Op.getValueType(), Alloc, ADDRESS_SPACE_LOCAL, ADDRESS_SPACE_GENERIC);
-
- return DAG.getMergeValues({ASC, SDValue(Alloc.getNode(), 1)}, DL);
+ return Alloc;
}
SDValue NVPTXTargetLowering::LowerSTACKRESTORE(SDValue Op,
diff --git a/llvm/lib/Target/NVPTX/NVPTXLowerAlloca.cpp b/llvm/lib/Target/NVPTX/NVPTXLowerAlloca.cpp
index 88bc000f39bf7..03412edb9e23c 100644
--- a/llvm/lib/Target/NVPTX/NVPTXLowerAlloca.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXLowerAlloca.cpp
@@ -6,16 +6,16 @@
//
//===----------------------------------------------------------------------===//
//
-// For all alloca instructions, and add a pair of cast to local address for
-// each of them. For example,
+// Change the Module's DataLayout to have the local address space for alloca's.
+// Change the address space of each alloca to local and add an addrspacecast to
+// generic address space. For example,
//
// %A = alloca i32
// store i32 0, i32* %A ; emits st.u32
//
// will be transformed to
//
-// %A = alloca i32
-// %Local = addrspacecast i32* %A to i32 addrspace(5)*
+// %A = alloca i32, addrspace(5)
// %Generic = addrspacecast i32 addrspace(5)* %A to i32*
// store i32 0, i32 addrspace(5)* %Generic ; emits st.local.u32
//
@@ -24,18 +24,24 @@
//
//===----------------------------------------------------------------------===//
-#include "MCTargetDesc/NVPTXBaseInfo.h"
+#include "llvm/Support/NVPTXAddrSpace.h"
#include "NVPTX.h"
+#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
+#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/Module.h"
#include "llvm/IR/Type.h"
#include "llvm/Pass.h"
using namespace llvm;
+using namespace NVPTXAS;
namespace {
class NVPTXLowerAlloca : public FunctionPass {
bool runOnFunction(Function &F) override;
+ bool doInitialization(Module &M) override;
public:
static char ID; // Pass identification, replacement for typeid
@@ -58,77 +64,55 @@ bool NVPTXLowerAlloca::runOnFunction(Function &F) {
if (skipFunction(F))
return false;
- bool Changed = false;
+ SmallVector<AllocaInst *, 16> Allocas;
for (auto &BB : F)
- for (auto &I : BB) {
- if (auto allocaInst = dyn_cast<AllocaInst>(&I)) {
- Changed = true;
+ for (auto &I : BB)
+ if (auto *Alloca = dyn_cast<AllocaInst>(&I))
+ if (Alloca->getAddressSpace() != ADDRESS_SPACE_LOCAL)
+ Allocas.push_back(Alloca);
+
+ if (Allocas.empty())
+ return false;
- PointerType *AllocInstPtrTy =
- cast<PointerType>(allocaInst->getType()->getScalarType());
- unsigned AllocAddrSpace = AllocInstPtrTy->getAddressSpace();
- assert((AllocAddrSpace == ADDRESS_SPACE_GENERIC ||
- AllocAddrSpace == ADDRESS_SPACE_LOCAL) &&
- "AllocaInst can only be in Generic or Local address space for "
- "NVPTX.");
+ for (AllocaInst *Alloca : Allocas) {
+ auto *NewAlloca = new AllocaInst(
+ Alloca->getAllocatedType(), ADDRESS_SPACE_LOCAL, Alloca->getArraySize(),
+ Alloca->getAlign(), Alloca->getName());
+ auto *Cast = new AddrSpaceCastInst(
+ NewAlloca,
+ PointerType::get(Alloca->getAllocatedType()->getContext(),
+ ADDRESS_SPACE_GENERIC),
+ "");
+ Cast->insertBefore(Alloca->getIterator());
+ NewAlloca->insertBefore(Cast->getIterator());
+ for (auto &U : llvm::make_early_inc_range(Alloca->uses())) {
+ auto *II = dyn_cast<IntrinsicInst>(U.getUser());
+ if (!II || (II->getIntrinsicID() != Intrinsic::lifetime_start &&
+ II->getIntrinsicID() != Intrinsic::lifetime_end))
+ continue;
- Instruction *AllocaInLocalAS = allocaInst;
- auto ETy = allocaInst->getAllocatedType();
+ IRBuilder<> Builder(II);
+ Builder.CreateIntrinsic(II->getIntrinsicID(), {NewAlloca->getType()},
+ {NewAlloca});
+ II->eraseFromParent();
+ }
- // We need to make sure that LLVM has info that alloca needs to go to
- // ADDRESS_SPACE_LOCAL for InferAddressSpace pass.
- //
- // For allocas in ADDRESS_SPACE_LOCAL, we add addrspacecast to
- // ADDRESS_SPACE_LOCAL and back to ADDRESS_SPACE_GENERIC, so that
- // the alloca's users still use a generic pointer to operate on.
- //
- // For allocas already in ADDRESS_SPACE_LOCAL, we just need
- // addrspacecast to ADDRESS_SPACE_GENERIC.
- if (AllocAddrSpace == ADDRESS_SPACE_GENERIC) {
- auto ASCastToLocalAS = new AddrSpaceCastInst(
- allocaInst,
- PointerType::get(ETy->getContext(), ADDRESS_SPACE_LOCAL), "");
- ASCastToLocalAS->insertAfter(allocaInst->getIterator());
- AllocaInLocalAS = ASCastToLocalAS;
- }
+ Alloca->replaceAllUsesWith(Cast);
+ Alloca->eraseFromParent();
+ }
+ return true;
+}
- auto AllocaInGenericAS = new AddrSpaceCastInst(
- AllocaInLocalAS,
- PointerType::get(ETy->getContext(), ADDRESS_SPACE_GENERIC), "");
- AllocaInGenericAS->insertAfter(AllocaInLocalAS->getIterator());
+bool NVPTXLowerAlloca::doInitialization(Module &M) {
+ const auto &DL = M.getDataLayout();
+ if (DL.getAllocaAddrSpace() == ADDRESS_SPACE_LOCAL)
+ return false;
+ auto DLStr = DL.getStringRepresentation();
- for (Use &AllocaUse : llvm::make_early_inc_range(allocaInst->uses())) {
- // Check Load, Store, GEP, and BitCast Uses on alloca and make them
- // use the converted generic address, in order to expose non-generic
- // addrspacecast to NVPTXInferAddressSpaces. For other types
- // of instructions this is unnecessary and may introduce redundant
- // address cast.
- auto LI = dyn_cast<LoadInst>(AllocaUse.getUser());
- if (LI && LI->getPointerOperand() == allocaInst &&
- !LI->isVolatile()) {
- LI->setOperand(LI->getPointerOperandIndex(), AllocaInGenericAS);
- continue;
- }
- auto SI = dyn_cast<StoreInst>(AllocaUse.getUser());
- if (SI && SI->getPointerOperand() == allocaInst &&
- !SI->isVolatile()) {
- SI->setOperand(SI->getPointerOperandIndex(), AllocaInGenericAS);
- continue;
- }
- auto GI = dyn_cast<GetElementPtrInst>(AllocaUse.getUser());
- if (GI && GI->getPointerOperand() == allocaInst) {
- GI->setOperand(GI->getPointerOperandIndex(), AllocaInGenericAS);
- continue;
- }
- auto BI = dyn_cast<BitCastInst>(AllocaUse.getUser());
- if (BI && BI->getOperand(0) == allocaInst) {
- BI->setOperand(0, AllocaInGenericAS);
- continue;
- }
- }
- }
- }
- return Changed;
+ auto AddrSpaceStr = "A" + std::to_string(ADDRESS_SPACE_LOCAL);
+ assert(!StringRef(DLStr).contains("A") && "DataLayout should not contain A");
+ M.setDataLayout(DLStr.empty() ? AddrSpaceStr : DLStr + "-" + AddrSpaceStr);
+ return true;
}
FunctionPass *llvm::createNVPTXLowerAllocaPass() {
diff --git a/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp b/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
index e2bbe57c0085c..2b18ca9dd774a 100644
--- a/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
@@ -529,7 +529,8 @@ void copyByValParam(Function &F, Argument &Arg) {
// the use of the byval parameter with this alloca instruction.
AllocA->setAlignment(
Arg.getParamAlign().value_or(DL.getPrefTypeAlign(StructType)));
- Arg.replaceAllUsesWith(AllocA);
+ auto *AddressSpaceCast = IRB.CreateAddrSpaceCast(AllocA, Arg.getType(), Arg.getName());
+ Arg.replaceAllUsesWith(AddressSpaceCast);
CallInst *ArgInParam = createNVVMInternalAddrspaceWrap(IRB, Arg);
diff --git a/llvm/lib/Target/NVPTX/NVPTXRegisterInfo.cpp b/llvm/lib/Target/NVPTX/NVPTXRegisterInfo.cpp
index 646b554878c70..0c56caeadcffe 100644
--- a/llvm/lib/Target/NVPTX/NVPTXRegisterInfo.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXRegisterInfo.cpp
@@ -119,7 +119,7 @@ bool NVPTXRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
MI.getOperand(FIOperandNum + 1).getImm();
// Using I0 as the frame pointer
- MI.getOperand(FIOperandNum).ChangeToRegister(getFrameRegister(MF), false);
+ MI.getOperand(FIOperandNum).ChangeToRegister(getFrameLocalRegister(MF), false);
MI.getOperand(FIOperandNum + 1).ChangeToImmediate(Offset);
return false;
}
@@ -127,14 +127,18 @@ bool NVPTXRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
Register NVPTXRegisterInfo::getFrameRegister(const MachineFunction &MF) const {
const NVPTXTargetMachine &TM =
static_cast<const NVPTXTargetMachine &>(MF.getTarget());
- return TM.is64Bit() ? NVPTX::VRFrame64 : NVPTX::VRFrame32;
+ return TM.getPointerSize(NVPTXAS::ADDRESS_SPACE_LOCAL) == 8
+ ? NVPTX::VRFrame64
+ : NVPTX::VRFrame32;
}
Register
NVPTXRegisterInfo::getFrameLocalRegister(const MachineFunction &MF) const {
const NVPTXTargetMachine &TM =
static_cast<const NVPTXTargetMachine &>(MF.getTarget());
- return TM.is64Bit() ? NVPTX::VRFrameLocal64 : NVPTX::VRFrameLocal32;
+ return TM.getPointerSize(NVPTXAS::ADDRESS_SPACE_LOCAL) == 8
+ ? NVPTX::VRFrameLocal64
+ : NVPTX::VRFrameLocal32;
}
void NVPTXRegisterInfo::clearDebugRegisterMap() const {
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
index 0603994606d71..209dd51a44a61 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
@@ -371,6 +371,8 @@ void NVPTXPassConfig::addIRPasses() {
if (getOptLevel() != CodeGenOptLevel::None) {
addAddressSpaceInferencePasses();
addStraightLineScalarOptimizationPasses();
+ } else {
+ addPass(createNVPTXLowerAllocaPass());
}
addPass(createAtomicExpandLegacyPass());
@@ -502,3 +504,25 @@ void NVPTXPassConfig::addMachineSSAOptimization() {
addPass(&PeepholeOptimizerLegacyID);
printAndVerify("After codegen peephole optimization pass");
}
+
+bool NVPTXTargetMachine::isCompatibleDataLayout(
+ const DataLayout &Candidate) const {
+ //XXX: Should we enforce that the Candidate DataLayout has the same address space for allocas?
+ if (DL == Candidate)
+ return true;
+
+ auto DLStr = DL.getStringRepresentation();
+ if (!StringRef(DLStr).contains("A"))
+ DLStr = DLStr.empty() ? "A" + std::to_string(ADDRESS_SPACE_LOCAL)
+ : DLStr + "-A" + std::to_string(ADDRESS_SPACE_LOCAL);
+ auto NewDL = DataLayout(DLStr);
+
+ return NewDL == Candidate;
+}
+
+unsigned NVPTXTargetMachine::getAddressSpaceForPseudoSourceKind(unsigned Kind) const {
+ if (Kind == PseudoSourceValue::FixedStack) {
+ return ADDRESS_SPACE_LOCAL;
+ }
+ return CodeGenTargetMachineImpl::getAddressSpaceForPseudoSourceKind(Kind);
+}
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.h b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.h
index 118a01a0352f5..c2f09a89865c4 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.h
@@ -76,6 +76,9 @@ class NVPTXTargetMachine : public CodeGenTargetMachineImpl {
std::pair<const Value *, unsigned>
getPredicatedAddrSpace(const Value *V) const override;
+
+ bool isCompatibleDataLayout(const DataLayout &Candidate) const override;
+ unsigned getAddressSpaceForPseudoSourceKind(unsigned Kind) const override;
}; // NVPTXTargetMachine.
class NVPTXTargetMachine32 : public NVPTXTargetMachine {
diff --git a/llvm/test/CodeGen/NVPTX/call-with-alloca-buffer.ll b/llvm/test/CodeGen/NVPTX/call-with-alloca-buffer.ll
index 0eb7f6462f6fa..e825f8ef19949 100644
--- a/llvm/test/CodeGen/NVPTX/call-with-alloca-buffer.ll
+++ b/llvm/test/CodeGen/NVPTX/call-with-alloca-buffer.ll
@@ -25,9 +25,10 @@ entry:
; CHECK: ld.param.b64 %rd[[A_REG:[0-9]+]], [kernel_func_param_0]
; CHECK: cvta.to.global.u64 %rd[[A1_REG:[0-9]+]], %rd[[A_REG]]
-; CHECK: add.u64 %rd[[SP_REG:[0-9]+]], %SP, 0
+; CHECK: add.u64 %rd[[SP_REG0:[0-9]+]], %SPL, 0
+; CHECK: cvta.local.u64 %rd[[SP_REG:[0-9]+]], %rd[[SP_REG0]];
; CHECK: ld.global.b32 %r[[A0_REG:[0-9]+]], [%rd[[A1_REG]]]
-; CHECK: st.local.b32 [{{%rd[0-9]+}}], %r[[A0_REG]]
+; CHECK: st.local.b32 [%SPL], %r[[A0_REG]]
%0 = load float, ptr %a, align 4
store float %0, ptr %buf, align 4
diff --git a/llvm/test/CodeGen/NVPTX/dynamic-stackalloc-regression.ll b/llvm/test/CodeGen/NVPTX/dynamic-stackalloc-regression.ll
index 0474d82556c1e..34702f1c177c5 100644
--- a/llvm/test/CodeGen/NVPTX/dynamic-stackalloc-regression.ll
+++ b/llvm/test/CodeGen/NVPTX/dynamic-stackalloc-regression.ll
@@ -13,13 +13,13 @@ define void @foo(i64 %a, ptr %p0, ptr %p1) {
; CHECK-NEXT: add.s64 %rd2, %rd1, 7;
; CHECK-NEXT: and.b64 %rd3, %rd2, -8;
; CHECK-NEXT: alloca.u64 %rd4, %rd3, 16;
-; CHECK-NEXT: cvta.local.u64 %rd5, %rd4;
-; CHECK-NEXT: ld.param.b64 %rd6, [foo_param_1];
-; CHECK-NEXT: alloca.u64 %rd7, %rd3, 16;
-; CHECK-NEXT: cvta.local.u64 %rd8, %rd7;
-; CHECK-NEXT: ld.param.b64 %rd9, [foo_param_2];
-; CHECK-NEXT: st.b64 [%rd6], %rd5;
-; CHECK-NEXT: st.b64 [%rd9], %rd8;
+; CHECK-NEXT: ld.param.b64 %rd5, [foo_param_1];
+; CHECK-NEXT: cvta.local.u64 %rd6, %rd4;
+; CHECK-NEXT: ld.param.b64 %rd7, [foo_param_2];
+; CHECK-NEXT: alloca.u64 %rd8, %rd3, 16;
+; CHECK-NEXT: cvta.local.u64 %rd9, %rd8;
+; CHECK-NEXT: st.b64 [%rd5], %rd6;
+; CHECK-NEXT: st.b64 [%rd7], %rd9;
; CHECK-NEXT: ret;
%b = alloca i8, i64 %a, align 16
%c = alloca i8, i64 %a, align 16
diff --git a/llvm/test/CodeGen/NVPTX/f32x2-instructions.ll b/llvm/test/CodeGen/NVPTX/f32x2-instructions.ll
index 7ca16f702d8f3..cf00c2c9eaa3f 100644
--- a/llvm/test/CodeGen/NVPTX/f32x2-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/f32x2-instructions.ll
@@ -86,15 +86,14 @@ define float @test_extract_i(<2 x float> %a, i64 %idx) #0 {
; CHECK-NOF32X2-EMPTY:
; CHECK-NOF32X2-NEXT: // %bb.0:
; CHECK-NOF32X2-NEXT: mov.b64 %SPL, __local_depot3;
-; CHECK-NOF32X2-NEXT: cvta.local.u64 %SP, %SPL;
; CHECK-NOF32X2-NEXT: ld.param.v2.b32 {%r1, %r2}, [test_extract_i_param_0];
; CHECK-NOF32X2-NEXT: ld.param.b64 %rd1, [test_extract_i_param_1];
-; CHECK-NOF32X2-NEXT: st.v2.b32 [%SP], {%r1, %r2};
+; CHECK-NOF32X2-NEXT: st.local.v2.b32 [%SPL], {%r1, %r2};
; CHECK-NOF32X2-NEXT: and.b64 %rd2, %rd1, 1;
; CHECK-NOF32X2-NEXT: shl.b64 %rd3, %rd2, 2;
-; CHECK-NOF32X2-NEXT: add.u64 %rd4, %SP, 0;
+; CHECK-NOF32X2-NEXT: add.u64 %rd4, %SPL, 0;
; CHECK-NOF32X2-NEXT: or.b64 %rd5, %rd4, %rd3;
-; CHECK-NOF32X2-NEXT: ld.b32 %r3, [%rd5];
+; CHECK-NOF32X2-NEXT: ld.local.b32 %r3, [%rd5];
; CHECK-NOF32X2-NEXT: st.param.b32 [func_retval0], %r3;
; CHECK-NOF32X2-NEXT: ret;
;
diff --git a/llvm/test/CodeGen/NVPTX/indirect_byval.ll b/llvm/test/CodeGen/NVPTX/indirect_byval.ll
index 673fb73948268..39813efef9b64 100644
--- a/llvm/test/CodeGen/NVPTX/indirect_byval.ll
+++ b/llvm/test/CodeGen/NVPTX/indirect_byval.ll
@@ -21,19 +21,18 @@ define internal i32 @foo() {
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0: // %entry
; CHECK-NEXT: mov.b64 %SPL, __local_depot0;
-; CHECK-NEXT: cvta.local.u64 %SP, %SPL;
-; CHECK-NEXT: ld.global.b64 %rd1, [ptr];
+; CHECK-NEXT: add.u64 %rd1, %SPL, 0;
+; CHECK-NEXT: cvta.local.u64 %rd2, %rd1;
+; CHECK-NEXT: ld.global.b64 %rd3, [ptr];
; CHECK-NEXT: { // callseq 0, 0
; CHECK-NEXT: .param .align 1 .b8 param0[1];
; CHECK-NEXT: .param .b64 param1;
; CHECK-NEXT: .param .b32 retval0;
-; CHECK-NEXT: add.u64 %rd2, %SP, 0;
; CHECK-NEXT: st.param.b64 [param1], %rd2;
-; CHECK-NEXT: add.u64 %rd3, %SPL, 1;
-; CHECK-NEXT: ld.local.b8 %rs1, [%rd3];
+; CHECK-NEXT: ld.local.b8 %rs1, [%SPL+1];
; CHECK-NEXT: st.param.b8 [param0], %rs1;
; CHECK-NEXT: prototype_0 : .callprototype (.param .b32 _) _ (.param .align 1 .b8 _[1], .param .b64 _);
-; CHECK-NEXT: call (retval0), %rd1, (param0, param1), prototype_0;
+; CHECK-NEXT: call (retval0), %rd3, (param0, param1), prototype_0;
; CHECK-NEXT: ld.param.b32 %r1, [retval0];
; CHECK-NEXT: } // callseq 0
; CHECK-NEXT: st.param.b32 [func_retval0], %r1;
@@ -58,19 +57,18 @@ define internal i32 @bar() {
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0: // %entry
; CHECK-NEXT: mov.b64 %SPL, __local_depot1;
-; CHECK-NEXT: cvta.local.u64 %SP, %SPL;
-; CHECK-NEXT: ld.global.b64 %rd1, [ptr];
+; CHECK-NEXT: add.u64 %rd1, %SPL, 0;
+; CHECK-NEXT: cvta.local.u64 %rd2, %rd1;
+; CHECK-NEXT: ld.global.b64 %rd3, [ptr];
; CHECK-NEXT: { // callseq 1, 0
; CHECK-NEXT: .param .align 8 .b8 param0[8];
; CHECK-NEXT: .param ...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
AlexMaclean left a comment
Nice. I think this is a good direction to go. We have had to add a lot of hacks to keep things working with generic allocas in the IR.
clang/lib/CodeGen/BackendUtil.cpp (outdated):

    if (AsmHelper.TM)
      if (!AsmHelper.TM->isCompatibleDataLayout(M->getDataLayout()) ||
          !AsmHelper.TM->isCompatibleDataLayout(DataLayout(TDesc))) {
        std::string DLDesc = M->getDataLayout().getStringRepresentation();
Several tests were failing because of this: clang checks whether the module's layout is the same as the original.
I modified this to use isCompatibleDataLayout, but I'm not sure if this is a good solution.
Can you elaborate on the failures? What exactly were the DL strings that failed to match? Where did they come from? I think in general there should be no mismatch between clang and LLVM regarding what they consider to be the right data layout for the module. If they disagree, we need to figure out why and fix that.
For example:
/usr/bin/python3 /work/llvm-project/build/./bin/llvm-lit -vv /work/llvm-project/clang/test/CodeGenCUDA/bf16.cu
...
error: backend data layout 'e-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64-A5' does not match expected target description 'e-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64'
1 error generated when compiling for .
It's failing because the -A5 part (added by NVPTXLowerAlloca) was not part of the initial data layout.
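Concretely, the two strings from the error above differ only in the trailing alloca-address-space component that the pass's doInitialization appends:

    ; DL produced by clang for this target (no alloca AS specified)
    target datalayout = "e-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64"
    ; DL after NVPTXLowerAlloca appends "A5" (the local address space)
    target datalayout = "e-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64-A5"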
Fixed: the issue was that the datalayout in clang/lib/Basic/Targets/NVPTX.cpp was not updated.
The side effect of this is that multiple (very long) OpenMP tests had to be updated.
Another option is to completely update everything to use A5. This isn't totally new ground, since this seems to be what AMDGPU does, but it is a really big change and I'm not sure what all the implications would be.
Hypothetically, I'd expect to see cases where existing code will bail out when it sees a non-default AS, and cases where code that should bail out does not, because it does not bother to check whether the AS is non-default. Some of those will be caught by assertions, but some will happen to work and will remain silent. That's one of the reasons I'm somewhat reluctant about introducing AS on allocas early on -- the potential problem surface is pretty much all LLVM passes.
Unless there's a clear benefit for NVPTX back-end users that's worth the risk, and the extra ASCs going with the allocas through all the passes, I'd rather settle for a bit of extra complexity localized to the NVPTX back-end.
The stated intent of "produce simpler IR for allocas" alone just does not quite reach the bar of being worthwhile, IMO.
Allocas that remain in the PTX are always bad news for performance, so an extra address conversion instruction is usually lost in the noise and does not matter at all. I do not see much of a practical benefit even in the best case for this patch from the performance standpoint.
So far it looks like a wash to me. We may end up with potentially simpler/cleaner lowering for the allocas (with minimal/no benefit to actual performance), but pay for it with an increased risk of unintended side effects (moderate? it touches everything that touches allocas in all the passes), an incremental bump to the IR size to be processed by all the passes (granted, the impact of a few extra ASCs on compile time is probably not measurable in practice, but it's non-zero), and a fair amount of churn for the existing tests (not a showstopper, just a lot of mechanical changes).
Is it really worth doing/beneficial? What's the best case outcome we can expect from the patch?
Hypothetically, I'd expect to see cases where existing code will bail out when it sees a non-default AS, and cases where code that should bail out does not, because it does not bother to check whether the AS is non-default. Some of those will be caught by assertions, but some will happen to work and will remain silent.
I doubt there will be many places where existing LLVM passes are mishandling allocas in a specific AS. The fact that AMDGPU is already using a specific AS for allocas makes me think that the support for this feature is already reasonably good. There might be some edge cases somewhere which need to be fixed, but overall I think these hypothetical bugs should not be a major factor in considering which approach to choose.
Is it really worth doing/beneficial? What's the best case outcome we can expect from the patch?
I'd still lean towards switching to local allocas. This seems to me like it provides a more accurate representation of the machine in the LLVM IR. While the IR might be bigger when initially emitted from clang or unoptimized, once InferAddressSpace is run, it will be smaller with specific allocas since we'll no longer need to wrap every generic alloca in a cast to its true address space. We should probably consider moving InferAddressSpace earlier to eliminate the size issue, and this would have additional benefits such as improving subsequent alias-analysis compile time.
In general, this change seems to allow us to eliminate a lot of hacks and work-arounds from the backend, in some cases improving the quality of the emitted IR (I agree these quality improvements seem very marginal, but I think it's still a win, and there may be cases where they do make a difference). There are definitely some switching costs, and I'm not sure how best to handle the transition, but the final destination seems preferable even if it's not a game-changer.
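A hand-written IR sketch of that size argument (assuming InferAddressSpaces can prove all uses are local; this is not taken from the patch's tests):

    ; Generic alloca today: the wrapping casts survive in the IR
    %buf = alloca i32
    %l = addrspacecast ptr %buf to ptr addrspace(5)
    %g = addrspacecast ptr addrspace(5) %l to ptr
    store i32 0, ptr %g

    ; Local alloca with this patch, after InferAddressSpaces folds the cast
    %buf = alloca i32, addrspace(5)
    store i32 0, ptr addrspace(5) %buf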
AMDGPU is already using a specific AS for allocas makes me think that the support for this feature is already reasonably good. There might be some edge cases somewhere which need to be fixed, but overall I think these hypothetical bugs should not be a major factor in considering which approach to choose.
Fair enough. Agreed.
I'd still lean towards switching to local allocas. This seems to me like it provides a more accurate representation of the machine in the LLVM IR.
If we explicitly specify the AS for local variables, then we should be doing that explicitly for other variables as well. AS knowledge does not benefit generic LLVM passes and is required for NVPTX purposes only. I'm still not convinced that the proposed change buys us anything useful, other than the DL being nominally closer to what the back-end needs.
While the IR might be bigger when initially emitted from clang or unoptimized, once InferAddressSpace is run, it will be smaller with specific allocas since we'll no longer need to wrap every generic alloca in a cast to its true address space.
In that case, InferAddressSpace should also be able to convert existing allocas to the correct AS without changing the default AS. We can infer AS for the newly materialized allocas with a late pass of InferAddressSpace.
So, it looks like the usefulness of the patch boils down to what to do about the allocas materialized throughout the compilation pipeline. We have the following scenarios:
- a) Current implementation, before the patch: All allocas start in AS0 and are inferred to be local by a late InferAddressSpace pass.
- b) Proposed implementation: Makes the default AS for allocas local, allowing all allocas to be materialized with the correct AS. The DL change requires all users to update the IR they generate (we'll presumably auto-upgrade IR with the pre-patch DL), and an early run of InferAddressSpace will eliminate the redundant ASCs back to generic AS.
- c) Half-way proposal: Run InferAddressSpace early to give existing allocas the correct AS, and run another InferAddressSpace late in the pipeline to catch newly materialized generic allocas. It gives us some of the alias-analysis benefits of the approach above, but without the disruption of changing the DL. The effectiveness of this approach will be better than the status quo, but less than changing the default alloca AS to local. By "new", I mean allocas that are not the result of splitting/trimming an existing alloca, as in such cases I would assume the AS to be inherited from the old alloca, which would preserve the local AS. I do not have a good idea of what a typical ratio of pre-existing allocas vs. new compiler-materialized allocas is. If new allocas are rare, then the AA effectiveness will asymptotically approach that of (b).
My issues with (b) are mainly the invasiveness of the DL change, and the fact that, if we take the idea of setting the DL in a way that reflects the back-end AS where the data would live, then the same argument should apply to other data, not just local variables. It would benefit AA, but it does not look like something we should be forcing on the users. I think it's something that belongs under the compiler hood, IMO. No need to force users to do something the compiler is quite capable of doing itself.
Perhaps we can have our cake and eat it here.
The fact that the DataLayout allows us to specify the default alloca AS does not mean that we have to do it that way. In practice, we'll still need to allow user IR to use the default AS for allocas, and we will need to run InferAddressSpace at some point. I'm fine doing it early. It leaves us with the question of the AS for the new allocas. I wonder whether we could have a parallel hint for the default AS. If the DL specifies it, use the DL-specified one. If the DL says nothing, check with the target. We'll still need to run InferAddressSpace once more before lowering, but we should be able to reap most of the AA benefits with no churn for the user.
This would also allow using (b) for experiments via an explicit DL (and we can change the DL later if it proves to be the best way to handle it all), but it avoids disrupting the existing users, and it gives us flexibility in how we handle allocas under the hood while we're sorting it all out.
Does this make sense?
My issues with (b) are mainly the invasiveness of the DL change, and the fact that, if we take the idea of setting the DL in a way that reflects the back-end AS where the data would live, then the same argument should apply to other data, not just local variables.
I agree, but LLVM already has support for address spaces on allocas, whereas (afaik) for other kinds of variables (e.g., shared variables) there is currently no such support. For the former, we simply need to change the DL and do a very simple autoupgrade of the input IR, whereas for globals/shared/etc. the front-ends would have to emit different code (or we'd have to do a much more complicated autoupgrade?). But is this a good reason not to go ahead with setting the proper address space for allocas?
No need to force users to do something the compiler is quite capable of doing itself.
Maybe I'm misunderstanding something. But shouldn't autoupgrade take care of this? Users are welcome to specify the alloca address space, but they won't have to.
The fact that the DataLayout allows us to specify the default alloca AS does not mean that we have to do it that way. In practice, we'll still need to allow user IR to use the default AS for allocas, and we will need to run InferAddressSpace at some point.
Would there be any benefit in not autoupgrading allocas to the local address space? Especially if we run InferAddressSpace early, wouldn't we more or less end up in the same spot?
It leaves us with the question of the AS for the new allocas. I wonder whether we could have a parallel hint for the default AS. If the DL specifies it, use the DL-specified one. If the DL says nothing, check with the target.
One option would be to do this in IRBuilder::CreateAlloca and check what the target wants if the alloca AS is unspecified (i.e., is zero). But I have some concerns with this:
- If we want to support both DLs (with or without a specified alloca AS), then we need some way to construct two different layouts. Maybe this can be handled similarly to --nvptx-short-ptr.
- Would this not introduce a larger surface for bugs or divergent behavior? If the NVPTX target must work with both -A5 and -A0 DLs, then we'll have to identify, re-implement/fix, and test anything that relies on the DL's alloca AS.
- Maybe I'm misunderstanding the proposal, but the main idea is that we support two scenarios (see the sketch below):
  - If the DL specifies -A5, all allocas are expected to be in the local address space.
  - If the DL does not specify the alloca address space (or specifies -A0), we rely on InferAddressSpaces (potentially run early) to fix the existing allocas, and either we rely on another late InferAddressSpaces or, if possible, we modify CreateAlloca (or something similar) to consult the Target when creating allocas.

My question is how these two scenarios differ from the users' perspective. For example, clang will automatically set the correct DL to compile a .cu file, and for existing IR we can simply autoupgrade it. Am I missing something? 🤔
Your proposal is workable, I just do not see enough concrete benefits to justify the churn of changing the DL. As things stand, it mostly shuffles ASCs around in space and time, making a few things easier and a few things a bit more cumbersome. If there are some concrete examples of this change paving the way for something with measurable benefits, I would be delighted to revise my assessment.
That said, autoupgrade (can we autoupgrade the DL? I never tried.) is probably going to mitigate the hassle for the current users that use the current DL and generate AS0-based allocas, so it's not going to make things that much worse, either.
If it were a small local change, I'd just stamp it and move on. But for a moderately large patch touching a dozen files, changing the DL, and requiring autoupgrade, it's just not worth it, IMO.
So, I'm still not convinced that we need this change, but if there's plausible evidence that the change will be useful, or if someone else thinks that we want or need this change, I'll be OK with that.
To make it more concrete, let's start with the stated goal:
This change results in fewer address-space-change (cvta) instructions in the final PTX.
In all the tests in the patch I see only two instances of cvta.local.u64 eliminated, in llvm/test/CodeGen/NVPTX/lower-byval-args.ll, and it's hard to tell whether the patch just shifted stuff around sufficiently to allow LLVM to eliminate unnecessary writes (I think this particular problem had some notable recent improvements) or if it directly contributed to the better code. If you could add some tests where the patch clearly removes more cvta instructions from PTX, that would help.
efriedma-quic left a comment
Modifying the datalayout feels really strange, and I'm sort of worried it'll have unintended consequences, especially if there's only one target doing it.
Maybe you can just skip modifying the datalayout? The passes that run after createNVPTXLowerAllocaPass realistically shouldn't be creating allocas, and if one does slip through somehow you can fix it up in SelectionDAG.
is failing but I can't reproduce this locally; I'm trying to figure out how my config is different from the test bot's.
If you didn't dig it out of the logs already, you can see the test output here: https://github.com/llvm/llvm-project/actions/runs/17434074847?pr=154814 (no idea why it's failing, just a drive-by tip)
Thanks. It's failing because …
Strange. I had to rebase my local branch on top of main to trigger the same failure. Does this mean that the test bots rebase/merge before building and testing?
Yes. I think the intent was to catch earlier failures that would happen when it's merged, but I don't know how as a PR author you're supposed to realise that it got rebased.