-
Notifications
You must be signed in to change notification settings - Fork 15.3k
[MachineOutliner] Preserve regmasks in calls to outlined functions #120940
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@llvm/pr-subscribers-backend-aarch64 Author: Zhaoxuan Jiang (nocchijiang) ChangesWhen emitting calls to an outlined function, the register masks from the outlined sequence are lost. The AArch64CollectLOH pass, which I plan to move to PreEmitPass2 (positioned after MachineOutliner), relies on accurate register masks. This patch ensures that regmasks are correctly preserved in the outlined calls, maintaining the required accuracy for subsequent passes. Full diff: https://github.com/llvm/llvm-project/pull/120940.diff 2 Files Affected:
diff --git a/llvm/lib/CodeGen/MachineOutliner.cpp b/llvm/lib/CodeGen/MachineOutliner.cpp
index 4c5489434c69bb..c01d9b04a89d93 100644
--- a/llvm/lib/CodeGen/MachineOutliner.cpp
+++ b/llvm/lib/CodeGen/MachineOutliner.cpp
@@ -1117,6 +1117,7 @@ bool MachineOutliner::outline(
// instruction. It also updates call site information for moved
// code.
SmallSet<Register, 2> UseRegs, DefRegs;
+ SmallPtrSet<const uint32_t *, 2> RegMasks;
// Copy over the defs in the outlined range.
// First inst in outlined range <-- Anything that's defined in this
// ... .. range has to be added as an
@@ -1130,6 +1131,12 @@ bool MachineOutliner::outline(
MachineInstr *MI = &*Iter;
SmallSet<Register, 2> InstrUseRegs;
for (MachineOperand &MOP : MI->operands()) {
+ // Collect all regmasks. Merge them in the end.
+ if (MOP.isRegMask()) {
+ RegMasks.insert(MOP.getRegMask());
+ continue;
+ }
+
// Skip over anything that isn't a register.
if (!MOP.isReg())
continue;
@@ -1153,6 +1160,24 @@ bool MachineOutliner::outline(
MI->getMF()->eraseCallSiteInfo(MI);
}
+ if (!RegMasks.empty()) {
+ if (RegMasks.size() == 1) {
+ CallInst->addOperand(
+ MachineOperand::CreateRegMask(*RegMasks.begin()));
+ } else {
+ uint32_t *RegMask = MF->allocateRegMask();
+ unsigned NumRegs =
+ MF->getSubtarget().getRegisterInfo()->getNumRegs();
+ unsigned Size = MachineOperand::getRegMaskSize(NumRegs);
+ memset(RegMask, UINT32_MAX, Size * sizeof(RegMask[0]));
+ for (const uint32_t *Mask : RegMasks) {
+ for (unsigned I = 0; I < NumRegs; ++I)
+ RegMask[I] &= Mask[I];
+ }
+ CallInst->addOperand(MachineOperand::CreateRegMask(RegMask));
+ }
+ }
+
for (const Register &I : DefRegs)
// If it's a def, add it to the call instruction.
CallInst->addOperand(
diff --git a/llvm/test/CodeGen/AArch64/machine-outliner-regmask.mir b/llvm/test/CodeGen/AArch64/machine-outliner-regmask.mir
new file mode 100644
index 00000000000000..047a73f81dd2ae
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/machine-outliner-regmask.mir
@@ -0,0 +1,94 @@
+# RUN: llc -mtriple=aarch64-apple-ios -run-pass=prologepilog -run-pass=machine-outliner %s -o - | FileCheck %s
+--- |
+ declare swiftcc void @bar()
+ declare void @baz(i32, i32, i32) #0
+
+ define void @test_same_regmask() #0 {
+ ret void
+ }
+ define void @test_different_regmasks() #0 {
+ ret void
+ }
+ define void @foo(i32, i32, i32, i32, i32, i32, i32, i32) #0 {
+ ret void
+ }
+
+...
+---
+name: foo
+tracksRegLiveness: true
+body: |
+ bb.0:
+ RET undef $lr
+
+
+...
+---
+name: test_same_regmask
+tracksRegLiveness: true
+body: |
+ ; CHECK-LABEL: name: test_same_regmask
+ ; CHECK: bb.1:
+ ; CHECK-NEXT: BL @OUTLINED_FUNCTION_1, csr_aarch64_aapcs
+
+ bb.0:
+ $sp = frame-setup SUBXri $sp, 16, 0
+
+ bb.1:
+ $w0 = MOVZWi 1, 0
+ $w1 = MOVZWi 2, 0
+ $w2 = MOVZWi 3, 0
+ BL @baz, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit killed $w0, implicit killed $w1, implicit killed $w2, implicit-def $sp
+ $w0 = MOVZWi 1, 0
+ $w1 = MOVZWi 2, 0
+ $w2 = MOVZWi 3, 0
+ BL @baz, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit killed $w0, implicit killed $w1, implicit killed $w2, implicit-def $sp
+ $sp = ADDXri $sp, 16, 0
+ RET undef $lr
+
+
+...
+---
+name: test_different_regmasks
+tracksRegLiveness: true
+body: |
+ ; CHECK-LABEL: name: test_different_regmasks
+ ; CHECK: bb.1:
+ ; CHECK-NEXT: BL @OUTLINED_FUNCTION_0, CustomRegMask($fp,$lr,$wzr,$wzr_hi,$xzr,$b8,$b9,$b10,$b11,$b12,$b13,$b14,$b15,$d8,$d9,$d10,$d11,$d12,$d13,$d14,$d15,$h8,$h9,$h10,$h11,$h12,$h13,$h14,$h15,$s8,$s9,$s10,$s11,$s12,$s13,$s14,$s15,$w19,$w20,$w22,$w23,$w24,$w25,$w26,$w27,$w28,$w29,$w30,$x19,$x20,$x22,$x23,$x24,$x25,$x26,$x27,$x28,$b8_hi,$b9_hi,$b10_hi,$b11_hi,$b12_hi,$b13_hi,$b14_hi,$b15_hi,$h8_hi,$h9_hi,$h10_hi,$h11_hi,$h12_hi,$h13_hi,$h14_hi,$h15_hi,$s8_hi,$s9_hi,$s10_hi,$s11_hi,$s12_hi,$s13_hi,$s14_hi,$s15_hi,$w19_hi,$w20_hi,$w22_hi,$w23_hi,$w24_hi,$w25_hi,$w26_hi,$w27_hi,$w28_hi,$w29_hi,$w30_hi,$d8_d9,$d9_d10,$d10_d11,$d11_d12,$d12_d13,$d13_d14,$d14_d15,$d8_d9_d10_d11,$d9_d10_d11_d12,$d10_d11_d12_d13,$d11_d12_d13_d14,$d12_d13_d14_d15,$d8_d9_d10,$d9_d10_d11,$d10_d11_d12,$d11_d12_d13,$d12_d13_d14,$d13_d14_d15,$x22_x23_x24_x25_x26_x27_x28_fp,$w22_w23,$w24_w25,$w26_w27,$w28_w29,$x28_fp,$x22_x23,$x24_x25,$x26_x27)
+
+ bb.0:
+ $sp = frame-setup SUBXri $sp, 16, 0
+
+ bb.1:
+ $w0 = MOVZWi 1, 0
+ $w1 = MOVZWi 2, 0
+ $w2 = MOVZWi 3, 0
+ $w3 = MOVZWi 4, 0
+ $w4 = MOVZWi 5, 0
+ $w5 = MOVZWi 6, 0
+ $w6 = MOVZWi 7, 0
+ $w7 = MOVZWi 8, 0
+ BL @foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit killed $w0, implicit killed $w1, implicit killed $w2, implicit killed $w3, implicit killed $w4, implicit killed $w5, implicit killed $w6, implicit killed $w7, implicit-def $sp
+ BL @bar, csr_aarch64_aapcs_swifterror, implicit-def dead $lr, implicit $sp, implicit-def $sp
+ $w0 = MOVZWi 1, 0
+ $w1 = MOVZWi 2, 0
+ $w2 = MOVZWi 3, 0
+ $w3 = MOVZWi 4, 0
+ $w4 = MOVZWi 5, 0
+ $w5 = MOVZWi 6, 0
+ $w6 = MOVZWi 7, 0
+ $w7 = MOVZWi 8, 0
+ BL @foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit killed $w0, implicit killed $w1, implicit killed $w2, implicit killed $w3, implicit killed $w4, implicit killed $w5, implicit killed $w6, implicit killed $w7, implicit-def $sp
+ BL @bar, csr_aarch64_aapcs_swifterror, implicit-def dead $lr, implicit $sp, implicit-def $sp
+ $w0 = MOVZWi 1, 0
+ $w1 = MOVZWi 2, 0
+ $w2 = MOVZWi 3, 0
+ $w3 = MOVZWi 4, 0
+ $w4 = MOVZWi 5, 0
+ $w5 = MOVZWi 6, 0
+ $w6 = MOVZWi 7, 0
+ $w7 = MOVZWi 8, 0
+ BL @foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit killed $w0, implicit killed $w1, implicit killed $w2, implicit killed $w3, implicit killed $w4, implicit killed $w5, implicit killed $w6, implicit killed $w7, implicit-def $sp
+ BL @bar, csr_aarch64_aapcs_swifterror, implicit-def dead $lr, implicit $sp, implicit-def $sp
+ $sp = ADDXri $sp, 16, 0
+ RET undef $lr
|
6702929 to
42ce8b7
Compare
| SmallSet<Register, 2> InstrUseRegs; | ||
| for (MachineOperand &MOP : MI->operands()) { | ||
| // Collect all regmasks. Merge them in the end. | ||
| if (MOP.isRegMask()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you attempt to optimize the case with 1 just for copying the pointer. But I'd simply merging reg mask in place here, without using the extra set.
Basically, you could declare RegMaskSize outside the loop, as it's a constant for the function.
uint32_t *RegMask = nullptr;
...
if (MOP.isRegMask()) {
const uint32_t *CurrRegMask = MOP.getRegMask();
if (!RegMask) {
RegMask = MF->allocateRegMask();
memcpy(RegMask, CurrRegMask, sizeof(RegMask[0]) * RegMaskSize);
} else {
for (unsigned I = 0; I < RegMaskSize; ++I)
RegMask[I] &= CurrRegMask[I];
}
continue;
}
...
if (RegMask)
CallInst->addOperand(MachineOperand::CreateRegMask(RegMask));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe 1-regmask is the most common case and the memory allocation in MF->allocateRegMask() should be avoided if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The size of RegMasks depends on the number of call instructions in the outlined candidate, and I believe this usually one. However, I think you might be forgetting that non-call instruction can clobber a register. My understanding of RegMask is that it tells you which registers could be changed after executing that instruction.
llvm-project/llvm/include/llvm/CodeGen/MachineOperand.h
Lines 921 to 927 in 76dba2e
| /// A RegMask operand represents a set of non-clobbered physical registers | |
| /// on an instruction that clobbers many registers, typically a call. The | |
| /// bit mask has a bit set for each physreg that is preserved by this | |
| /// instruction, as described in the documentation for | |
| /// TargetRegisterInfo::getCallPreservedMask(). | |
| /// | |
| /// Any physreg with a 0 bit in the mask is clobbered by the instruction. |
But an outlined function might have other instructions as well as call instructions. I think you need to mark MOP.isDef() registers as clobbered in the mask too (I believe this is what @MatzeB was alluding to).
In that case, we will likely need to copy the mask using allocateRegMask() anyway, so I think @kyulee-com suggestion might be worth considering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need to mark registers from def operands in the mask, I have a follow-up question: should we drop all the def operands in favor of using regmask, or should we add the masked registers to the def list as well? My understanding is that regmask and def operands serve different purposes - regmask is primarily used to describe calling conventions, while def operands provide additional information specific to the callee. Including all defs in the regmask seems redundant to me. Could anyone clarify the intended distinction between these two, or share best practices for handling this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. It seems like RegMask and Def/Use registers hold similar information. I think the RegMask is responsible only to tell you which registers are clobbered after an instruction (usually a call). Def/Use registers are required to know which registers to pass as inputs to a function, and which to use a return values.
There is similar code in FastISel.cpp where it uses both Def/Use (In/Out) registers and a RegMask.
llvm-project/llvm/lib/CodeGen/SelectionDAG/FastISel.cpp
Lines 858 to 880 in 9a0e1c7
| // Push the arguments from the call instruction. | |
| for (auto Reg : CLI.OutRegs) | |
| Ops.push_back(MachineOperand::CreateReg(Reg, /*isDef=*/false)); | |
| // Push live variables for the stack map. | |
| if (!addStackMapLiveVars(Ops, I, NumMetaOpers + NumArgs)) | |
| return false; | |
| // Push the register mask info. | |
| Ops.push_back(MachineOperand::CreateRegMask( | |
| TRI.getCallPreservedMask(*FuncInfo.MF, CC))); | |
| // Add scratch registers as implicit def and early clobber. | |
| const MCPhysReg *ScratchRegs = TLI.getScratchRegisters(CC); | |
| for (unsigned i = 0; ScratchRegs[i]; ++i) | |
| Ops.push_back(MachineOperand::CreateReg( | |
| ScratchRegs[i], /*isDef=*/true, /*isImp=*/true, /*isKill=*/false, | |
| /*isDead=*/false, /*isUndef=*/false, /*isEarlyClobber=*/true)); | |
| // Add implicit defs (return values). | |
| for (auto Reg : CLI.InRegs) | |
| Ops.push_back(MachineOperand::CreateReg(Reg, /*isDef=*/true, | |
| /*isImp=*/true)); |
kyulee-com
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. But wait for a few days to get reviews, if any, from others.
llvm/lib/CodeGen/MachineOutliner.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this is too low level of an abstraction to be using to construct a new regmask. Can you move this into MachineFunction or some other appropriate place to provide a cleaner API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines are copied from MachineFunction::allocateRegMask. Therefore, the most straightforward approach is to extend it while maintaining its behavior unchanged for existing users with default parameter values. If you have any better ideas, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also misleading. The 2nd parameter of memset is NOT an UINT32 but is an int and interpreted as a byte...
+1 on not modifying allocateRegMask in this way please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we create another API like the one below and use it to redefine this for backward compatibility?
MutableArrayRef<uint32_t> allocateRegMaskArray(uint8_t InitValue = 0);
Additionally, looking at the usage of allocateRegMask(), there are many cases where it is used for cloning a RegMask. I think creating another API, as shown below, could utilize the above allocateRegMaskArray() to simplify the existing code. However, this might be better for a separate PR.
uint32_t *cloneRegMask(const uint32_t *RegMask);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, it's a step in the right direction but I think we can do better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't change the existing allocatRegMask() function for this it doesn't need and shouldn't have it IMO! The use case here doesn't seem that special / widespread that it needs more abstractions IMO...
|
Not sure I have the full context here, but this feels like it is attempting to construct some clobber mask for an outlined call by merging all the register masks of the calls in the newly outlined function. That seems wrong to me as it completely ignores any other register usage (normal defs can clobber registers just as well as calls with clobbers can). So this seems incomplete/wrong to me as-is... |
|
I think it's necessary to pause and reach an agreement before proceeding with further changes. To answer @MatzeB 's question first: I apologize for not posting the link to the RFC earlier: https://discourse.llvm.org/t/loh-conflicting-with-machineoutliner/83279. Here’s a summary: I plan to move AArch64CollectLOH to addPreEmitPass2 to allow the outliner to optimize more. Historically AArch64CollectLOH has made certain assumptions about its input and has accepted some trade-offs for compile-time performance (some of which I believe is actually written by @MatzeB). These will be invalidated with the proposed change. What this PR addresses falls in the latter category: AArch64CollectLOH does not perform accurate register liveness analysis but only looks into regmasks and defs to determine if a given register is clobbered. MachineOutliner emits calls to outlined sequences, which probably contains calls to another function which clobbers caller-saved registers. This is not reflected in the defs but in the regmask. Therefore, the accuracy of regmask is key to accurate LOH emissions, which MachineOutliner lacks right now. Regarding concerns about the |
|
The effective set of Def and Use registers for the given range (the outlined instruction sequence) has already been computed by reverse traversing the instructions. Here, additionally it merges the reg masks that point to clobbering registers in call instructions. This ensures that all definitions are covered either in the Def registers or these combined reg masks. Conservatively, we could also set these bits for the Def registers in these masks, but I don't think it's necessary as we already have the Def registers mentioned above. @MatzeB Do you think the reg masks should have all Def register bits regardless? It's worth noting we need custom reg masks because the outlined function call does not follow a typical calling convention; instead, it's a naked call for the range of instructions without forming a frame. So, we should conservatively merge all clobbered register effects into the masks, which this PR seems to fix. Regarding my suggestion, |
a15ce9e to
7ed7fc4
Compare
|
Having been occupied with the CNY and other company-related matters, I have finally had the opportunity to work on this PR. Given that @MatzeB has not provided any further comments, I presume you have been familiar with the context. I have addressed all comments from the participating reviewers in the latest commit. |
When emitting calls to an outlined function, the register masks from the outlined sequence are lost. The AArch64CollectLOH pass, which I plan to move to PreEmitPass2 (positioned after MachineOutliner), relies on accurate register masks. This patch ensures that regmasks are correctly preserved in the outlined calls, maintaining the required accuracy for subsequent passes.
Co-authored-by: Ellis Hoag <[email protected]>
7ed7fc4 to
db2f85e
Compare
ellishg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As described in this comment: https://discourse.llvm.org/t/loh-conflicting-with-machineoutliner/83279/12, I don't think it's correct to move LOH after the MachineOutliner. Therefore this PR is not needed.
When emitting calls to an outlined function, the register masks from the outlined sequence are lost. The AArch64CollectLOH pass, which I plan to move to PreEmitPass2 (positioned after MachineOutliner), relies on accurate register masks. This patch ensures that regmasks are correctly preserved in the outlined calls, maintaining the required accuracy for subsequent passes.