Skip to content

Conversation

@nocchijiang
Copy link
Contributor

When emitting calls to an outlined function, the register masks from the outlined sequence are lost. The AArch64CollectLOH pass, which I plan to move to PreEmitPass2 (positioned after MachineOutliner), relies on accurate register masks. This patch ensures that regmasks are correctly preserved in the outlined calls, maintaining the required accuracy for subsequent passes.

@llvmbot
Copy link
Member

llvmbot commented Dec 23, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Zhaoxuan Jiang (nocchijiang)

Changes

When emitting calls to an outlined function, the register masks from the outlined sequence are lost. The AArch64CollectLOH pass, which I plan to move to PreEmitPass2 (positioned after MachineOutliner), relies on accurate register masks. This patch ensures that regmasks are correctly preserved in the outlined calls, maintaining the required accuracy for subsequent passes.


Full diff: https://github.com/llvm/llvm-project/pull/120940.diff

2 Files Affected:

  • (modified) llvm/lib/CodeGen/MachineOutliner.cpp (+25)
  • (added) llvm/test/CodeGen/AArch64/machine-outliner-regmask.mir (+94)
diff --git a/llvm/lib/CodeGen/MachineOutliner.cpp b/llvm/lib/CodeGen/MachineOutliner.cpp
index 4c5489434c69bb..c01d9b04a89d93 100644
--- a/llvm/lib/CodeGen/MachineOutliner.cpp
+++ b/llvm/lib/CodeGen/MachineOutliner.cpp
@@ -1117,6 +1117,7 @@ bool MachineOutliner::outline(
         // instruction. It also updates call site information for moved
         // code.
         SmallSet<Register, 2> UseRegs, DefRegs;
+        SmallPtrSet<const uint32_t *, 2> RegMasks;
         // Copy over the defs in the outlined range.
         // First inst in outlined range <-- Anything that's defined in this
         // ...                           .. range has to be added as an
@@ -1130,6 +1131,12 @@ bool MachineOutliner::outline(
           MachineInstr *MI = &*Iter;
           SmallSet<Register, 2> InstrUseRegs;
           for (MachineOperand &MOP : MI->operands()) {
+            // Collect all regmasks. Merge them in the end.
+            if (MOP.isRegMask()) {
+              RegMasks.insert(MOP.getRegMask());
+              continue;
+            }
+
             // Skip over anything that isn't a register.
             if (!MOP.isReg())
               continue;
@@ -1153,6 +1160,24 @@ bool MachineOutliner::outline(
             MI->getMF()->eraseCallSiteInfo(MI);
         }
 
+        if (!RegMasks.empty()) {
+          if (RegMasks.size() == 1) {
+            CallInst->addOperand(
+                MachineOperand::CreateRegMask(*RegMasks.begin()));
+          } else {
+            uint32_t *RegMask = MF->allocateRegMask();
+            unsigned NumRegs =
+                MF->getSubtarget().getRegisterInfo()->getNumRegs();
+            unsigned Size = MachineOperand::getRegMaskSize(NumRegs);
+            memset(RegMask, UINT32_MAX, Size * sizeof(RegMask[0]));
+            for (const uint32_t *Mask : RegMasks) {
+              for (unsigned I = 0; I < NumRegs; ++I)
+                RegMask[I] &= Mask[I];
+            }
+            CallInst->addOperand(MachineOperand::CreateRegMask(RegMask));
+          }
+        }
+
         for (const Register &I : DefRegs)
           // If it's a def, add it to the call instruction.
           CallInst->addOperand(
diff --git a/llvm/test/CodeGen/AArch64/machine-outliner-regmask.mir b/llvm/test/CodeGen/AArch64/machine-outliner-regmask.mir
new file mode 100644
index 00000000000000..047a73f81dd2ae
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/machine-outliner-regmask.mir
@@ -0,0 +1,94 @@
+# RUN: llc -mtriple=aarch64-apple-ios -run-pass=prologepilog -run-pass=machine-outliner %s -o - | FileCheck %s
+--- |
+  declare swiftcc void @bar()
+  declare void @baz(i32, i32, i32) #0
+
+  define void @test_same_regmask() #0 {
+    ret void
+  }
+  define void @test_different_regmasks() #0 {
+    ret void
+  }
+  define void @foo(i32, i32, i32, i32, i32, i32, i32, i32) #0 {
+    ret void
+  }
+
+...
+---
+name:            foo
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    RET undef $lr
+
+
+...
+---
+name:            test_same_regmask
+tracksRegLiveness: true
+body:             |
+  ; CHECK-LABEL: name: test_same_regmask
+  ; CHECK:       bb.1:
+  ; CHECK-NEXT:    BL @OUTLINED_FUNCTION_1, csr_aarch64_aapcs
+
+  bb.0:
+    $sp = frame-setup SUBXri $sp, 16, 0
+
+  bb.1:
+    $w0 = MOVZWi 1, 0
+    $w1 = MOVZWi 2, 0
+    $w2 = MOVZWi 3, 0
+    BL @baz, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit killed $w0, implicit killed $w1, implicit killed $w2, implicit-def $sp
+    $w0 = MOVZWi 1, 0
+    $w1 = MOVZWi 2, 0
+    $w2 = MOVZWi 3, 0
+    BL @baz, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit killed $w0, implicit killed $w1, implicit killed $w2, implicit-def $sp
+    $sp = ADDXri $sp, 16, 0
+    RET undef $lr
+
+
+...
+---
+name:            test_different_regmasks
+tracksRegLiveness: true
+body:             |
+  ; CHECK-LABEL: name: test_different_regmasks
+  ; CHECK:       bb.1:
+  ; CHECK-NEXT:    BL @OUTLINED_FUNCTION_0, CustomRegMask($fp,$lr,$wzr,$wzr_hi,$xzr,$b8,$b9,$b10,$b11,$b12,$b13,$b14,$b15,$d8,$d9,$d10,$d11,$d12,$d13,$d14,$d15,$h8,$h9,$h10,$h11,$h12,$h13,$h14,$h15,$s8,$s9,$s10,$s11,$s12,$s13,$s14,$s15,$w19,$w20,$w22,$w23,$w24,$w25,$w26,$w27,$w28,$w29,$w30,$x19,$x20,$x22,$x23,$x24,$x25,$x26,$x27,$x28,$b8_hi,$b9_hi,$b10_hi,$b11_hi,$b12_hi,$b13_hi,$b14_hi,$b15_hi,$h8_hi,$h9_hi,$h10_hi,$h11_hi,$h12_hi,$h13_hi,$h14_hi,$h15_hi,$s8_hi,$s9_hi,$s10_hi,$s11_hi,$s12_hi,$s13_hi,$s14_hi,$s15_hi,$w19_hi,$w20_hi,$w22_hi,$w23_hi,$w24_hi,$w25_hi,$w26_hi,$w27_hi,$w28_hi,$w29_hi,$w30_hi,$d8_d9,$d9_d10,$d10_d11,$d11_d12,$d12_d13,$d13_d14,$d14_d15,$d8_d9_d10_d11,$d9_d10_d11_d12,$d10_d11_d12_d13,$d11_d12_d13_d14,$d12_d13_d14_d15,$d8_d9_d10,$d9_d10_d11,$d10_d11_d12,$d11_d12_d13,$d12_d13_d14,$d13_d14_d15,$x22_x23_x24_x25_x26_x27_x28_fp,$w22_w23,$w24_w25,$w26_w27,$w28_w29,$x28_fp,$x22_x23,$x24_x25,$x26_x27)
+
+  bb.0:
+    $sp = frame-setup SUBXri $sp, 16, 0
+
+  bb.1:
+    $w0 = MOVZWi 1, 0
+    $w1 = MOVZWi 2, 0
+    $w2 = MOVZWi 3, 0
+    $w3 = MOVZWi 4, 0
+    $w4 = MOVZWi 5, 0
+    $w5 = MOVZWi 6, 0
+    $w6 = MOVZWi 7, 0
+    $w7 = MOVZWi 8, 0
+    BL @foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit killed $w0, implicit killed $w1, implicit killed $w2, implicit killed $w3, implicit killed $w4, implicit killed $w5, implicit killed $w6, implicit killed $w7, implicit-def $sp
+    BL @bar, csr_aarch64_aapcs_swifterror, implicit-def dead $lr, implicit $sp, implicit-def $sp
+    $w0 = MOVZWi 1, 0
+    $w1 = MOVZWi 2, 0
+    $w2 = MOVZWi 3, 0
+    $w3 = MOVZWi 4, 0
+    $w4 = MOVZWi 5, 0
+    $w5 = MOVZWi 6, 0
+    $w6 = MOVZWi 7, 0
+    $w7 = MOVZWi 8, 0
+    BL @foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit killed $w0, implicit killed $w1, implicit killed $w2, implicit killed $w3, implicit killed $w4, implicit killed $w5, implicit killed $w6, implicit killed $w7, implicit-def $sp
+    BL @bar, csr_aarch64_aapcs_swifterror, implicit-def dead $lr, implicit $sp, implicit-def $sp
+    $w0 = MOVZWi 1, 0
+    $w1 = MOVZWi 2, 0
+    $w2 = MOVZWi 3, 0
+    $w3 = MOVZWi 4, 0
+    $w4 = MOVZWi 5, 0
+    $w5 = MOVZWi 6, 0
+    $w6 = MOVZWi 7, 0
+    $w7 = MOVZWi 8, 0
+    BL @foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit killed $w0, implicit killed $w1, implicit killed $w2, implicit killed $w3, implicit killed $w4, implicit killed $w5, implicit killed $w6, implicit killed $w7, implicit-def $sp
+    BL @bar, csr_aarch64_aapcs_swifterror, implicit-def dead $lr, implicit $sp, implicit-def $sp
+    $sp = ADDXri $sp, 16, 0
+    RET undef $lr

@fhahn fhahn requested review from aemerson and ornata December 23, 2024 13:22
@ellishg ellishg requested a review from kyulee-com December 24, 2024 01:57
@nocchijiang nocchijiang force-pushed the outlined_function_call_regmask branch 2 times, most recently from 6702929 to 42ce8b7 Compare December 24, 2024 08:02
SmallSet<Register, 2> InstrUseRegs;
for (MachineOperand &MOP : MI->operands()) {
// Collect all regmasks. Merge them in the end.
if (MOP.isRegMask()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you attempt to optimize the case with 1 just for copying the pointer. But I'd simply merging reg mask in place here, without using the extra set.
Basically, you could declare RegMaskSize outside the loop, as it's a constant for the function.

uint32_t *RegMask = nullptr;
...
            if (MOP.isRegMask()) {
              const uint32_t *CurrRegMask = MOP.getRegMask();
              if (!RegMask) {
                RegMask = MF->allocateRegMask();
                memcpy(RegMask, CurrRegMask, sizeof(RegMask[0]) * RegMaskSize);
              } else {
                for (unsigned I = 0; I < RegMaskSize; ++I)
                  RegMask[I] &= CurrRegMask[I];
              }
              continue;
            }
...
            if (RegMask)
              CallInst->addOperand(MachineOperand::CreateRegMask(RegMask));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe 1-regmask is the most common case and the memory allocation in MF->allocateRegMask() should be avoided if possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size of RegMasks depends on the number of call instructions in the outlined candidate, and I believe this usually one. However, I think you might be forgetting that non-call instruction can clobber a register. My understanding of RegMask is that it tells you which registers could be changed after executing that instruction.

/// A RegMask operand represents a set of non-clobbered physical registers
/// on an instruction that clobbers many registers, typically a call. The
/// bit mask has a bit set for each physreg that is preserved by this
/// instruction, as described in the documentation for
/// TargetRegisterInfo::getCallPreservedMask().
///
/// Any physreg with a 0 bit in the mask is clobbered by the instruction.

But an outlined function might have other instructions as well as call instructions. I think you need to mark MOP.isDef() registers as clobbered in the mask too (I believe this is what @MatzeB was alluding to).

In that case, we will likely need to copy the mask using allocateRegMask() anyway, so I think @kyulee-com suggestion might be worth considering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to mark registers from def operands in the mask, I have a follow-up question: should we drop all the def operands in favor of using regmask, or should we add the masked registers to the def list as well? My understanding is that regmask and def operands serve different purposes - regmask is primarily used to describe calling conventions, while def operands provide additional information specific to the callee. Including all defs in the regmask seems redundant to me. Could anyone clarify the intended distinction between these two, or share best practices for handling this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. It seems like RegMask and Def/Use registers hold similar information. I think the RegMask is responsible only to tell you which registers are clobbered after an instruction (usually a call). Def/Use registers are required to know which registers to pass as inputs to a function, and which to use a return values.

There is similar code in FastISel.cpp where it uses both Def/Use (In/Out) registers and a RegMask.

// Push the arguments from the call instruction.
for (auto Reg : CLI.OutRegs)
Ops.push_back(MachineOperand::CreateReg(Reg, /*isDef=*/false));
// Push live variables for the stack map.
if (!addStackMapLiveVars(Ops, I, NumMetaOpers + NumArgs))
return false;
// Push the register mask info.
Ops.push_back(MachineOperand::CreateRegMask(
TRI.getCallPreservedMask(*FuncInfo.MF, CC)));
// Add scratch registers as implicit def and early clobber.
const MCPhysReg *ScratchRegs = TLI.getScratchRegisters(CC);
for (unsigned i = 0; ScratchRegs[i]; ++i)
Ops.push_back(MachineOperand::CreateReg(
ScratchRegs[i], /*isDef=*/true, /*isImp=*/true, /*isKill=*/false,
/*isDead=*/false, /*isUndef=*/false, /*isEarlyClobber=*/true));
// Add implicit defs (return values).
for (auto Reg : CLI.InRegs)
Ops.push_back(MachineOperand::CreateReg(Reg, /*isDef=*/true,
/*isImp=*/true));

Copy link
Contributor

@kyulee-com kyulee-com left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But wait for a few days to get reviews, if any, from others.

@kyulee-com
Copy link
Contributor

This seems a good fix. Any other comments? @aemerson @ornata @MatzeB

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is too low level of an abstraction to be using to construct a new regmask. Can you move this into MachineFunction or some other appropriate place to provide a cleaner API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are copied from MachineFunction::allocateRegMask. Therefore, the most straightforward approach is to extend it while maintaining its behavior unchanged for existing users with default parameter values. If you have any better ideas, please let me know.

Copy link
Contributor

@MatzeB MatzeB Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also misleading. The 2nd parameter of memset is NOT an UINT32 but is an int and interpreted as a byte...

+1 on not modifying allocateRegMask in this way please.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create another API like the one below and use it to redefine this for backward compatibility?

MutableArrayRef<uint32_t> allocateRegMaskArray(uint8_t InitValue = 0);

Additionally, looking at the usage of allocateRegMask(), there are many cases where it is used for cloning a RegMask. I think creating another API, as shown below, could utilize the above allocateRegMaskArray() to simplify the existing code. However, this might be better for a separate PR.

uint32_t *cloneRegMask(const uint32_t *RegMask);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's a step in the right direction but I think we can do better.

Copy link
Contributor

@MatzeB MatzeB Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't change the existing allocatRegMask() function for this it doesn't need and shouldn't have it IMO! The use case here doesn't seem that special / widespread that it needs more abstractions IMO...

@MatzeB
Copy link
Contributor

MatzeB commented Jan 2, 2025

Not sure I have the full context here, but this feels like it is attempting to construct some clobber mask for an outlined call by merging all the register masks of the calls in the newly outlined function. That seems wrong to me as it completely ignores any other register usage (normal defs can clobber registers just as well as calls with clobbers can). So this seems incomplete/wrong to me as-is...

@nocchijiang
Copy link
Contributor Author

I think it's necessary to pause and reach an agreement before proceeding with further changes.

To answer @MatzeB 's question first: I apologize for not posting the link to the RFC earlier: https://discourse.llvm.org/t/loh-conflicting-with-machineoutliner/83279. Here’s a summary: I plan to move AArch64CollectLOH to addPreEmitPass2 to allow the outliner to optimize more. Historically AArch64CollectLOH has made certain assumptions about its input and has accepted some trade-offs for compile-time performance (some of which I believe is actually written by @MatzeB). These will be invalidated with the proposed change.

What this PR addresses falls in the latter category: AArch64CollectLOH does not perform accurate register liveness analysis but only looks into regmasks and defs to determine if a given register is clobbered. MachineOutliner emits calls to outlined sequences, which probably contains calls to another function which clobbers caller-saved registers. This is not reflected in the defs but in the regmask. Therefore, the accuracy of regmask is key to accurate LOH emissions, which MachineOutliner lacks right now.

Regarding concerns about the allocateRegMask refactoring: regmask has been seriously lacking proper abstraction and encapsulation. A simple global search on allocateRegMask reveals a lot of instances of implementation detail leaks. We need to clearly define the aspects we want to improve in this PR; otherwise, it could become endless and extend far beyond its scope. MutableArrayRef<uint32_t> suggested by @kyulee-com indeed avoids explicit size reporting but offers no improvement in other aspects (initial value setting, cloning, manipulation, etc.).

@kyulee-com
Copy link
Contributor

kyulee-com commented Jan 3, 2025

The effective set of Def and Use registers for the given range (the outlined instruction sequence) has already been computed by reverse traversing the instructions. Here, additionally it merges the reg masks that point to clobbering registers in call instructions. This ensures that all definitions are covered either in the Def registers or these combined reg masks. Conservatively, we could also set these bits for the Def registers in these masks, but I don't think it's necessary as we already have the Def registers mentioned above. @MatzeB Do you think the reg masks should have all Def register bits regardless?

It's worth noting we need custom reg masks because the outlined function call does not follow a typical calling convention; instead, it's a naked call for the range of instructions without forming a frame. So, we should conservatively merge all clobbered register effects into the masks, which this PR seems to fix.

Regarding my suggestion, MutableArrayRef<uint32_t> allocateRegMaskArray(uint8_t InitValue = 0);, I didn't intend to modify the API of the existing uint32_t *allocateRegMask();. Instead, it can be implemented in a way that preserves the existing behavior while also providing the array size as needed in this context here. @MatzeB @aemerson What do you suggest here?

uint32_t *allocateRegMask() { return allocateRegMaskArray().data(); }

@nocchijiang nocchijiang force-pushed the outlined_function_call_regmask branch 3 times, most recently from a15ce9e to 7ed7fc4 Compare February 19, 2025 11:40
@nocchijiang
Copy link
Contributor Author

Having been occupied with the CNY and other company-related matters, I have finally had the opportunity to work on this PR. Given that @MatzeB has not provided any further comments, I presume you have been familiar with the context. I have addressed all comments from the participating reviewers in the latest commit.

@nocchijiang nocchijiang requested a review from aemerson February 19, 2025 11:43
nocchijiang and others added 4 commits May 12, 2025 15:34
When emitting calls to an outlined function, the register masks from the
outlined sequence are lost. The AArch64CollectLOH pass, which I plan to
move to PreEmitPass2 (positioned after MachineOutliner), relies on
accurate register masks. This patch ensures that regmasks are correctly
preserved in the outlined calls, maintaining the required accuracy for
subsequent passes.
Co-authored-by: Ellis Hoag <[email protected]>
@nocchijiang nocchijiang force-pushed the outlined_function_call_regmask branch from 7ed7fc4 to db2f85e Compare May 12, 2025 07:35
@nocchijiang
Copy link
Contributor Author

@aemerson @MatzeB ping

Copy link
Contributor

@ellishg ellishg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described in this comment: https://discourse.llvm.org/t/loh-conflicting-with-machineoutliner/83279/12, I don't think it's correct to move LOH after the MachineOutliner. Therefore this PR is not needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants