
Conversation

@MacDue (Member) commented Jun 2, 2025

For a while we have supported the `-aarch64-stack-hazard-size=<size>` option, which adds "hazard padding" between GPRs and FPR/ZPRs. However, there is currently a hole in this mitigation, as PPR and FPR/ZPR accesses to the same area also cause streaming memory hazards (this is reported by `-pass-remarks-analysis=sme -aarch64-stack-hazard-remark-size=<val>`), and the current stack layout places PPRs and ZPRs within the same area.
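
As a rough sketch of how such hazards are surfaced (the remark flags are the ones named above; the triple and `-mattr` value are illustrative assumptions, not taken from this PR):

```
; Illustrative RUN line (hypothetical, not from this PR's tests):
; RUN: llc < %s -mtriple=aarch64 -mattr=+sve2 -pass-remarks-analysis=sme \
; RUN:   -aarch64-stack-hazard-remark-size=1024 -o /dev/null
```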

The current layout looks like this:

```
------------------------------------  Higher address
| callee-saved gpr registers        |
|-----------------------------------|
| lr,fp  (a.k.a. "frame record")    |
|-----------------------------------| <- fp(=x29)
|   <hazard padding>                |
|-----------------------------------|
| callee-saved fp/simd/SVE regs     |
|-----------------------------------|
|        SVE stack objects          |
|-----------------------------------|
| local variables of fixed size     |
|   <FPR>                           |
|   <hazard padding>                |
|   <GPR>                           |
------------------------------------| <- sp
                                    | Lower address
```

With this patch, the stack (and hazard padding) is rearranged so that hazard padding is placed between the PPRs and ZPRs rather than within the (fixed-size) callee-save region, which looks something like this:

```
------------------------------------  Higher address
| callee-saved gpr registers        |
|-----------------------------------|
| lr,fp  (a.k.a. "frame record")    |
|-----------------------------------| <- fp(=x29)
|        callee-saved PPRs          |
|        PPR stack objects          | (These are SVE predicates)
|-----------------------------------|
|   <hazard padding>                |
|-----------------------------------|
|       callee-saved ZPR regs       | (These are SVE vectors)
|        ZPR stack objects          | Note: FPRs are promoted to ZPRs
|-----------------------------------|
| local variables of fixed size     |
|   <FPR>                           |
|   <hazard padding>                |
|   <GPR>                           |
------------------------------------| <- sp
                                    | Lower address
```

This layout is only enabled if (see the sketch after this list):

  • SplitSVEObjects is enabled (`-aarch64-split-sve-objects`)
    • (This may be enabled by default in a later patch)
  • Streaming memory hazards are present
    • (`-aarch64-stack-hazard-size=<val>` != 0)
  • PPRs and FPRs/ZPRs are on the stack
  • There's no stack realignment or variable-sized objects
    • This is left as a TODO for now
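
As a rough sketch, a function with both a predicate and a vector stack object would qualify. This is a hypothetical test function; the RUN flags are modeled on the tests added by this PR (quoted later in this thread):

```
; RUN: llc < %s -mtriple=aarch64 -mattr=+sve2 -aarch64-stack-hazard-in-non-streaming \
; RUN:   -aarch64-split-sve-objects -aarch64-streaming-hazard-size=1024 | FileCheck %s

; One PPR object and one ZPR object, so both SVE areas are in use and the
; split layout (with hazard padding between the two areas) can apply.
define void @ppr_and_zpr_locals(<vscale x 16 x i1> %pred, <vscale x 16 x i8> %vec) {
  %ppr_local = alloca <vscale x 16 x i1>
  %zpr_local = alloca <vscale x 16 x i8>
  store volatile <vscale x 16 x i1> %pred, ptr %ppr_local
  store volatile <vscale x 16 x i8> %vec, ptr %zpr_local
  ret void
}
```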

Additionally, any FPR callee-saves that are present will be promoted to ZPRs. This prevents stack hazards between FPRs and GPRs in the fixed-size callee-save area (which would otherwise require more hazard padding, or moving the FPR callee-saves).
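
As a hypothetical illustration (the attribute and clobber below are assumptions chosen to force a d8 callee-save; assuming the split layout is active for this function, the save would be emitted as a z8 store into the ZPR area rather than a d8 store in the fixed-size area):

```
; "~{d8}" forces d8 to be callee-saved in this function; with
; SplitSVEObjects active this FPR callee-save is promoted to a ZPR (z8).
define void @clobber_d8() "aarch64_pstate_sm_compatible" {
  call void asm sideeffect "", "~{d8}"()
  ret void
}
```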

This layout should resolve the hole in the hazard padding mitigation, and is not intended to change codegen for non-SME code.

@MacDue marked this pull request as ready for review June 2, 2025 14:00
@llvmbot (Member) commented Jun 2, 2025
@llvm/pr-subscribers-backend-aarch64

Author: Benjamin Maxwell (MacDue)

Changes: (as in the PR description above)

Patch is 172.93 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/142392.diff

8 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64FrameLowering.cpp (+391-133)
  • (modified) llvm/lib/Target/AArch64/AArch64FrameLowering.h (+2-1)
  • (modified) llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h (+7)
  • (added) llvm/test/CodeGen/AArch64/framelayout-split-sve.mir (+526)
  • (modified) llvm/test/CodeGen/AArch64/spill-fill-zpr-predicates.mir (+7-10)
  • (added) llvm/test/CodeGen/AArch64/split-sve-stack-frame-layout.ll (+751)
  • (modified) llvm/test/CodeGen/AArch64/stack-hazard.ll (+542-326)
  • (modified) llvm/test/CodeGen/AArch64/sve-stack-frame-layout.ll (-2)
diff --git a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
index e5592a921e192..36775b8dc05e5 100644
--- a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
@@ -270,6 +270,11 @@ static cl::opt<bool> OrderFrameObjects("aarch64-order-frame-objects",
                                        cl::desc("sort stack allocations"),
                                        cl::init(true), cl::Hidden);
 
+static cl::opt<bool>
+    SplitSVEObjects("aarch64-split-sve-objects",
+                    cl::desc("Split allocation of ZPR & PPR objects"),
+                    cl::init(false), cl::Hidden);
+
 cl::opt<bool> EnableHomogeneousPrologEpilog(
     "homogeneous-prolog-epilog", cl::Hidden,
     cl::desc("Emit homogeneous prologue and epilogue for the size "
@@ -458,10 +463,13 @@ static StackOffset getZPRStackSize(const MachineFunction &MF) {
   return StackOffset::getScalable(AFI->getStackSizeZPR());
 }
 
-/// Returns the size of the entire PPR stackframe (calleesaves + spills).
+/// Returns the size of the entire PPR stackframe (calleesaves + spills + hazard
+/// padding).
 static StackOffset getPPRStackSize(const MachineFunction &MF) {
   const AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
-  return StackOffset::getScalable(AFI->getStackSizePPR());
+  return StackOffset::get(AFI->hasSplitSVEObjects() ? getStackHazardSize(MF)
+                                                    : 0,
+                          AFI->getStackSizePPR());
 }
 
 /// Returns the size of the entire SVE stackframe (PPRs + ZPRs).
@@ -484,6 +492,10 @@ bool AArch64FrameLowering::canUseRedZone(const MachineFunction &MF) const {
   if (!EnableRedZone)
     return false;
 
+  const AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
+  if (AFI->hasSplitSVEObjects())
+    return false;
+
   // Don't use the red zone if the function explicitly asks us not to.
   // This is typically used for kernel code.
   const AArch64Subtarget &Subtarget = MF.getSubtarget<AArch64Subtarget>();
@@ -493,7 +505,6 @@ bool AArch64FrameLowering::canUseRedZone(const MachineFunction &MF) const {
     return false;
 
   const MachineFrameInfo &MFI = MF.getFrameInfo();
-  const AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
   uint64_t NumBytes = AFI->getLocalStackSize();
 
   // If neither NEON or SVE are available, a COPY from one Q-reg to
@@ -666,9 +677,11 @@ void AArch64FrameLowering::emitCalleeSavedSVELocations(
   const TargetRegisterInfo &TRI = *STI.getRegisterInfo();
   AArch64FunctionInfo &AFI = *MF.getInfo<AArch64FunctionInfo>();
   CFIInstBuilder CFIBuilder(MBB, MBBI, MachineInstr::FrameSetup);
+  StackOffset PPRStackSize = getPPRStackSize(MF);
 
   for (const auto &Info : CSI) {
-    if (!MFI.isScalableStackID(Info.getFrameIdx()))
+    int FI = Info.getFrameIdx();
+    if (!MFI.isScalableStackID(FI))
       continue;
 
     // Not all unwinders may know about SVE registers, so assume the lowest
@@ -679,9 +692,13 @@ void AArch64FrameLowering::emitCalleeSavedSVELocations(
       continue;
 
     StackOffset Offset =
-        StackOffset::getScalable(MFI.getObjectOffset(Info.getFrameIdx())) -
+        StackOffset::getScalable(MFI.getObjectOffset(FI)) -
         StackOffset::getFixed(AFI.getCalleeSavedStackSize(MFI));
 
+    if (AFI.hasSplitSVEObjects() &&
+        MFI.getStackID(FI) == TargetStackID::ScalableVector)
+      Offset -= PPRStackSize;
+
     CFIBuilder.insertCFIInst(createCFAOffset(TRI, Reg, Offset));
   }
 }
@@ -2188,33 +2205,68 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
       StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes);
   StackOffset LocalsSize =
       PPRLocalsSize + ZPRLocalsSize + StackOffset::getFixed(NumBytes);
-  StackOffset SVECalleeSavesSize = PPRCalleeSavesSize + ZPRCalleeSavesSize;
-  MachineBasicBlock::iterator CalleeSavesBegin =
-      AFI->getPPRCalleeSavedStackSize() ? PPRCalleeSavesBegin
-                                        : ZPRCalleeSavesBegin;
-  allocateStackSpace(MBB, CalleeSavesBegin, 0, SVECalleeSavesSize, false,
-                     nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
-                     MFI.hasVarSizedObjects() || LocalsSize);
-  CFAOffset += SVECalleeSavesSize;
-
-  MachineBasicBlock::iterator CalleeSavesEnd =
-      AFI->getZPRCalleeSavedStackSize() ? ZPRCalleeSavesEnd : PPRCalleeSavesEnd;
-  if (EmitAsyncCFI)
-    emitCalleeSavedSVELocations(MBB, CalleeSavesEnd);
-
-  // Allocate space for the rest of the frame including SVE locals. Align the
-  // stack as necessary.
-  assert(!(canUseRedZone(MF) && NeedsRealignment) &&
-         "Cannot use redzone with stack realignment");
-  if (!canUseRedZone(MF)) {
-    // FIXME: in the case of dynamic re-alignment, NumBytes doesn't have
-    // the correct value here, as NumBytes also includes padding bytes,
-    // which shouldn't be counted here.
-    StackOffset SVELocalsSize = PPRLocalsSize + ZPRLocalsSize;
-    allocateStackSpace(MBB, CalleeSavesEnd, RealignmentPadding,
-                       SVELocalsSize + StackOffset::getFixed(NumBytes),
-                       NeedsWinCFI, &HasWinCFI, EmitAsyncCFI && !HasFP,
-                       CFAOffset, MFI.hasVarSizedObjects());
+  if (!AFI->hasSplitSVEObjects()) {
+    StackOffset SVECalleeSavesSize = PPRCalleeSavesSize + ZPRCalleeSavesSize;
+    MachineBasicBlock::iterator CalleeSavesBegin =
+        AFI->getPPRCalleeSavedStackSize() ? PPRCalleeSavesBegin
+                                          : ZPRCalleeSavesBegin;
+    allocateStackSpace(MBB, CalleeSavesBegin, 0, SVECalleeSavesSize, false,
+                       nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects() || LocalsSize);
+    CFAOffset += SVECalleeSavesSize;
+
+    MachineBasicBlock::iterator CalleeSavesEnd =
+        AFI->getZPRCalleeSavedStackSize() ? ZPRCalleeSavesEnd
+                                          : PPRCalleeSavesEnd;
+    if (EmitAsyncCFI)
+      emitCalleeSavedSVELocations(MBB, CalleeSavesEnd);
+
+    // Allocate space for the rest of the frame including SVE locals. Align the
+    // stack as necessary.
+    assert(!(canUseRedZone(MF) && NeedsRealignment) &&
+           "Cannot use redzone with stack realignment");
+    if (!canUseRedZone(MF)) {
+      // FIXME: in the case of dynamic re-alignment, NumBytes doesn't have
+      // the correct value here, as NumBytes also includes padding bytes,
+      // which shouldn't be counted here.
+      StackOffset SVELocalsSize = PPRLocalsSize + ZPRLocalsSize;
+      allocateStackSpace(MBB, CalleeSavesEnd, RealignmentPadding,
+                         SVELocalsSize + StackOffset::getFixed(NumBytes),
+                         NeedsWinCFI, &HasWinCFI, EmitAsyncCFI && !HasFP,
+                         CFAOffset, MFI.hasVarSizedObjects());
+    }
+  } else {
+    assert(!canUseRedZone(MF) &&
+           "Cannot use redzone with aarch64-split-sve-objects");
+    // TODO: Handle HasWinCFI/NeedsWinCFI?
+    assert(!NeedsWinCFI &&
+           "WinCFI with aarch64-split-sve-objects is not supported");
+
+    // Split ZPR and PPR allocation.
+    // Allocate PPR callee saves
+    allocateStackSpace(MBB, PPRCalleeSavesBegin, 0, PPRCalleeSavesSize, false,
+                       nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects() || ZPRCalleeSavesSize ||
+                           ZPRLocalsSize || PPRLocalsSize);
+    CFAOffset += PPRCalleeSavesSize;
+
+    // Allocate PPR locals + ZPR callee saves
+    assert(PPRCalleeSavesEnd == ZPRCalleeSavesBegin &&
+           "Expected ZPR callee saves after PPR locals");
+    allocateStackSpace(MBB, PPRCalleeSavesEnd, RealignmentPadding,
+                       PPRLocalsSize + ZPRCalleeSavesSize, false, nullptr,
+                       EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects() || ZPRLocalsSize);
+    CFAOffset += PPRLocalsSize + ZPRCalleeSavesSize;
+
+    // Allocate ZPR locals
+    allocateStackSpace(MBB, ZPRCalleeSavesEnd, RealignmentPadding,
+                       ZPRLocalsSize + StackOffset::getFixed(NumBytes), false,
+                       nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects());
+
+    if (EmitAsyncCFI)
+      emitCalleeSavedSVELocations(MBB, ZPRCalleeSavesEnd);
   }
 
   // If we need a base pointer, set it up here. It's whatever the value of the
@@ -2456,7 +2508,9 @@ void AArch64FrameLowering::emitEpilogue(MachineFunction &MF,
     }
   }
 
-  StackOffset SVEStackSize = getSVEStackSize(MF);
+  StackOffset ZPRStackSize = getZPRStackSize(MF);
+  StackOffset PPRStackSize = getPPRStackSize(MF);
+  StackOffset SVEStackSize = ZPRStackSize + PPRStackSize;
 
   // If there is a single SP update, insert it before the ret and we're done.
   if (CombineSPBump) {
@@ -2477,69 +2531,141 @@ void AArch64FrameLowering::emitEpilogue(MachineFunction &MF,
   NumBytes -= PrologueSaveSize;
   assert(NumBytes >= 0 && "Negative stack allocation size!?");
 
-  // Process the SVE callee-saves to determine what space needs to be
-  // deallocated.
-  StackOffset DeallocateBefore = {}, DeallocateAfter = SVEStackSize;
-  MachineBasicBlock::iterator RestoreBegin = LastPopI, RestoreEnd = LastPopI;
-  int64_t ZPRCalleeSavedSize = AFI->getZPRCalleeSavedStackSize();
-  int64_t PPRCalleeSavedSize = AFI->getPPRCalleeSavedStackSize();
-  int64_t SVECalleeSavedSize = ZPRCalleeSavedSize + PPRCalleeSavedSize;
-
-  if (SVECalleeSavedSize) {
-    RestoreBegin = std::prev(RestoreEnd);
-    while (RestoreBegin != MBB.begin() &&
-           IsSVECalleeSave(std::prev(RestoreBegin)))
-      --RestoreBegin;
-
-    assert(IsSVECalleeSave(RestoreBegin) &&
-           IsSVECalleeSave(std::prev(RestoreEnd)) && "Unexpected instruction");
-
-    StackOffset CalleeSavedSizeAsOffset =
-        StackOffset::getScalable(SVECalleeSavedSize);
-    DeallocateBefore = SVEStackSize - CalleeSavedSizeAsOffset;
-    DeallocateAfter = CalleeSavedSizeAsOffset;
-  }
-
-  // Deallocate the SVE area.
-  if (SVEStackSize) {
-    // If we have stack realignment or variable sized objects on the stack,
-    // restore the stack pointer from the frame pointer prior to SVE CSR
-    // restoration.
-    if (AFI->isStackRealigned() || MFI.hasVarSizedObjects()) {
-      if (SVECalleeSavedSize) {
-        // Set SP to start of SVE callee-save area from which they can
-        // be reloaded. The code below will deallocate the stack space
-        // space by moving FP -> SP.
-        emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::FP,
-                        StackOffset::getScalable(-SVECalleeSavedSize), TII,
-                        MachineInstr::FrameDestroy);
-      }
-    } else {
-      if (SVECalleeSavedSize) {
-        // Deallocate the non-SVE locals first before we can deallocate (and
-        // restore callee saves) from the SVE area.
-        emitFrameOffset(
-            MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
-            StackOffset::getFixed(NumBytes), TII, MachineInstr::FrameDestroy,
-            false, false, nullptr, EmitCFI && !hasFP(MF),
-            SVEStackSize + StackOffset::getFixed(NumBytes + PrologueSaveSize));
-        NumBytes = 0;
+  if (!AFI->hasSplitSVEObjects()) {
+    // Process the SVE callee-saves to determine what space needs to be
+    // deallocated.
+    StackOffset DeallocateBefore = {}, DeallocateAfter = SVEStackSize;
+    MachineBasicBlock::iterator RestoreBegin = LastPopI, RestoreEnd = LastPopI;
+    int64_t ZPRCalleeSavedSize = AFI->getZPRCalleeSavedStackSize();
+    int64_t PPRCalleeSavedSize = AFI->getPPRCalleeSavedStackSize();
+    int64_t SVECalleeSavedSize = ZPRCalleeSavedSize + PPRCalleeSavedSize;
+
+    if (SVECalleeSavedSize) {
+      RestoreBegin = std::prev(RestoreEnd);
+      while (RestoreBegin != MBB.begin() &&
+             IsSVECalleeSave(std::prev(RestoreBegin)))
+        --RestoreBegin;
+
+      assert(IsSVECalleeSave(RestoreBegin) &&
+             IsSVECalleeSave(std::prev(RestoreEnd)) &&
+             "Unexpected instruction");
+
+      StackOffset CalleeSavedSizeAsOffset =
+          StackOffset::getScalable(SVECalleeSavedSize);
+      DeallocateBefore = SVEStackSize - CalleeSavedSizeAsOffset;
+      DeallocateAfter = CalleeSavedSizeAsOffset;
+    }
+
+    // Deallocate the SVE area.
+    if (SVEStackSize) {
+      // If we have stack realignment or variable sized objects on the stack,
+      // restore the stack pointer from the frame pointer prior to SVE CSR
+      // restoration.
+      if (AFI->isStackRealigned() || MFI.hasVarSizedObjects()) {
+        if (SVECalleeSavedSize) {
+          // Set SP to start of SVE callee-save area from which they can
+          // be reloaded. The code below will deallocate the stack space
+          // space by moving FP -> SP.
+          emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::FP,
+                          StackOffset::getScalable(-SVECalleeSavedSize), TII,
+                          MachineInstr::FrameDestroy);
+        }
+      } else {
+        if (SVECalleeSavedSize) {
+          // Deallocate the non-SVE locals first before we can deallocate (and
+          // restore callee saves) from the SVE area.
+          emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
+                          StackOffset::getFixed(NumBytes), TII,
+                          MachineInstr::FrameDestroy, false, false, nullptr,
+                          EmitCFI && !hasFP(MF),
+                          SVEStackSize + StackOffset::getFixed(
+                                             NumBytes + PrologueSaveSize));
+          NumBytes = 0;
+        }
+
+        emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
+                        DeallocateBefore, TII, MachineInstr::FrameDestroy,
+                        false, false, nullptr, EmitCFI && !hasFP(MF),
+                        SVEStackSize +
+                            StackOffset::getFixed(NumBytes + PrologueSaveSize));
+
+        emitFrameOffset(MBB, RestoreEnd, DL, AArch64::SP, AArch64::SP,
+                        DeallocateAfter, TII, MachineInstr::FrameDestroy, false,
+                        false, nullptr, EmitCFI && !hasFP(MF),
+                        DeallocateAfter +
+                            StackOffset::getFixed(NumBytes + PrologueSaveSize));
       }
+      if (EmitCFI)
+        emitCalleeSavedSVERestores(MBB, RestoreEnd);
+    }
+  } else if (SVEStackSize) {
+    assert(!AFI->isStackRealigned() && !MFI.hasVarSizedObjects() &&
+           "TODO: Support stack realigment / variable-sized objects");
+    // SplitSVEObjects. Determine the sizes and starts/ends of the ZPR and PPR
+    // areas.
+    auto ZPRCalleeSavedSize =
+        StackOffset::getScalable(AFI->getZPRCalleeSavedStackSize());
+    auto PPRCalleeSavedSize =
+        StackOffset::getScalable(AFI->getPPRCalleeSavedStackSize());
+    StackOffset PPRLocalsSize = PPRStackSize - PPRCalleeSavedSize;
+    StackOffset ZPRLocalsSize = ZPRStackSize - ZPRCalleeSavedSize;
+
+    MachineBasicBlock::iterator PPRRestoreBegin = LastPopI,
+                                PPRRestoreEnd = LastPopI;
+    if (PPRCalleeSavedSize) {
+      PPRRestoreBegin = std::prev(PPRRestoreEnd);
+      while (PPRRestoreBegin != MBB.begin() &&
+             IsPPRCalleeSave(std::prev(PPRRestoreBegin)))
+        --PPRRestoreBegin;
+    }
+
+    MachineBasicBlock::iterator ZPRRestoreBegin = PPRRestoreBegin,
+                                ZPRRestoreEnd = PPRRestoreBegin;
+    if (ZPRCalleeSavedSize) {
+      ZPRRestoreBegin = std::prev(ZPRRestoreEnd);
+      while (ZPRRestoreBegin != MBB.begin() &&
+             IsZPRCalleeSave(std::prev(ZPRRestoreBegin)))
+        --ZPRRestoreBegin;
+    }
+
+    auto CFAOffset =
+        SVEStackSize + StackOffset::getFixed(NumBytes + PrologueSaveSize);
+    if (PPRCalleeSavedSize || ZPRCalleeSavedSize) {
+      // Deallocate the non-SVE locals first before we can deallocate (and
+      // restore callee saves) from the SVE area.
+      auto NonSVELocals = StackOffset::getFixed(NumBytes);
+      emitFrameOffset(MBB, ZPRRestoreBegin, DL, AArch64::SP, AArch64::SP,
+                      NonSVELocals, TII, MachineInstr::FrameDestroy, false,
+                      false, nullptr, EmitCFI && !hasFP(MF), CFAOffset);
+      NumBytes = 0;
+      CFAOffset -= NonSVELocals;
+    }
+
+    if (ZPRLocalsSize) {
+      emitFrameOffset(MBB, ZPRRestoreBegin, DL, AArch64::SP, AArch64::SP,
+                      ZPRLocalsSize, TII, MachineInstr::FrameDestroy, false,
+                      false, nullptr, EmitCFI && !hasFP(MF), CFAOffset);
+      CFAOffset -= ZPRLocalsSize;
+    }
 
-      emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
-                      DeallocateBefore, TII, MachineInstr::FrameDestroy, false,
-                      false, nullptr, EmitCFI && !hasFP(MF),
-                      SVEStackSize +
-                          StackOffset::getFixed(NumBytes + PrologueSaveSize));
-
-      emitFrameOffset(MBB, RestoreEnd, DL, AArch64::SP, AArch64::SP,
-                      DeallocateAfter, TII, MachineInstr::FrameDestroy, false,
-                      false, nullptr, EmitCFI && !hasFP(MF),
-                      DeallocateAfter +
-                          StackOffset::getFixed(NumBytes + PrologueSaveSize));
+    if (PPRLocalsSize || ZPRCalleeSavedSize) {
+      assert(PPRRestoreBegin == ZPRRestoreEnd &&
+             "Expected PPR restores after ZPR");
+      emitFrameOffset(MBB, PPRRestoreBegin, DL, AArch64::SP, AArch64::SP,
+                      PPRLocalsSize + ZPRCalleeSavedSize, TII,
+                      MachineInstr::FrameDestroy, false, false, nullptr,
+                      EmitCFI && !hasFP(MF), CFAOffset);
+      CFAOffset -= PPRLocalsSize + ZPRCalleeSavedSize;
     }
+    if (PPRCalleeSavedSize) {
+      emitFrameOffset(MBB, PPRRestoreEnd, DL, AArch64::SP, AArch64::SP,
+                      PPRCalleeSavedSize, TII, MachineInstr::FrameDestroy,
+                      false, false, nullptr, EmitCFI && !hasFP(MF), CFAOffset);
+    }
+
+    // We only emit CFI information for ZPRs so emit CFI after the ZPR restores.
     if (EmitCFI)
-      emitCalleeSavedSVERestores(MBB, RestoreEnd);
+      emitCalleeSavedSVERestores(MBB, ZPRRestoreEnd);
   }
 
   if (!hasFP(MF)) {
@@ -2660,9 +2786,15 @@ AArch64FrameLowering::getFrameIndexReferenceFromSP(const MachineFunction &MF,
     return StackOffset::getFixed(ObjectOffset - getOffsetOfLocalArea());
 
   const auto *AFI = MF.getInfo<AArch64FunctionInfo>();
-  if (MFI.isScalableStackID(FI))
-    return StackOffset::get(-((int64_t)AFI->getCalleeSavedStackSize()),
+  if (MFI.isScalableStackID(FI)) {
+    StackOffset AccessOffset{};
+    if (AFI->hasSplitSVEObjects() &&
+        MFI.getStackID(FI) == TargetStackID::ScalableVector)
+      AccessOffset = -PPRStackSize;
+    return AccessOffset +
+           StackOffset::get(-((int64_t)AFI->getCalleeSavedStackSize()),
                             ObjectOffset);
+  }
 
   bool IsFixed = MFI.isFixedObjectIndex(FI);
   bool IsCSR =
@@ -2720,12 +2852,12 @@ StackOffset AArch64FrameLowering::resolveFrameIndexReference(
   bool isFixed = MFI.isFixedObjectIndex(FI);
   bool isSVE = MFI.isScalableStackID(FI);
   return resolveFrameOffsetReference(MF, ObjectOffset, isFixed, isSVE, FrameReg,
-                                     PreferFP, ForSimm);
+                                     PreferFP, ForSimm, FI);
 }
 
 StackOffset AArch64FrameLowering::resolveFrameOffsetReference(
     const MachineFunction &MF, int64_t ObjectOffset, bool isFixed, bool isSVE,
-    Register &FrameReg, bool PreferFP, bool ForSimm) const {
+    Register &FrameReg, bool PreferFP, bool ForSimm, int64_t FI) const {
   const auto &MFI = MF.getFrameInfo();
   const auto *RegInfo = static_cast<const AArch64RegisterInfo *>(
       MF.getSubtarget().getRegisterInfo());
@@ -2737,7 +2869,9 @@ StackOffset AArch64FrameLowering::resolveFrameOffsetReference(
   bool isCSR =
       !isFixed && ObjectOffset >= -((int)AFI->getCalleeSavedStackSize(MFI));
 
-  const StackOffset SVEStackSize = getSVEStackSize(MF);
+  StackOffset ZPRStackSize = getZPRStackSize(MF);
+  StackOffset PPRStackSize = getPPRStackSize(MF);
+  StackOffset SVEStackSize = ZPRStackSize + PPRStackSize;
 
   // Use frame pointer to reference fixed object...
[truncated]

@efriedma-quic (Collaborator):
In the implementation you're interested in, is there not a hazard between PPRs and GPRs?

What's the interaction between this and aarch64-enable-zpr-predicate-spills?

@MacDue (Member, Author) commented Jun 3, 2025

> In the implementation you're interested in, is there not a hazard between PPRs and GPRs?

There are no hazards between PPRs and GPRs (those types of memory accesses can both be considered as occurring on the CPU).

> What's the interaction between this and `aarch64-enable-zpr-predicate-spills`?

It cannot be enabled at the same time. We intend to remove `aarch64-enable-zpr-predicate-spills` after this lands (as it was a stopgap solution).

@MacDue (Member, Author) commented Jun 3, 2025

Rebased this PR stack on the changes from #138609... which makes things even hairier 😅 It would be nice if all these modes were not so intertwined in the code.

I may look into whether there's some way to move some of the logic into some kind of SVEFrameLayout class with overridable hooks, to tidy some of this up.

@efriedma-quic (Collaborator) left a comment:

If you express the size of the hazard padding between the PPRs and ZPRs as a scalable size, that might simplify some of the logic? You wouldn't need to represent the two areas as separate stacks, at least.

Maybe it's also worth considering "preallocating" some stack slots, along the lines of LocalStackSlotAllocation; anything we can allocate in an earlier pass doesn't have to be part of the core stack layout, and we probably get better code if we have a "base" for ZPR variables.

Do we care at all how stack protectors fit into all of this? I suspect the stack protector ends up in an unfortunate position by default.

HasFPRStackObjects |= SlotTypes[FI] == SlotType::ZPRorFPR;
// For SplitSVEObjects remember that this stack slot is a predicate, this
// will be needed later when determining the frame layout.
if (SlotTypes[FI] == SlotType::PPR) {
Collaborator:
This doesn't look quite right. Anything with scalable size needs to be either ScalablePredVector or ScalableVector. Anything with non-scalable size needs to be on the normal stack, I think?

Member Author:
Which bit does not look right? SlotType::PPR (not to be confused with SlotType::GPR) is only set if the original stack ID was scalable and all accesses to that slot used predicate load/stores. The original stack ID could be ScalableVector, as the earlier selection of the stack ID is only based on the type size. Therefore, we change it to ScalablePredVector here so that it can be sorted into the correct region.

Collaborator:
Oh, I got confused by this bit:

        if (IsScalable || AArch64InstrInfo::isFpOrNEON(MI)) {
          SlotTypes[*FI] |= IsPPR ? SlotType::PPR : SlotType::ZPRorFPR;
        } else {
          SlotTypes[*FI] |= SlotType::GPR;
        }

Maybe rewrite it as the following, to split the cases a bit more cleanly?

if (IsScalable) {
  SlotTypes[*FI] |= isPPRAccess(MI) ? SlotType::PPR : SlotType::ZPRorFPR;
} else {
  SlotTypes[*FI] |= AArch64InstrInfo::isFpOrNEON(MI) ? SlotType::ZPRorFPR : SlotType::GPR;
}

Also, I'm not sure isFpOrNEON is quite sufficient here; we probably also want to handle SVE accesses to fixed-width slots.

Member Author:
> Maybe rewrite it as the following, to split the cases a bit more cleanly?

👍 I've tidied up this case.

> Also, I'm not sure isFpOrNEON is quite sufficient here; we probably also want to handle SVE accesses to fixed-width slots.

Hmmm, I'll look into this, these checks are mostly the same as the for the current hazard padding option.

Collaborator:
The code that emits the stores/loads should have set the stack ID to ScalablePredicateVector to start with, instead of to ScalableVector.

@MacDue (Member, Author) commented Jun 4, 2025

> If you express the size of the hazard padding between the PPRs and ZPRs as a scalable size, that might simplify some of the logic? You wouldn't need to represent the two areas as separate stacks, at least.

It would, but for the sizes of hazard padding and vscale we're interested in, it would result in a much larger allocation than necessary (e.g. to guarantee at least 1024 bytes of padding at vscale = 1, the padding would need to be 1024 scalable bytes, which grows to 16 KiB at vscale = 16), and it would likely complicate addressing predicates and vectors further (due to the limited ranges for scalable offsets).

> Maybe it's also worth considering "preallocating" some stack slots, along the lines of LocalStackSlotAllocation; anything we can allocate in an earlier pass doesn't have to be part of the core stack layout, and we probably get better code if we have a "base" for ZPR variables.

I'll take a look at LocalStackSlotAllocation; I've not worked on that pass before. From the name, though, I believe we'll still need to do something for the CSRs (which are in the same area). Having a base pointer for ZPRs is something we intend to add, as it would improve addressing (though I initially considered that as an extension for a later patch).

> Do we care at all how stack protectors fit into all of this? I suspect the stack protector ends up in an unfortunate position by default.

I don't think it's that important. Since the stack protector can end up in the scalable area, it will result in a hazard if enabled (with this change or the previous hazard padding patch), but I don't think it's likely to be enabled for library kernels.

@sdesmalen-arm (Collaborator) left a comment:

Mostly some minor nits and a request for some more tests, but otherwise it looks good to me.

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc < %s -mtriple=aarch64 -mattr=+sve2 -aarch64-stack-hazard-in-non-streaming -aarch64-split-sve-objects -aarch64-streaming-hazard-size=1024 | FileCheck %s
; RUN: llc < %s -mtriple=aarch64 -mattr=+sve2 -aarch64-stack-hazard-in-non-streaming -aarch64-split-sve-objects -aarch64-streaming-hazard-size=1024 -pass-remarks-analysis=stack-frame-layout 2>&1 >/dev/null | FileCheck %s --check-prefixes=CHECK-FRAMELAYOUT

Collaborator:
Can you also add a test with realignment (e.g. using an alloca with alignment > 16), and also a test with stack probing?

Member Author:
Added tests 👍

@sdesmalen-arm (Collaborator) left a comment:

Thanks @MacDue, LGTM!

@MacDue merged commit 8f67cdd into main Oct 2, 2025
7 of 8 checks passed
@MacDue deleted the users/MacDue/split_pprs branch October 2, 2025 18:05
MacDue added a commit that referenced this pull request Oct 3, 2025
…#161714)

This enables `aarch64-split-sve-objects` by default. Note: This option
only has an effect when used in conjunction with hazard padding
(`aarch64-stack-hazard-size` != 0).

See #142392 for more details.
MacDue added a commit to MacDue/llvm-project that referenced this pull request Oct 16, 2025
This was left out of the original patch (llvm#142392) to simplify the initial
implementation. However, after refactoring the SVE prologue/epilogue code
in llvm#162253, it's not much of an extension to support this case.

The main change here is when restoring the SP from the FP for the SVE
restores, we may need an additional frame offset to move from the start
of the ZPR callee-saves to the start of the PPR callee-saves.

This patch also fixes a previously latent bug where we'd add the
`RealignmentPadding` when allocating the PPR locals, then again for the
ZPR locals. This was unnecessary as the stack only needs to be realigned
after all SVE allocations.