Fix/aarch64 memset dup optimization #166030

osamakader · 2025-11-01T23:27:49Z

Fixes #165949

llvmbot · 2025-11-01T23:28:20Z

@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-aarch64

Author: Osama Abdelkader (osamakader)

Changes

Fixes #165949

Full diff: https://github.com/llvm/llvm-project/pull/166030.diff

3 Files Affected:

(modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp (+11)
(modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+15-10)
(modified) llvm/test/CodeGen/AArch64/memset-inline.ll (+55-31)

diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index 379242ec5a157..d052d1a8b3c2d 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -8543,6 +8543,17 @@ static SDValue getMemsetValue(SDValue Value, EVT VT, SelectionDAG &DAG,
   if (!IntVT.isInteger())
     IntVT = EVT::getIntegerVT(*DAG.getContext(), IntVT.getSizeInBits());
 
+  // For repeated-byte patterns, generate a vector splat instead of MUL to enable
+  // efficient lowering to DUP on targets like AArch64.
+  if (NumBits > 8 && VT.isInteger() && !VT.isVector() && 
+      (NumBits == 32 || NumBits == 64)) {
+    // Generate a vector of bytes: v4i8 for i32, v8i8 for i64
+    EVT ByteVecTy = EVT::getVectorVT(*DAG.getContext(), MVT::i8, NumBits / 8);
+    SDValue VecSplat = DAG.getSplatBuildVector(ByteVecTy, dl, Value);
+    // Bitcast back to the target integer type
+    return DAG.getNode(ISD::BITCAST, dl, IntVT, VecSplat);
+  }
+
   Value = DAG.getNode(ISD::ZERO_EXTEND, dl, IntVT, Value);
   if (NumBits > 8) {
     // Use a multiplication with 0x010101... to extend the input to the
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 60aa61e993b26..f9e5a706bd1de 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -18328,10 +18328,11 @@ EVT AArch64TargetLowering::getOptimalMemOpType(
   bool CanImplicitFloat = !FuncAttributes.hasFnAttr(Attribute::NoImplicitFloat);
   bool CanUseNEON = Subtarget->hasNEON() && CanImplicitFloat;
   bool CanUseFP = Subtarget->hasFPARMv8() && CanImplicitFloat;
-  // Only use AdvSIMD to implement memset of 32-byte and above. It would have
+  // For zero memset, only use AdvSIMD for 32-byte and above. It would have
   // taken one instruction to materialize the v2i64 zero and one store (with
   // restrictive addressing mode). Just do i64 stores.
-  bool IsSmallMemset = Op.isMemset() && Op.size() < 32;
+  // For non-zero memset, use NEON even for smaller sizes as dup is efficient.
+  bool IsSmallZeroMemset = Op.isMemset() && Op.size() < 32 && Op.isZeroMemset();
   auto AlignmentIsAcceptable = [&](EVT VT, Align AlignCheck) {
     if (Op.isAligned(AlignCheck))
       return true;
@@ -18341,10 +18342,11 @@ EVT AArch64TargetLowering::getOptimalMemOpType(
            Fast;
   };
 
-  if (CanUseNEON && Op.isMemset() && !IsSmallMemset &&
-      AlignmentIsAcceptable(MVT::v16i8, Align(16)))
+  // For non-zero memset, use NEON even for smaller sizes as dup + scalar store is efficient
+  if (CanUseNEON && Op.isMemset() && !IsSmallZeroMemset)
     return MVT::v16i8;
-  if (CanUseFP && !IsSmallMemset && AlignmentIsAcceptable(MVT::f128, Align(16)))
+  if (CanUseFP && !IsSmallZeroMemset &&
+      AlignmentIsAcceptable(MVT::f128, Align(16)))
     return MVT::f128;
   if (Op.size() >= 8 && AlignmentIsAcceptable(MVT::i64, Align(8)))
     return MVT::i64;
@@ -18358,10 +18360,11 @@ LLT AArch64TargetLowering::getOptimalMemOpLLT(
   bool CanImplicitFloat = !FuncAttributes.hasFnAttr(Attribute::NoImplicitFloat);
   bool CanUseNEON = Subtarget->hasNEON() && CanImplicitFloat;
   bool CanUseFP = Subtarget->hasFPARMv8() && CanImplicitFloat;
-  // Only use AdvSIMD to implement memset of 32-byte and above. It would have
+  // For zero memset, only use AdvSIMD for 32-byte and above. It would have
   // taken one instruction to materialize the v2i64 zero and one store (with
   // restrictive addressing mode). Just do i64 stores.
-  bool IsSmallMemset = Op.isMemset() && Op.size() < 32;
+  // For non-zero memset, use NEON even for smaller sizes as dup is efficient.
+  bool IsSmallZeroMemset = Op.isMemset() && Op.size() < 32 && Op.isZeroMemset();
   auto AlignmentIsAcceptable = [&](EVT VT, Align AlignCheck) {
     if (Op.isAligned(AlignCheck))
       return true;
@@ -18371,10 +18374,12 @@ LLT AArch64TargetLowering::getOptimalMemOpLLT(
            Fast;
   };
 
-  if (CanUseNEON && Op.isMemset() && !IsSmallMemset &&
-      AlignmentIsAcceptable(MVT::v2i64, Align(16)))
+  // For non-zero memset, use NEON for all sizes where it's beneficial.
+  // NEON dup + scalar store works for any alignment and is efficient.
+  if (CanUseNEON && Op.isMemset() && !IsSmallZeroMemset)
     return LLT::fixed_vector(2, 64);
-  if (CanUseFP && !IsSmallMemset && AlignmentIsAcceptable(MVT::f128, Align(16)))
+  if (CanUseFP && !IsSmallZeroMemset &&
+      AlignmentIsAcceptable(MVT::f128, Align(16)))
     return LLT::scalar(128);
   if (Op.size() >= 8 && AlignmentIsAcceptable(MVT::i64, Align(8)))
     return LLT::scalar(64);
diff --git a/llvm/test/CodeGen/AArch64/memset-inline.ll b/llvm/test/CodeGen/AArch64/memset-inline.ll
index 02d852b5ce45a..ed9a752dc1f8d 100644
--- a/llvm/test/CodeGen/AArch64/memset-inline.ll
+++ b/llvm/test/CodeGen/AArch64/memset-inline.ll
@@ -27,39 +27,57 @@ define void @memset_2(ptr %a, i8 %value) nounwind {
 }
 
 define void @memset_4(ptr %a, i8 %value) nounwind {
-; ALL-LABEL: memset_4:
-; ALL:       // %bb.0:
-; ALL-NEXT:    mov w8, #16843009
-; ALL-NEXT:    and w9, w1, #0xff
-; ALL-NEXT:    mul w8, w9, w8
-; ALL-NEXT:    str w8, [x0]
-; ALL-NEXT:    ret
+; GPR-LABEL: memset_4:
+; GPR:       // %bb.0:
+; GPR-NEXT:    mov w8, #16843009
+; GPR-NEXT:    and w9, w1, #0xff
+; GPR-NEXT:    mul w8, w9, w8
+; GPR-NEXT:    str w8, [x0]
+; GPR-NEXT:    ret
+;
+; NEON-LABEL: memset_4:
+; NEON:       // %bb.0:
+; NEON-NEXT:    dup v0.8b, w1
+; NEON-NEXT:    str s0, [x0]
+; NEON-NEXT:    ret
   tail call void @llvm.memset.inline.p0.i64(ptr %a, i8 %value, i64 4, i1 0)
   ret void
 }
 
 define void @memset_8(ptr %a, i8 %value) nounwind {
-; ALL-LABEL: memset_8:
-; ALL:       // %bb.0:
-; ALL-NEXT:    // kill: def $w1 killed $w1 def $x1
-; ALL-NEXT:    mov x8, #72340172838076673
-; ALL-NEXT:    and x9, x1, #0xff
-; ALL-NEXT:    mul x8, x9, x8
-; ALL-NEXT:    str x8, [x0]
-; ALL-NEXT:    ret
+; GPR-LABEL: memset_8:
+; GPR:       // %bb.0:
+; GPR-NEXT:    // kill: def $w1 killed $w1 def $x1
+; GPR-NEXT:    mov x8, #72340172838076673
+; GPR-NEXT:    and x9, x1, #0xff
+; GPR-NEXT:    mul x8, x9, x8
+; GPR-NEXT:    str x8, [x0]
+; GPR-NEXT:    ret
+;
+; NEON-LABEL: memset_8:
+; NEON:       // %bb.0:
+; NEON-NEXT:    dup v0.8b, w1
+; NEON-NEXT:    str d0, [x0]
+; NEON-NEXT:    ret
   tail call void @llvm.memset.inline.p0.i64(ptr %a, i8 %value, i64 8, i1 0)
   ret void
 }
 
 define void @memset_16(ptr %a, i8 %value) nounwind {
-; ALL-LABEL: memset_16:
-; ALL:       // %bb.0:
-; ALL-NEXT:    // kill: def $w1 killed $w1 def $x1
-; ALL-NEXT:    mov x8, #72340172838076673
-; ALL-NEXT:    and x9, x1, #0xff
-; ALL-NEXT:    mul x8, x9, x8
-; ALL-NEXT:    stp x8, x8, [x0]
-; ALL-NEXT:    ret
+; GPR-LABEL: memset_16:
+; GPR:       // %bb.0:
+; GPR-NEXT:    // kill: def $w1 killed $w1 def $x1
+; GPR-NEXT:    mov x8, #72340172838076673
+; GPR-NEXT:    and x9, x1, #0xff
+; GPR-NEXT:    mul x8, x9, x8
+; GPR-NEXT:    stp x8, x8, [x0]
+; GPR-NEXT:    ret
+;
+; NEON-LABEL: memset_16:
+; NEON:       // %bb.0:
+; NEON-NEXT:    dup v0.16b, w1
+; NEON-NEXT:    str q0, [x0]
+; NEON-NEXT:    ret
   tail call void @llvm.memset.inline.p0.i64(ptr %a, i8 %value, i64 16, i1 0)
   ret void
 }
@@ -110,14 +128,20 @@ define void @memset_64(ptr %a, i8 %value) nounwind {
 ; /////////////////////////////////////////////////////////////////////////////
 
 define void @aligned_memset_16(ptr align 16 %a, i8 %value) nounwind {
-; ALL-LABEL: aligned_memset_16:
-; ALL:       // %bb.0:
-; ALL-NEXT:    // kill: def $w1 killed $w1 def $x1
-; ALL-NEXT:    mov x8, #72340172838076673
-; ALL-NEXT:    and x9, x1, #0xff
-; ALL-NEXT:    mul x8, x9, x8
-; ALL-NEXT:    stp x8, x8, [x0]
-; ALL-NEXT:    ret
+; GPR-LABEL: aligned_memset_16:
+; GPR:       // %bb.0:
+; GPR-NEXT:    // kill: def $w1 killed $w1 def $x1
+; GPR-NEXT:    mov x8, #72340172838076673
+; GPR-NEXT:    and x9, x1, #0xff
+; GPR-NEXT:    mul x8, x9, x8
+; GPR-NEXT:    stp x8, x8, [x0]
+; GPR-NEXT:    ret
+;
+; NEON-LABEL: aligned_memset_16:
+; NEON:       // %bb.0:
+; NEON-NEXT:    dup v0.16b, w1
+; NEON-NEXT:    str q0, [x0]
+; NEON-NEXT:    ret
   tail call void @llvm.memset.inline.p0.i64(ptr align 16 %a, i8 %value, i64 16, i1 0)
   ret void
 }
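
Note on the check lines above: the decimal immediates in the GPR output are the byte-replication multipliers (16843009 is 0x01010101, and 72340172838076673 is 0x0101010101010101). The following is a minimal standalone sketch, not LLVM code, of why the zero-extend-and-multiply form and a byte splat bitcast to an integer should store the same bits. The helper names `splatByMul` and `splatByLanes` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The decimal immediates in the GPR check lines are these multipliers.
static_assert(UINT32_C(0x01010101) == 16843009, "i32 multiplier");
static_assert(UINT64_C(0x0101010101010101) == 72340172838076673,
              "i64 multiplier");

// Hypothetical helper: replicate a byte across 64 bits with the
// zero-extend + multiply pattern seen in the GPR check lines.
static uint64_t splatByMul(uint8_t Value) {
  return static_cast<uint64_t>(Value) * UINT64_C(0x0101010101010101);
}

// Hypothetical helper: replicate a byte by filling eight byte lanes and
// reinterpreting them as one integer, which models a v8i8 splat followed
// by a bitcast to i64. Every lane holds the same byte, so the result does
// not depend on endianness.
static uint64_t splatByLanes(uint8_t Value) {
  uint8_t Lanes[8];
  std::memset(Lanes, Value, sizeof(Lanes));
  uint64_t Result;
  std::memcpy(&Result, Lanes, sizeof(Result));
  return Result;
}
```

Both helpers agree for every byte value, which is the property the splat-plus-bitcast lowering relies on; the difference is only in which instructions the backend can select for each form.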

github-actions · 2025-11-01T23:29:28Z

✅ With the latest revision this PR passed the C/C++ code formatter.

This change improves memset code generation for non-zero values on AArch64 for sizes 4, 8, and 16 bytes by using NEON's DUP instruction instead of the less efficient multiplication with 0x01010101 pattern.

Changes:

1. In SelectionDAG.cpp: For AArch64 targets, generate vector splats for scalar i32/i64 memset operations, which are then efficiently lowered to DUP instructions.
2. In AArch64ISelLowering.cpp: Modify getOptimalMemOpType and getOptimalMemOpLLT to return v16i8 for non-zero memset operations of any size when NEON is available (previously only for sizes >= 32 bytes).
3. Update test expectations to verify the new DUP-based code generation for both NEON and GPR code paths.

The optimization is restricted to AArch64 only to avoid breaking RISCV and X86 tests.

Signed-off-by: Osama Abdelkader <[email protected]>
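
To make item 2 concrete, here is a simplified standalone model of the predicate change in the diff. It is a sketch under stated assumptions, not the LLVM implementation: `MemsetOp`, `prefersVectorBefore`, and `prefersVectorAfter` are hypothetical names, and the struct only stands in for the size and zero-value queries that the real code makes on `Op`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two properties the predicate reads:
// the memset size in bytes and whether the stored value is zero.
struct MemsetOp {
  uint64_t Size;
  bool IsZero;
};

// Models the removed lines: any memset under 32 bytes counts as small,
// and the vector type also requires the alignment check to pass.
static bool prefersVectorBefore(const MemsetOp &Op, bool CanUseNEON,
                                bool AlignmentOK) {
  bool IsSmallMemset = Op.Size < 32;
  return CanUseNEON && !IsSmallMemset && AlignmentOK;
}

// Models the added lines: only small zero memsets are excluded. In the
// diff as posted, the alignment check no longer appears in this branch.
static bool prefersVectorAfter(const MemsetOp &Op, bool CanUseNEON) {
  bool IsSmallZeroMemset = Op.Size < 32 && Op.IsZero;
  return CanUseNEON && !IsSmallZeroMemset;
}
```

Under this model a 16-byte non-zero memset moves from the scalar path to the vector path, while a 16-byte zero memset stays on the scalar path, which matches the intent described in the updated comments.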

llvmbot added the backend:AArch64 and llvm:SelectionDAG labels Nov 1, 2025

osamakader force-pushed the fix/aarch64-memset-dup-optimization branch 2 times, most recently from fe2092e to 4922ac4 November 1, 2025 23:42

RKSimon requested review from davemgreen and sdesmalen-arm November 2, 2025 10:43

osamakader force-pushed the fix/aarch64-memset-dup-optimization branch from 4922ac4 to 1b4c329 November 2, 2025 12:52

osamakader force-pushed the fix/aarch64-memset-dup-optimization branch from 1b4c329 to c9b595d November 2, 2025 21:19
